We will now go from matching kernels to matching kernels for image pyramids: having a multi-resolution pyramid of images and using that idea to develop matching kernels. The slides are once again borrowed from the lectures of Professor Avrithis at Inria Rennes. Descriptor matching, as we just saw in the earlier lecture, can be set up as follows: you have X_c and similarly Y_c for the features that belong to a particular visual word c in image X and image Y. A matching kernel for something like bag of words could then be given by a summation over the same cluster centroids, just counting the number of features belonging to each. You could also include a weighting factor for each of these summations if required. The more general form that we saw last time is

K(X, Y) = γ(X) γ(Y) Σ_{c ∈ C} w_c M(X_c, Y_c)

where w_c is a weight that we are introducing, which we can choose to use or not, and M(X_c, Y_c) is the matching function. ( 01:45) Now, we will talk about going beyond single-level matching to matching at the level of pyramids, and we will describe a seminal work in this context known as the pyramid match kernel. Pyramid matching is an efficient method that maps unordered feature sets (which is what each image is: an unordered set of features) into multi-resolution histograms, and then does matching using weighted multi-resolution histograms. A matched pair of features is counted in the finest resolution histogram cell where it first appears, and histograms keep getting merged as we go up the pyramid. The work also has a very nice interpretation: it can be shown to approximate similarity in a partial matching setting, where if only a partial set of features in one image matches a set of features in another image, the pyramid match kernel approximates the optimal partial matching between those two images.
For more details, I would also recommend you read the paper called "The Pyramid Match Kernel". It is written very well and explains these ideas in detail if you are interested in knowing more. Let us start by defining histogram intersection, because we are going to define histograms in both images, obviously at multiple levels, but let us first talk about how to match histograms in this context. If you had two histograms x and y of b bins each, we define the histogram intersection as

κ_HI(x, y) = Σ_{i=1}^{b} min(x_i, y_i)

So you take the first bin of one histogram and the first bin of the other, take the minimum value, then take the minimum value of the next pair of bins, and add them all up. That is what we define as the histogram intersection. Interestingly, you can show that this notion of histogram intersection, which we denote κ_HI, has a relation to the L1 distance. We are not going to prove it here, but will leave it as an exercise for you. The L1 distance between two vectors x and y can be written as

||x − y||_1 = ||x||_1 + ||y||_1 − 2 κ_HI(x, y)

Try it out for yourself: take a couple of examples of x and y and you will see that it holds in practice. Please try proving this also if you can; it is an interesting exercise to work out. Remember the L1 norm is the sum of absolute values of a vector. ( 05:16) Let us come back to the pyramid match kernel now. We said that the pyramid match kernel does a weighted sum of histogram intersections at different levels of two images, and that it approximates optimal pairwise matching. We will first talk about it conceptually and then walk through a concrete example of how it is done.
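As a quick check, here is a minimal sketch in Python (with hypothetical bin values) of the histogram intersection and of the identity relating it to the L1 distance:

```python
import numpy as np

def histogram_intersection(x, y):
    """kappa_HI(x, y): sum over bins of min(x_i, y_i)."""
    return np.minimum(x, y).sum()

# Two hypothetical histograms with b = 5 bins each.
x = np.array([4.0, 0.0, 2.0, 1.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 0.0, 5.0])

k_hi = histogram_intersection(x, y)   # 6.0
l1 = np.abs(x - y).sum()              # 8.0

# ||x - y||_1 = ||x||_1 + ||y||_1 - 2 * kappa_HI(x, y)
assert np.isclose(l1, x.sum() + y.sum() - 2 * k_hi)
```

The identity follows bin by bin from |a − b| = a + b − 2 min(a, b).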
So if you had these two images of the same object from different poses, different view angles, you extract key points in each image, and their descriptors lie in R^d. You have a similar set of features lying in R^d for the second image. You now divide that entire feature space into a grid, for instance, and count how many of the features in one image occur in each cell of that grid of the feature space; that is going to define a histogram. You match the histograms at that level. Then you collapse the grid, merging regions of R^d (the d-dimensional descriptor space, where d is the size of the descriptor corresponding to the feature), match at that level, and so on. The key intuition here is that you want to give a higher weight to matches at a finer level and a lower weight to matches at a coarser level, where the histogram bins may have been merged. We will give a concrete example and walk through this idea. ( 06:50) Let us consider now that you have an unordered set of features in image X, given by these blue points, and a similar unordered set of features in image Y, given by the red points. Remember, these are the descriptors of those features lying in R^d, and you are going to bin them into a very fine grid over that space. So it is possible that this blue point lies in this bin, that blue point in that bin, and so on. You are just binning the entire R^d region and placing each key point occurring in each image into one of those bins based on its descriptor values. For illustration, assume 1-D points for X and Y on a grid of side 1; this is the finest resolution. ( 07:51) So now we define histograms. At level zero, each bin of your R^d grid becomes one bin of the histogram.
This particular bin in R^d has one feature in image X, this one has one feature in image X and one feature in image Y, and so on; from these counts you can construct your histograms. Obviously, it is possible that you could have had one more feature of Y in the same bin, but at this first level we define the bins finely enough that there is only one feature in each bin; we will merge bins as we go up, to combine them in a more effective manner. Based on these histograms, when you try to match them, remember our histogram intersection takes the minimum of each pair of corresponding bins. So you are left with an intersection of one for this bin and one for that bin; all the other bins have a zero element in either X or Y, which means they contribute nothing. ( 09:09) So you have two matches now between images X and Y, and you weight them by a value of 1. Your total similarity score so far is 2 × 1, which is 2. ( 09:24) Now we merge the histogram bins. If you originally had, say, about 20 histogram bins, you merge every pair of contiguous bins and make them into 10 bins. Now you see it is possible that two features of image X belong to the same merged bin, and so on. So we construct what are known as level 1 histograms, where we count the number of features in each merged bin in image X and image Y. You see that there are two features in this bin in image X, and similarly two in that bin, while image Y still has just one feature in each bin. Based on that, you build the histogram for image X and the histogram for image Y, compute the intersection of the two, and find that there are four matches.
But you do not count all four matches; you count how many new matches are added by matching these histograms. We had two matches earlier and four matches now, so the new matches are two. You weight those new matches by 1/2. Why 1/2? Remember that a match at a coarser level is given less weight than a match at a finer level, because a finer-level match means a closer match. So you take these two new matches, weight them by 1/2, and your similarity score becomes 2 × 1 from the earlier slide plus 2 × (1/2), which totals 3. ( 11:19) Continuing the process, you now make your histogram bins just five in number, which means the number of features in each bin increases; you can now have three features in this bin in image X, and so on. Once again you get the histogram for X and the histogram for Y, and compute the intersection, which now gives the number of matches as 1 + 2 + 2 = 5. But you already had four matches at the previous level, so the number of new matches is just one. The similarity score is now given by 2 × 1 + 2 × (1/2) + 1 × (1/4), because you reduce the weight even further when you go to an even coarser level. So your total similarity score is 2 + 1 + 0.25 = 3.25. ( 12:29) Let us put this together. Given a set X consisting of n features, each in R^d, assume the distances between those elements range between 1 and D. This just helps us build the bins for constructing the histogram: once you know the maximum distance between elements, you can define your histogram bins accordingly.
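Putting the worked example in code: the intersections at levels 0, 1 and 2 were 2, 4 and 5 matches, and weighting the new matches at level i by 1/2^i gives the 3.25 we computed. A minimal sketch (the equivalent telescoped rearrangement, which comes up shortly, is included as a check):

```python
kappa = [2, 4, 5]      # kappa_HI at levels 0, 1, 2 from the example
L = len(kappa) - 1

# Weight the *new* matches at level i by 1/2**i (finest level first).
score = sum((kappa[i] - (kappa[i - 1] if i > 0 else 0)) / 2**i
            for i in range(L + 1))

# Equivalent telescoped rearrangement of the same sum.
telescoped = kappa[L] / 2**L + sum(kappa[i] / 2**(i + 1) for i in range(L))

print(score, telescoped)  # 3.25 3.25
```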
So, we define X_i as a histogram of X in R^d on a regular grid of side length 2^i. So we have a histogram at level 0, level 1, level 2, and so on. Technically speaking, we start i at −1, but at level −1 there are no matches; it is purely for mathematical convenience, as we will see in a moment. We keep building histogram levels up to L = log D, where remember D is the maximum distance between elements. Now, given two images with descriptor sets X and Y, we formally define the pyramid match as

K_Δ(X, Y) = γ(X) γ(Y) Σ_{i=0}^{L} (1/2^i) (κ_HI(X_i, Y_i) − κ_HI(X_{i−1}, Y_{i−1}))

At each level you count the number of new matches: the first term counts the matches at this level, the second term the matches at the previous level, so the difference refers to the number of new pairs matched. This summation of differences can also be written differently. If you expand it, you get a telescoping sum: at i = 0 you have (1/2^0)(κ_HI(X_0, Y_0) − κ_HI(X_{−1}, Y_{−1})), where the level −1 term is zero and is ignored. Then you have (1/2)(κ_HI(X_1, Y_1) − κ_HI(X_0, Y_0)), and so on; consecutive terms share a common κ_HI term which keeps getting telescoped. Putting them all together, the telescoping sum can be written as

K_Δ(X, Y) = γ(X) γ(Y) ( (1/2^L) κ_HI(X_L, Y_L) + Σ_{i=0}^{L−1} (1/2^{i+1}) κ_HI(X_i, Y_i) )

For example, take i = 1 and i = 2. At i = 1 you have (1/2)(κ(X_1, Y_1) − κ(X_0, Y_0)), and at i = 2 you have (1/4)(κ(X_2, Y_2) − κ(X_1, Y_1)). So a quarter of κ(X_1, Y_1) gets subtracted and you are left with (1/2 − 1/4) κ(X_1, Y_1) = (1/4) κ(X_1, Y_1); in general, each κ(X_i, Y_i) below the top level is left with coefficient 1/2^{i+1}. This is just a simplification of the telescoping sum in the above equation, and a mathematical representation of the example we saw over the last few slides. ( 16:32) Now, it can be shown that this K_Δ function we just defined is in fact a positive definite kernel. If you recall the discussion of kernels in support vector machines and machine learning, a positive definite kernel has benefits because it satisfies Mercer's theorem, and computational efficiency increases if your kernel satisfies this property. Let us see how that holds here. Recall that K_Δ is written as a weighted sum of κ_HI terms with non-negative coefficients. What are those non-negative coefficients? The 1/2^i (or, in the telescoped form, 1/2^{i+1}) factors. Either form of the equation above is simply such a weighted sum of κ_HI terms. We also know that each κ_HI, the histogram intersection, is simply a sum over bins of min terms. Now, min can be written as a dot product. How? If you had the number 3 and the number 5, I can write 3 as 1, 1, 1 for the first three entries followed by zeros, assuming values go up to 8. Similarly, for 5, I have 1 in the first five entries followed by three zeros.
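This unary encoding can be sketched in a few lines (assuming, as in the example, values bounded by 8):

```python
import numpy as np

def unary(n, length=8):
    """Encode n as n ones followed by zeros, e.g. 3 -> [1,1,1,0,0,0,0,0]."""
    v = np.zeros(length)
    v[:n] = 1.0
    return v

# The dot product of the unary codes of 3 and 5 recovers min(3, 5).
m = unary(3) @ unary(5)
assert m == min(3, 5)
```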
Now, the min of these two values, which is 3, is simply the dot product of these two binary vectors, which means I can write min as a dot product, and the rest falls out nicely: a sum of min terms can be written the same way, and a weighted sum of such κ_HI terms with non-negative coefficients can too, which means the entire K_Δ is a positive definite kernel. In case there are parts that are not clear to you, please go ahead and read the pyramid match kernel paper to get a better sense of this. One question remains. We just said that min can be written as a dot product by writing each number out in this enumerative, unary form. You extrapolated that to a sum of min terms, then to a sum of κ_HI terms with non-negative coefficients, and concluded K_Δ is positive definite. Then what would be the representation of the elements on which K_Δ is a positive definite kernel; what would be that embedding? For min, the embedding was writing each number out in this unary way. What is the corresponding embedding under which K_Δ becomes a positive definite kernel? To know what the embedding is, let us analyze this a bit more carefully. ( 19:57) If you had two images X and Y, for convenience let us assume that X has fewer features than image Y. Remember both of them are unordered sets of features. It could be the other way too; this is without loss of generality, and in that case everything would just be flipped. Otherwise you can assume that one is less than the other in terms of cardinality of features.
Let us define a function π that takes us from image X to image Y in such a way that π is one-to-one, which means for every feature in image X you find a distinct closest feature in image Y. In that case, the optimal pairwise matching is given by: you take a feature x from image X, find the corresponding feature π(x) in image Y, take the L1 distance between these two features, and find the function π from image X to image Y that maximizes the reciprocal of the total distance. Remember, the reciprocal of this distance gives you a sense of similarity, which is why you want the π that maximizes it. For those of you who are a bit more familiar with distance metrics, you would find that such a formulation is similar to what is known as the earth mover's distance, given by

min_π Σ_{x ∈ X} ||x − π(x)||_1

Remember that this is a distance metric, while the optimal pairwise matching formulation is a similarity measure, which is why you have a max there and a min here. Distance and similarity are complementary ideas: if one is high, the other should be low. It so happens that defining the embedding the way we did, in terms of grid locations and histograms and so on, and relating the intersections to the L1 norm, actually gives us the embedding. The details of this can be a bit mathematically involved; please see the paper called "Fast Image Retrieval via Embeddings".
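For very small sets, the optimal one-to-one matching described above can be brute-forced directly. A sketch with hypothetical 1-D descriptors, minimizing the earth mover's style L1 cost (the similarity view would maximize its reciprocal):

```python
import numpy as np
from itertools import permutations

# Hypothetical descriptor sets with |X| <= |Y|.
X = np.array([[0.0], [2.0]])
Y = np.array([[0.5], [1.5], [9.0]])

# Try every one-to-one mapping pi: X -> Y and keep the minimum total L1 cost.
best_cost = min(
    sum(np.abs(X[i] - Y[p[i]]).sum() for i in range(len(X)))
    for p in permutations(range(len(Y)), len(X))
)
print(best_cost)  # 1.0 (X[0] -> Y[0], X[1] -> Y[1])
```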
But the core idea that you want to take away from here is that the pyramid match kernel defines a positive definite kernel, which makes it efficient, because we know that a positive definite kernel satisfying Mercer's theorem has certain benefits in computations using the kernel trick, and that the embedding corresponding to the kernel can be related to the L1 distance between these feature sets; that particular paper describes this in more detail. And remember, once again, that the pyramid match kernel is a similarity measure, like any other kernel function, and it does not penalize clutter except through the normalization. By that we mean it is possible that many features congregate in a certain section of your entire R^d space, and this is not penalized; it would just increase the histogram intersection count in a particular bin. The only penalization you could have is the normalization factor that you may have in your kernel definition. ( 23:51) One could extend this: instead of dividing R^d into a uniform grid where you count how many features lie in each cell, you could cluster all your features and build the bins based on a vocabulary. Until now, in the method we discussed, the histograms did not need to be based on a vocabulary; they came from dividing the entire R^d space into bins and counting how many features occurred in each. But you could also consider clustering your key points into a vocabulary and then building your bins based on those cluster centers. This is simply an extension of the method we have so far, where we replace the regular grid with, say, hierarchical or non-hierarchical vocabulary cells.
Compare this to the vocabulary tree: at the beginning of the last lecture, we talked about how hierarchical k-means can be used in bag of words, and we said that one of the concerns there is that there is no principled way of giving weights to each level in the tree. In the pyramid match kernel we actually have a principled way, given by 1/2^i. Even here, though, the approximation quality can suffer in high dimensions, simply because of the curse of dimensionality and how distances get distorted in higher dimensions. ( 25:25) One could also extend the idea of the pyramid match kernel to a purely spatial matching approach. So far, we took all the features from different images and divided the entire R^d space of d-dimensional descriptors into grids, and then built our histograms there. But you could also build these histograms in image space. In this context, say you have an image such as this, where a person is performing a certain action. You could divide the image into four parts, into 16 parts, and so on, and given two different images you do matching based on histograms: how many points belong to this bin, how many points belong to the top-right bin, and so on. Clearly, in this approach you are only considering the coordinate locations of the features; you are not considering the descriptor, the appearance of the feature, at all. This approach could be used to match, say, a person's pose, or how different a person's position is with respect to an earlier position. It can be used, but it has its own limitations, because in this case the histograms live in the spatial image space, dividing the image into parts, rather than taking the descriptor of the key point and building the histogram in descriptor space.
So you are only considering the coordinates here, the geometry of the points in the image, rather than how each of those key points appears. ( 27:14) You could also combine these ideas to perform what is known as spatial pyramid matching, an extension of the pyramid match kernel. In this context, at level zero, very similar to the pyramid match kernel, you take a vocabulary, cluster all your features into visual words, and count how many points belong to each of these cluster centers, giving you histogram bins such as these. Then you divide the image into four parts and similarly get a histogram over the visual words for each segment: for the top-left segment you would once again get a histogram of three bins, for the bottom-right segment a histogram of three bins, and so on. The three bins come from the vocabulary-guided pyramid match kernel, where instead of dividing your descriptor space into uniform bins, you build cluster centers as in bag of words and count the number of features belonging to each visual word. You can divide the image even further, and you will get an even higher number of histogram bins corresponding to each location. In this case your kernel is a pyramid match kernel applied to each part of the image and summed over all parts: the pyramid match kernel still exists for each part of the image, and you repeat this over the different parts of the image. ( 28:51) So, one could look at this as a joint appearance-geometry histogram. The pyramid match kernel was a pure appearance histogram, because you were building the histograms in descriptor space.
We then saw an example of how pyramid matching can be brought to spatial matching, which gives a pure geometry histogram, and spatial pyramid matching brings the two together to create what are known as appearance-geometry histograms. These are robust to deformation, not completely invariant to transformations, but fairly robust, simply by the process you are defining, where you consider the appearance as well as where each feature occurs in the image, which was not there in the pyramid match kernel at all. So this can be used for global scene classification, where a different organization of objects should not distort your final result. ( 29:55) The last method we will talk about in this lecture is Hough pyramid matching, which is clearly an extension of Hough voting, if you recall. In this method, the idea is: remember that in typical pyramid matching, you take a set of features and match them to features from another image, and you can do this in a fast manner using pyramids, as in the discussions in earlier lectures, where you first match at a coarse level and then do finer matching at deeper levels of the pyramid. ( 30:35) So you have a bunch of correspondences which you get from matching at the level of key points, and what we are going to do now is work with these correspondences instead of two sets of unordered features. Remember the central idea of Hough voting: each of your correspondences votes for a particular transformation hypothesis, based say on the rotation angle, scale, or translation, and we are now going to build histograms in that transformation space. ( 31:21) Let us see an example here.
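The spatial pyramid matching computation described above can be sketched as follows. This is a simplified, hypothetical implementation: point coordinates are assumed normalized to [0, 1), visual-word ids are assumed precomputed, and the standard pyramid weights (coarsest level 1/2^L, level l > 0 weighted 1/2^(L−l+1)) are used without any extra normalization:

```python
import numpy as np

def spm_histograms(points, words, n_words, L):
    """Concatenated per-cell visual-word histograms at levels 0..L."""
    hists = []
    for l in range(L + 1):
        g = 2 ** l                          # g x g spatial grid at level l
        h = np.zeros((g, g, n_words))
        for (px, py), w in zip(points, words):
            h[int(px * g), int(py * g), w] += 1
        hists.append(h.ravel())
    return hists

def spm_kernel(pts_x, w_x, pts_y, w_y, n_words, L=1):
    hx = spm_histograms(pts_x, w_x, n_words, L)
    hy = spm_histograms(pts_y, w_y, n_words, L)
    inter = [np.minimum(a, b).sum() for a, b in zip(hx, hy)]
    # Finer spatial grids get higher weight, as in pyramid matching.
    return inter[0] / 2**L + sum(inter[l] / 2**(L - l + 1)
                                 for l in range(1, L + 1))

pts = [(0.1, 0.1), (0.6, 0.6)]   # hypothetical feature locations
words = [0, 1]                   # hypothetical visual-word assignments
score = spm_kernel(pts, words, pts, words, n_words=2, L=1)
print(score)  # 2.0 (an image matched against itself)
```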
Assume a local feature p in image P has a certain scale, orientation and position, given by the transformation matrix

F(p) = [[s(p) R(p), t(p)], [0ᵀ, 1]]

Remember that this transformation matrix is just a concise way of writing what we saw earlier,

[[r cos θ, −r sin θ, t_x], [r sin θ, r cos θ, t_y], [0, 0, 1]]

which constitutes an affine (here, similarity) transformation, where r is the scale, θ the orientation, and t_x, t_y the position. The two zeros correspond to the zero row vector, the 1 is there for mathematical convenience, s(p) R(p) corresponds to the scale and orientation of point p, written as a 2 × 2 matrix, and the vector t(p) corresponds to the position of that particular point in the image. ( 32:41) Given that this is how a local feature is represented, a correspondence between a pair of features p ∈ P and q ∈ Q can be given by

F(c) = F(q) F(p)⁻¹ = [[s(c) R(c), t(c)], [0ᵀ, 1]]

Remember, F(p) is point p's representation and F(q) is point q's representation in image Q. Once again this boils down to a rotation-and-scale block, a translation t_x, t_y, and the row (0, 0, 1). Now, t_x, t_y here are not just coordinates; they are how much you moved from the coordinates of point p in image P to the coordinates of point q in image Q. Similarly, the scale and rotation tell you what the transformation is: how much you rotated and how much you zoomed in or out to go from image P to image Q. We are not going to go deeper into this, but just to complete the discussion, this t(c) can be written as t(q), the coordinate location of q, minus s(c) R(c) t(p). Why is that so?
t(q) is the position of q in image Q, t(p) is the position of point p in image P, and s(c) R(c) says how you rotate and zoom p to get to a point in image Q; the difference between those two locations is the actual translation t(c). Similarly, the relative zoom is the scale in q divided by the scale in p, s(c) = s(q)/s(p), and the relative rotation is R(c) = R(q) R(p)⁻¹, or in terms of angles, θ(c) = θ(q) − θ(p), the orientation of q in image Q minus the orientation of p in image P. This is how a correspondence is represented. ( 34:52) Now let us come back to Hough pyramid matching. The transformation can be given by a 4-D vector with t_x and t_y from t(c), the scaling factor s(c), and the rotation angle θ(c). We define one more thing before going into Hough pyramid matching: if you have two correspondences (p, q) and (p′, q′), we say these two correspondences are conflicting if p = p′ or q = q′, that is, if two points from image P match the same point in image Q, or one point from image P matches two points in image Q. You will see how this is used on the next slide. ( 35:43) So let us see how Hough pyramid matching actually works. You have a set of correspondences now, laid out in your 4-D space; remember each correspondence has t_x, t_y, s and θ. You should now be able to draw the similarity to the pyramid match kernel, because all the pyramid matching is going to be done in this 4-D transformation space, and that is why we call it Hough pyramid matching. Each correspondence c is weighted by some w(c), based say on its visual word; you can choose to use this, or use a uniform weight if you prefer.
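The decomposition F(c) = F(q) F(p)⁻¹ can be checked numerically; a short sketch with hypothetical scales, angles and positions:

```python
import numpy as np

def F(s, theta, t):
    """Similarity transform [[s R(theta), t], [0 0 1]]."""
    c, si = np.cos(theta), np.sin(theta)
    return np.array([[s * c, -s * si, t[0]],
                     [s * si,  s * c, t[1]],
                     [0.0,     0.0,   1.0]])

Fp = F(2.0, np.pi / 6, [10.0, 20.0])   # feature p in image P
Fq = F(4.0, np.pi / 3, [50.0, 40.0])   # feature q in image Q

Fc = Fq @ np.linalg.inv(Fp)            # the correspondence c = (p, q)

s_c = np.hypot(Fc[0, 0], Fc[1, 0])        # s(c) = s(q) / s(p)
theta_c = np.arctan2(Fc[1, 0], Fc[0, 0])  # theta(c) = theta(q) - theta(p)
t_c = Fc[:2, 2]

assert np.isclose(s_c, 4.0 / 2.0)
assert np.isclose(theta_c, np.pi / 3 - np.pi / 6)

# t(c) = t(q) - s(c) R(c) t(p)
Rc = np.array([[np.cos(theta_c), -np.sin(theta_c)],
               [np.sin(theta_c),  np.cos(theta_c)]])
assert np.allclose(t_c, np.array([50.0, 40.0]) - s_c * Rc @ np.array([10.0, 20.0]))
```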
( 36:28) Then at level zero, which is the first level of matching, where you have very granular bins, if there are conflicting correspondences in the same bin, you erase them. For example, you see that c7 and c8 have two different points from image P matching the same point in image Q, so you remove one of them: c7 is removed in this case and you retain only c8. ( 36:56) Now consider each of the bins in this pyramid; remember the binning is in the transformation space, that 4-D space of translation, scale and rotation. In each bin b with, say, n_b correspondences (for example, this bin has two correspondences, and that bin has three), each correspondence groups with the others in its bin, and the weight at level zero is 1, very similar to how we did it for the pyramid match kernel. So, if you look at the similarity scores now, for c1, which is here, you have two new correspondences.