Over the lectures so far, we've talked about basic methods to process images. We talked about operations such as convolution and correlation. Then we talked about how we can use such operations to detect edges and corners in images: different kinds of corners, different methods to extract those corners, as well as how to describe these corners in ways in which they could be used for further tasks. We also talked about how this process could be similar to how the human visual system perceives the world around us. One of the aspects we mentioned is that if you have two different images and, say, you want to stitch a panorama consisting of these two images, or more than two, we ideally detect interest points in both of these images, get descriptors of each of these interest points in both of these images, and then match points across these images. How do you match is what we're going to get into next, over the next two lectures. We will talk about a few different methods to match key points between images. And not just matching key points between images: we will also try to use these methods for other kinds of tasks, like finding different kinds of shapes in images, such as circles, lines, or whatever shape you like, as well as a few more descriptors beyond what we've seen so far. Most of this week's lectures are based on the excellent lectures of Professor Yannis Avrithis at the University of Rennes, Inria, in France. If you recall, we gave this example earlier of two images taken of the same scene, perhaps from different viewpoints, perhaps at different parts of the day, or perhaps with just different illuminations or different camera parameters. If you want to stitch a panorama out of these two images, the standard process is to find key points and match them. We know how to find key points in both these images individually. We also know how to describe each of those key points as a vector.
We've seen SIFT, we've seen HOG, we've seen LBP — a few different methods to do this. The question that's left is: if you now have the key points and descriptors from two different images, how do you actually match them and be able to align them? That's what we're going to do next. We'll start with a very simple method called dense registration, or optical flow, a fairly old method, which pertains to a setting where there is a very small change between successive images. So, again taking the example of your cell phone: if you gradually move your cell phone over a scene and you want the app to stitch a panorama, the differences between successive images are going to be very small. If you've tried this yourself, you will notice that in certain cases, if you move your hand very fast, you will get an error message asking you to repeat and move your hand more slowly to get a panorama from the app on your cell phone. So in these kinds of cases, the displacement of the scene between successive images is very small, and in these settings you can use this kind of method of dense registration, or optical flow. Here is a visual example of a scene where a boat is moving across water. You can see that the scene is more or less the same, but there are a few changes in the position of the boat. Our goal here is, for each location in the image — say a key point in the image — to find a displacement with respect to another reference image. Once you have the displacement, you can simply place one image on top of the other image and be able to align them. So this kind of dense registration method is generally useful for small displacements, such as in stereo vision or optical flow. To understand how to do this, let's first take a one-dimensional case, work out the math, and then go to the two-dimensional case. So let's consider the one-dimensional case. Let's consider a function f(x), which is given by this green curve.
And let's consider this function g(x), which is simply a displaced version of f(x). Mathematically speaking, we can say that g(x) = f(x + t); it's just a displaced version of f(x). We also assume that t is small — we are only looking at small changes between successive images. By first principles, the definition of the derivative says that df/dx is the limit, as t goes to zero, of (f(x + t) − f(x)) / t. But we know now that f(x + t) is g(x), which means we can write df/dx ≈ (g(x) − f(x)) / t. Where do we go from here? Now we define the error between these two signals — in this particular case a one-dimensional signal — as a weighted combination. This is very similar to the weighted auto-correlation that we talked about for the Harris corner detector; just that in that case we talked about auto-correlation, whereas here we are looking at differences between two signals, f and g. So you have f(x + t) and g(x) — that's going to be the difference — and you take a weighted combination of these to be able to find the actual displacement: E(t) = Σ_x w(x) [f(x + t) − g(x)]². Now, the first term, f(x + t), can be expanded using a first-order Taylor series as f(x) + t ∇f(x); the remaining terms are the same across the two equations. The first term is simply expanded as a first-order Taylor series, and you get the right-hand side of this equation. Where do we go from here? We know that the error is minimized when the gradient vanishes. So we take dE/dt, which is just a simple derivative of this right-hand side; the summation over x of w(x) stays the same as before.
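The 1D case above can be sketched numerically. This is a minimal illustration, not code from the lecture, assuming a smooth signal, a small shift, and uniform weights w(x):

```python
import numpy as np

# 1D dense registration: estimate the small displacement t such that
# g(x) ~= f(x + t), using the first-order Taylor expansion
#   f(x + t) ~= f(x) + t * f'(x).
# Minimizing E(t) = sum_x w(x) (f(x) + t f'(x) - g(x))^2 gives
#   t = sum_x w f'(x) (g(x) - f(x)) / sum_x w f'(x)^2

def estimate_shift_1d(f, g, dx=1.0, w=None):
    if w is None:
        w = np.ones_like(f)          # uniform weights w(x)
    fp = np.gradient(f, dx)          # numerical derivative f'(x)
    return np.sum(w * fp * (g - f)) / np.sum(w * fp * fp)

# Example: a Gaussian bump shifted by t = 0.1
x = np.linspace(0.0, 10.0, 1001)
f = np.exp(-(x - 5.0) ** 2)
t_true = 0.1
g = np.exp(-(x + t_true - 5.0) ** 2)   # g(x) = f(x + t_true)

t_hat = estimate_shift_1d(f, g, dx=x[1] - x[0])
```

The estimate is accurate because the Taylor approximation holds well for such a small shift; for larger shifts it degrades, which is exactly the aperture-style limitation discussed below.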
And the term that depends on t is this particular term. So if you take the derivative of that, you get 2 times the entire term inside the brackets, times the derivative of the part affected by t, which is ∇f(x). So you have 2 ∇f(x) times the entire term inside the brackets. We set the gradient to zero and then solve for what we're looking for. Now, simply expanding the equation, you can take terms out on both sides and write this out — I'm going to ignore the summation and the arguments, just for simplicity of explanation. If we ignore those, you'd have w ∇f ∇fᵀ t on one side and w ∇f (g − f) on the other; which side you take them to doesn't matter, because the whole expression equals zero and we're trying to solve for t. So by doing this, you can solve for t and figure out the displacement between these two signals. What is the two-dimensional equivalent? It's exactly the same set of equations, just that instead of a one-dimensional signal you now have an image patch defined by a window w, and we then try to find the error between the patch shifted by t in the reference image f and the patch at the origin in the shifted image g. If you move f by a certain t in the original image, do you get g? That is the question we want to ask. We want to find the t that minimizes this error, because that would give you the displacement between f and g. By solving for this, you can find the displacement and now be able to match or align these two images. That is the method of solution. One of the problems with this approach is the same aperture problem that we dealt with when we moved from images to the Harris corner detector. Remember, the aperture problem simply means that you can only solve this problem for a very local neighborhood. Why so?
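The two-dimensional version just described can be sketched in the same way. This is a minimal illustration under the same assumptions as the 1D sketch (smooth image, small shift, uniform window weights), solving the 2×2 linear system from the derivation:

```python
import numpy as np

# 2D equivalent: solve the 2x2 linear system
#   [sum w fx^2   sum w fx fy] [tx]   [sum w fx (g - f)]
#   [sum w fx fy  sum w fy^2 ] [ty] = [sum w fy (g - f)]
# over a window w, exactly as in the 1D case but with a gradient vector.

def estimate_shift_2d(f, g, dx=1.0):
    fx, fy = np.gradient(f, dx)          # gradients along the two axes
    d = g - f
    A = np.array([[np.sum(fx * fx), np.sum(fx * fy)],
                  [np.sum(fx * fy), np.sum(fy * fy)]])
    b = np.array([np.sum(fx * d), np.sum(fy * d)])
    return np.linalg.solve(A, b)

# Example: a 2D Gaussian blob, with g shifted by (0.3, 0.2) relative to f
h = 0.05
u = np.arange(0.0, 20.0, h)
X, Y = np.meshgrid(u, u, indexing="ij")
f = np.exp(-((X - 10) ** 2 + (Y - 10) ** 2) / 4.0)
g = np.exp(-((X + 0.3 - 10) ** 2 + (Y + 0.2 - 10) ** 2) / 4.0)

t_hat = estimate_shift_2d(f, g, dx=h)    # approximately [0.3, 0.2]
```

Note that the 2×2 matrix here is the same structure-tensor-like matrix as in the Harris detector, which is why the aperture problem shows up in both: on a flat or purely edge-like patch, the matrix is (near-)singular and t cannot be recovered.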
Because the entire definition — the way we solve the problem — assumes a local neighborhood. If you look at the first-order Taylor series expansion, that approximation holds only for a local neighborhood, which means this entire formulation holds only if the displacement is inside a very small neighborhood. And that's the reason why we say that this method works only when there are very small changes between successive images. So what do we do if there is more than a minor difference between these two images? For example, a few slides ago we saw those images of mountain ranges. It didn't look like those two images were displaced by a very small amount; it looked like there was a significant rotation or a significant perspective difference in how those pictures were taken. How do you solve those kinds of cases? For that, we go into what is known as wide-baseline spatial matching. In wide-baseline spatial matching, there is a difference from dense registration. Just to repeat: in dense registration, we started from a very local template-matching process, and we found an efficient solution based on a Taylor approximation; both of these make sense for small displacements. But in wide-baseline spatial matching, for all you know, any part of one image could appear in any part of the second image. It's no longer a small displacement: you could have a key point lying in the top left of one image and the bottom right of the other image, and we still want to be able to match these points across these images. How do we go about this? The key intuition is going to be to start by pairwise matching of local descriptors. So you have a bunch of key points in image one and a bunch of key points in image two, and for each of these key points you have a descriptor. You now match those descriptors with the descriptors of all key points in the second image. Wherever you have the best match of descriptors,
you're going to say that this point in image one is likely to match with a certain point in image two, and these points could be at completely different coordinate positions in the first image and the second image. So we start by pairwise matching of local descriptors, with no order imposed. And then we try to enforce some kind of geometric consistency, according to a rigid-motion model. So we know that in the real world you can perhaps rotate an image, translate or move your camera, or pan your camera; you can probably zoom in and zoom out. There are a few different transformations that are generally possible; all of these are what we mean by a rigid-motion model or geometric consistency. So we're going to assume a particular model that could have taken place, and using these pairwise matchings of local descriptors, we're going to try to solve for the parameters of the transformation between the two images. This is going to be the key idea; we now talk about how we actually go about it. Once again, in wide-baseline spatial matching, you could have two images such as this, where a region in one image may appear anywhere in the other. There could be a zoom in or zoom out, a different angle, or it could be translated by some amount; any of those could happen when we'd like to do this kind of matching. So as we already said, we first independently detect features in both these images — you can see the different features detected across these images. Then we try to do pairwise descriptor matching: for each detected feature, we can compute a descriptor such as a histogram of oriented gradients, or local binary patterns, or a variant of these, and so on and so forth, and we then try to do a pairwise matching of the descriptors between the key points of these two images. Clearly, when there is a lot of change between two images, it's not necessary that every key point will match
some key point in the other. In this particular case, you can see that the car does not even exist in the second image, so any key points on the car would not have an equivalent match in the second image, which is perfectly fine with us. So only a subset of the features detected in the first step will actually lead to matches, in both of these images: in the first image, only a subset of features will match with the second image, and even among all the features detected in the second image, only a subset will match with the features in the first image. How do you match? Once you get the descriptors as vectors, you can simply take the Euclidean distance between the descriptors of the features in both these images to be able to match them; you can use other kinds of distances too. Once you get these tentative matches, we try to assume a certain geometric model. For example, we can say that we know that in our particular domain only a translation is possible, or only a translation and rotation is possible, because with my camera there is no zoom in or zoom out that could happen. So if you knew the conditions under which a particular capture was taken, you know what transformation could have taken place between the first image and the second image. You assume a certain rigid transformation, and you find, among those pairwise correspondences — the correspondences that we saw on the previous slide — which of them are inliers to this rigid transformation that you assumed. We will come a bit later in this lecture to how the rigid transformation is represented and how we find out which points are inliers; we come back to this in a few slides. But this is the overall idea: among all of those correspondences, you narrow down to a few which satisfy your hypothesis of what could have happened.
And then, once you get that subset of inliers, you can simply match and find the transformation and align one image on top of the other. So let's talk about those steps in more detail over the next two slides. We first extract descriptors from the key points in each image. For each detected feature, you could do something like constructing a local histogram of gradient orientations; you could do other kinds of things too — this is just an example. You find one or more dominant orientations corresponding to the peaks of the histogram. Remember, in SIFT we talked about finding the orientation of each key point — that's what we're talking about here. At that point, you may want to resample the local patch at a given location, scale, and orientation. Based on what feature detector you used, you could have a location for that key point, a scale, and also an orientation, so you could resample the local patch. When we say resample: if it's a rotated patch, you may want to resample it by doing some interpolation, and so on. You resample the local patch, and then you find a descriptor for each dominant orientation — that gives you your descriptors. Remember again, just like we spoke for SIFT, you could take multiple descriptors for a single key point if there are different orientations that are dominant; we talked about this. Now, at the end of that step, we have a bunch of descriptors in image one and a bunch of descriptors in image two. As we go forward, for each descriptor in one image, we find its two nearest neighbors in the other image. Why two? It's just one method; you can also take other numbers of nearest neighbors if you like. In this method, we take two nearest neighbors, and we then evaluate the ratio of the distance of the first to the distance of the second.
So you have the distance between the descriptor in the first image and its first match in the second image, and the distance of the same descriptor from the first image to its second-closest match. If the ratio between the two is one, that means both are equally good matches. If in one case the distance is very low, but in the second case the distance is very high, you now know which of them is significantly closer. You can threshold this ratio to find out which of them are strong matches. So whenever this ratio is small, you know that you have found a very strong match, because the second-nearest neighbor's distance is very far away. That's what the ratio measures. Whenever you have a strong match, you're going to consider that a correspondence, and after you do all these pairwise matchings, you have a list of correspondences between image one and image two. What do we mean by correspondences? You're simply saying that a certain descriptor in image one corresponds to a certain descriptor in image two — something like that. You can draw up a table of correspondences between the descriptors of these two images. Here is an illustration of the ratio test. You can see here that for correct matches, the ratio of distances forms this kind of distribution — it's much smaller — whereas for incorrect matches the ratio keeps going further up towards one. For incorrect matches, the ratio is going to be close to one, which means the first match is only as good as the second match, and then you're not very sure whether the match is strong enough; when the first match's distance is much less than the second match's distance, you know that you're doing a good job. You can also expand this to more nearest neighbors and extend the concept of the ratio, if you like, to get a better idea of the robustness of the match.
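The ratio test just described can be sketched as follows. This is a minimal illustration, not the lecture's code; the 0.8 threshold is a common choice in practice, not a value from the lecture:

```python
import numpy as np

# Ratio test: keep a match only when the nearest descriptor in the
# second image is much closer than the second-nearest one.

def ratio_test_matches(desc1, desc2, ratio=0.8):
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)   # Euclidean distances
        order = np.argsort(dists)
        nearest, second = dists[order[0]], dists[order[1]]
        if nearest < ratio * second:                # strong, unambiguous match
            matches.append((i, int(order[0])))
    return matches

# Toy descriptors: the first query has one clear match, the second is
# ambiguous (two near-identical candidates), so it is rejected.
desc1 = np.array([[0.0, 0.0], [5.0, 5.0]])
desc2 = np.array([[0.1, 0.0], [9.0, 9.0], [5.1, 5.0], [5.0, 5.1]])

print(ratio_test_matches(desc1, desc2))   # only (0, 0) survives
```

The second query is rejected not because it has no close neighbor, but because it has two equally close neighbors — exactly the "ratio close to one" case described above.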
Once you've identified these good matches, we move on and try to estimate which of them are inliers with respect to the rigid transformation we assume. Before we go there, let's try to understand why this is a difficult process by itself. We've so far spoken about a few steps. Firstly, we have to choose key points for these kinds of correspondences that allow for a geometric transformation — that may not be trivial in several images. Fitting the model, or the geometric transformation, to the correspondences that we have found could be sensitive to outliers. It's possible, just by chance, that a correspondence could be wrong: in the new image there may have been a new artifact that came in, which was not there in the first image, and which ended up matching a key point in the first image. In that particular case, it could simply be an outlier match, which could make fitting your geometric model a little harder. To find inliers to a transformation, you first of all need to find a transformation. So far I kept telling you that you can assume a transformation, but assuming a transformation is not trivial: you need domain knowledge, and you may perhaps need to do something more to be able to find out what the transformation should be in the first place, before fitting these correspondences to it. In certain cases, such as with outliers, correspondences can also have gross errors: it's likely that in certain cases the correspondences lead to mistakes, and it's possible that the patch may not have had the right descriptor to get correspondences for certain features, so you could have errors in these kinds of cases. And inliers are often less than 50% of your total correspondences — generally even less — which means the number of inliers that you are left with at the end, that you can actually work with, is very few.
So for the next part — to be able to understand how you match these correspondences to the rigid transformation model — let's actually talk about what we mean by geometric transformations here, and then we'll come back and try to align the correspondences to one particular transformation. Given two images I and I′ with corresponding points x and x′, we know that I(x) = I′(x′). This simply says that across these two images, you could map the point x to the point x′ in the second image; or you can write this as x′ = T(x), some transformation of x. We got the point x′ by perhaps rotating the first image, or by translating the first image, or by zooming into the first image; T maps x to x′ through those kinds of transformations — be it rotation, translation, or scaling — as a transformation matrix. And what does a matrix do? It is an operation that takes a vector in R² and gives you another vector in R². For that matter, any matrix can be looked at as a transformation in this perspective. So given a point — a coordinate location (x, y) in image one — the transformation takes you to another point (x′, y′) in your second image. And this transformation is going to be a bijection, which means it's a one-to-one match between image one and image two: every point in image one matches to only one point in image two, and every point in image two matches to only one point in image one. It's going to be a bijection. Let's briefly study what these transformations look like. A transformation, we said, is a matrix, and for a certain set of common transformations these are fairly well defined, especially the rigid-body transformations. This has been extensively studied, especially in graphics and vision, which we talked about in the first lecture. So we'll briefly talk about this now, to understand how the matching is done.
So suppose you have this green triangle in the first image, and you translate it — you just move it slightly along the x-axis, the y-axis, or both axes — so it moves to a slightly different location in the second image. In this particular case, you would define the transformation by a three-by-three matrix whose top-left two-by-two block is the identity, [[1, 0], [0, 1]], and whose last column contains t_x and t_y, which correspond to the translation along the x-axis and the translation along the y-axis. If you work this out, whenever you apply this transformation on (x, y, 1) — where the 1 is simply the homogeneous coordinate used to represent this transformation — you get the outcome (x + t_x, y + t_y, 1). Let's analyze this a bit carefully. It's simply a matrix-vector multiplication, and if you carry it out, you'll see that this is just another way of writing a system of equations. The system of equations says x + t_x = x′, and similarly y + t_y = y′. The third one doesn't matter; you just have 1 = 1. But this is exactly what you're looking for. It's just a system of equations, written in terms of a matrix transformation applied to a vector to give you another vector. This is translation. Let's see one more. If you took a rotation, this green triangle is now simply rotated — there's no translation, it's only rotated. You can see the zeros here for the translation part, which means there is zero translation, but there is rotation. In this case, the upper two-by-two block of the three-by-three matrix is given by [[cos θ, −sin θ], [sin θ, cos θ]]. I'll let you look at this more carefully; it's a simple expansion.
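The translation and rotation matrices just described can be sketched in code. This is a minimal illustration of the homogeneous-coordinate trick, not code from the lecture:

```python
import numpy as np

# Homogeneous 3x3 matrices for the transformations on the slides.
# A point (x, y) is represented as (x, y, 1) so that translation
# becomes a matrix-vector product too.

def translation(tx, ty):
    return np.array([[1.0, 0.0, tx],
                     [0.0, 1.0, ty],
                     [0.0, 0.0, 1.0]])

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c,  -s,  0.0],
                     [s,   c,  0.0],
                     [0.0, 0.0, 1.0]])

def apply(T, x, y):
    p = T @ np.array([x, y, 1.0])
    return p[0], p[1]

# Translating (2, 3) by (5, -1) gives (7, 2);
# rotating (1, 0) by 90 degrees gives (0, 1).
print(apply(translation(5, -1), 2, 3))
print(apply(rotation(np.pi / 2), 1, 0))
```

Composing transformations is then just matrix multiplication of the 3×3 matrices, which is the whole point of moving to homogeneous coordinates.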
Again, you would have x cos θ − y sin θ = x′, and x sin θ + y cos θ = y′. That simply represents the new coordinates based on your rotation angle. So you can see here that, going back to the previous slide, in translation there are two degrees of freedom, t_x and t_y; in rotation, there is just one degree of freedom, which is given by θ. Another transformation is called the similarity transformation, which has four degrees of freedom: it combines rotation's one degree of freedom with translation's two degrees of freedom, but you also have a scaling factor here, given by r, which can change the size of the object in the second image. When we say size or scale, remember it would correspond to zoom in or zoom out in terms of the camera parameters. So now you have r, θ, t_x, and t_y — four degrees of freedom in this geometric transformation. Moving forward, this is another example of a similarity transformation where you can see the zoom in / zoom out in action, where r has a non-one value. Another transformation is known as the shear transformation. You can see here how the triangle gets transformed from image one to image two; this is known as shear, where you apply pressure on one of the sides of the triangle and extend it, while keeping the other sides perhaps constrained. This is given by changing just the off-diagonal quantities, b_x and b_y, in your transformation matrix, while the diagonal entries stay one. So for shear, you can write the equations as x + b_x y = x′ and b_y x + y = y′. This is simply the linear-system-of-equations way of writing the transformation.
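The similarity and shear transformations can be written as 3×3 matrices in the same homogeneous form as translation and rotation. A minimal sketch, not from the lecture:

```python
import numpy as np

# Similarity transformation: 4 degrees of freedom (r, theta, tx, ty).
# The upper-left block is r times a rotation matrix. Shear instead
# places off-diagonal terms bx, by with ones on the diagonal.

def similarity(r, theta, tx, ty):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[r * c, -r * s, tx],
                     [r * s,  r * c, ty],
                     [0.0,    0.0,   1.0]])

def shear(bx, by):
    return np.array([[1.0, bx,  0.0],
                     [by,  1.0, 0.0],
                     [0.0, 0.0, 1.0]])

# Scaling (1, 0) by r = 2 with a 90-degree rotation and shift (1, 1):
# the rotation+scale takes (1, 0) to (0, 2), then translation gives (1, 3).
p = similarity(2.0, np.pi / 2, 1.0, 1.0) @ np.array([1.0, 0.0, 1.0])
```

Setting r = 1 recovers a pure rotation plus translation, which makes the degree-of-freedom counting above easy to check by hand.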
A popular transformation, known as the affine transformation, has six degrees of freedom, where you can have arbitrary values for any of the six spots in the top two rows of the transformation matrix that we spoke about. We're going to stick to this set of transformations for now. There are also transformations that use the values in the bottom row — projective, or perspective, transformations — but we are not going to get into those at this particular point in time; we want to stick to affine transformations. In all of these cases, as you can see, using those tentative correspondences that we get between the two images, we can find out which (x′, y′) matches to which (x, y): (x′, y′) in image two could be matching with (x, y) in image one. We already have a list of correspondences based on the matching of descriptors; our job is to find out the parameters of this transformation. That's what we want to look for. Clearly, this is about solving a linear system of equations. So we want to solve a linear system Ax = b, where x and b come from the coordinates of the known point correspondences from images I and I′, and the system contains the model parameters that we want to know. Ideally speaking, if we have d degrees of freedom in a given transformation, we need the ceiling of d/2 correspondences. For example, for translation there are two degrees of freedom, which means you need just one correspondence: if you have one point in one image and the corresponding point in the second image, you can find both t_x and t_y, because you will know how much you moved in x and how much you moved in y. So, given d degrees of freedom, you need about ceil(d/2) correspondences from your descriptors. Now, how do you solve for this?
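The linear system for the affine case can be sketched directly. This is a minimal illustration, not the lecture's code; it stacks two equations per correspondence and solves by least squares, so it works with exactly three correspondences or with more:

```python
import numpy as np

# Estimating the 6 affine parameters (a, b, c, d, tx, ty) from point
# correspondences (x, y) -> (x', y'):
#   x' = a x + b y + tx
#   y' = c x + d y + ty
# Each correspondence gives two equations, so ceil(6/2) = 3
# correspondences suffice; extra ones are handled by least squares.

def estimate_affine(pts1, pts2):
    A, b = [], []
    for (x, y), (xp, yp) in zip(pts1, pts2):
        A.append([x, y, 1, 0, 0, 0]); b.append(xp)
        A.append([0, 0, 0, x, y, 1]); b.append(yp)
    theta, *_ = np.linalg.lstsq(np.array(A, float), np.array(b, float),
                                rcond=None)
    return theta.reshape(2, 3)        # [[a, b, tx], [c, d, ty]]

# Check against a known affine map
M_true = np.array([[1.2, 0.1, 3.0], [-0.2, 0.9, 5.0]])
pts1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
pts2 = pts1 @ M_true[:, :2].T + M_true[:, 2]

M_hat = estimate_affine(pts1, pts2)   # recovers M_true
```

With noise-free correspondences the recovery is exact; with noisy or outlier-contaminated correspondences, the plain least-squares solution degrades, which is the motivation for the next part.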
Just to recall and repeat what we've talked about so far: we found key points in each of the images, we found descriptors, and then we matched the descriptors between these two images. Then, based on the nearest-neighbor ratio approach, we pruned those descriptor matches down to a small set of strong matches. Among those, we now want to find out which of them suit the rigid-body model that I assume for the transformation between the two images. So if I assume an affine transformation, then using the set of correspondences that I have, I need to solve for the six values of my transformation. And once I've solved for those values, I know the transformation between these two images, so I can simply place one image on top of the other using that transformation, blend them, and create a panorama. So we are left with one task: how do you actually estimate those parameters, given those correspondences? Let's start with the simplest approach that we all know: fitting a line through two points — the simplest model that we can imagine. Let's use that to describe this further, with a least-squares approach to fitting correspondences. If you have a bunch of correspondences here — this is clean data, not many outliers — the least-squares fit would give you a fairly good equation for the line. Here we're just talking about a line to keep the abstraction simple, but we'll come back and make it clear how you really estimate the parameters of the transformation. What if there are outliers in your matches? Then the least-squares fit fails and gives a very different answer compared to what it should have been. So what do we do here? We use RANSAC: repeatedly fit the model to a small random subset of correspondences and keep the hypothesis with the most inliers. Here are a few visual illustrations of how well RANSAC works for different kinds of transformations.
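The line-fitting example above can be turned into a minimal RANSAC sketch. This is an illustrative implementation, not from the lecture; the iteration count and inlier threshold are arbitrary choices for the toy data:

```python
import numpy as np

# Minimal RANSAC for line fitting: repeatedly fit a line to a random
# pair of points, count inliers within a tolerance, keep the hypothesis
# with the most support, then refit on the best inlier set.

def ransac_line(x, y, iters=200, tol=0.3, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(x), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                           # skip vertical sample pairs
        m = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - m * x[i]
        inliers = np.abs(y - (m * x + b)) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    m, b = np.polyfit(x[best_inliers], y[best_inliers], 1)
    return m, b, best_inliers

# 20 points on y = 2x + 1 with small noise, plus a few gross outliers
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.05, 20)
x = np.append(x, [1.0, 3.0, 6.0, 8.0])
y = np.append(y, [15.0, 0.0, 25.0, 2.0])

m, b, inliers = ransac_line(x, y)    # m near 2, b near 1
```

The same loop structure applies to the affine case: sample ceil(d/2) correspondences instead of 2 points, fit the transformation, and count correspondences that it maps within a pixel tolerance.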
Here is an example of rotation. This is the original object — not a book, sorry, I think it's a food box — rotated by a certain degree, and this is the same box placed in a different position in the second image; RANSAC finds fairly good transformations between these two settings. It also works well at estimating what is known as the fundamental matrix, when you relate two views of the same scene. If you have two different views — remember, this is how you would build a 3D model of a given scene — and if you wanted to build a 3D model of, say, this statue, you would ideally take multiple images by slowly moving around this particular 3D object, and you would get a 3D model. In each of those cases, between every pair of images that you've captured, you have to estimate this transformation matrix, which is known as the fundamental matrix in this particular case.
Log in to save your progress and obtain a certificate in Alison’s free Understanding Visual Matching in Computer Vision online course