Last lecture, we spoke about feature extractors and feature descriptors; in particular, we spoke about SIFT and its variants. Before we go on to other kinds of features, we'll take a slight detour and talk about another important task in the processing of images: image segmentation. Human vision has been studied extensively from several perspectives, and one perspective on human perception is based on what is known as Gestalt theory, which emerged almost a hundred years ago. Its fundamental principle is that the whole is greater than the sum of its parts, and one of the important cornerstones of this theory is the idea of grouping in understanding human vision. Here is a very popular optical illusion, known as the Müller-Lyer illusion, which asks the viewer which of these horizontal lines is longer. Because of human perception's bias towards viewing things as a group, and not as individual components, the illusion arises and makes one line look longer than the other, while the horizontal lines are exactly the same length in both figures. Gestalt theory proposes various factors in images that can result in grouping. On one hand, proximity could be a reason: when there are objects or artifacts in the image that are simply close to each other, and these groups lie in different parts of the image, that can result in one kind of grouping. Or it could just be the similarity of the objects themselves: some objects are hollow circles, some are filled circles, and just because these objects are different, they get grouped differently when we perceive the image. Other factors could be parallelism: when there are groups of lines that have similar properties in terms of parallelism, we end up grouping them. Another option could be symmetry: just seeing artifacts arranged in a symmetric sense makes us group them in one particular manner.
Further, similarity can be of different kinds. You could have common fate, in terms of arrows telling you where dots are going: in the second example here, all the dots are nominally lying in a straight line, but the arrows let us group them differently, since the direction along which they are about to move gives the human visual system a cue to how they should be grouped. Grouping could also arise purely because of a common region that is already marked, say by a circle; it could be due to continuity of artifacts, or due to closure of artifacts. As you can see, there are many factors that lead human perception to view things as groups. Gestalt theory, however, is fairly descriptive: it lays down various rules and principles that could lead to grouping when humans perceive images, but these rules by themselves are not clearly defined enough to become algorithms. You really cannot take a rule directly and implement pseudocode for it that could help you find, say, groups in an image. So for the rest of this lecture, we're going to talk about the task called image segmentation, where we try to group similar pixels into different groups. You could call this akin to clustering, where you group different pixels into their own clusters. A significant part of this lecture is taken from David Forsyth's book and Szeliski's book, and we will share these references at the end. One of the oldest methods for image segmentation is known as watershed segmentation. This was developed way back in 1979. Image segmentation, remember, is the task of grouping pixels in an image; that is the task we're looking at. The watershed segmentation method segments an image into what are known as catchment basin regions. It goes along with the name; that's why it's called watershed segmentation. This will become clear in the next slide.
You can view any grayscale image as a 3D topological surface, very similar to how we viewed an image as a function way back in one of the earlier lectures. For example, given this grayscale image, the brightness peaks at certain locations and subsides in other locations, and the dark areas are where you have black pixels in the original image. What watershed segmentation tries to do is segment the image into regions with the conceptual idea of rainwater flowing into the same lake. You simulate the flooding of water in your image, and wherever you have catchment basins in this 3D topographical representation of the original image, wherever the water stagnates, those groups of pixels form segments. Let's see this in a bit more detail. You first identify various local minima in your image. Local minima are simply pixel locations where your image intensity value is low relative to its neighborhood. Remember that image intensity can lie between 0 and 255, or between 0 and 1 if you normalize the values in the image. You take certain neighborhoods, say 5x5, whatever that may be, and using this process you find the local minima that lie across the image. You then flood the landscape from the local minima and prevent the merging of water from different minima. Which means you start with a local minimum and slowly keep expanding, which is equivalent to the flooding here, until you reach a point that is reachable in the same number of steps from another local minimum; that's where the boundary between the two regions lies. This simple process results in a partition; there is nothing very fancy about it beyond what I just mentioned.
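The flooding idea above can be sketched in one dimension. The following is an illustrative toy, not the full 1979 algorithm: every position is assigned to the local minimum it reaches by repeatedly stepping downhill on the intensity profile, so each "valley" becomes one catchment basin. The profile values are made-up numbers.

```python
def watershed_1d(intensity):
    """Assign each index of a 1D intensity profile to a catchment basin."""
    n = len(intensity)

    def descend(i):
        # Follow the steepest-descent neighbour until a local minimum.
        while True:
            best = i
            for j in (i - 1, i + 1):
                if 0 <= j < n and intensity[j] < intensity[best]:
                    best = j
            if best == i:
                return i
            i = best

    basin_of = [descend(i) for i in range(n)]
    # Relabel basins 0, 1, 2, ... in order of first appearance.
    labels, seen = [], {}
    for b in basin_of:
        labels.append(seen.setdefault(b, len(seen)))
    return labels

# Two "valleys", around index 2 and index 8, give two basins.
profile = [3, 2, 1, 2, 4, 5, 3, 1, 0, 2]
labels = watershed_1d(profile)
```

The boundary between the two basins falls where flooding from the two minima would meet, exactly as described above.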
So this process results in partitioning the image into catchment basins and watershed lines. Remember, this method was proposed in the late seventies, and it was used for many years to segment images into various parts. It was generally applied on image gradients rather than the image itself. To give you a visual example: the original image is this one, and we ideally want to group its pixels. You first take a gradient image; by now you know how to take a gradient image with an edge filter applied on the original image. The watersheds of your gradient image are shown in the bottom-left image, and you finally refine and get your segmentation output and overlay it back on the original image in the last panel. It's a simple method, but it works rather effectively. However, it does have some limitations if there is a lot of noise. For example, say you have a heavily textured image; you can imagine a lot of undulations in the texture of the image, and a lot of that is noise in this setting. You would end up with something like this, because there are going to be a lot of catchment basins. Trying to flood those areas will lead to many watershed lines that separate many different regions, resulting in what is known as over-segmentation. Keeping this in mind, watershed segmentation is typically used as part of an interactive system, where a user marks a few different centers in the image, and the algorithm floods from those centers alone, not worrying about other centers that could be in the image. This idea can help overcome the over-segmentation limitation. If you'd like to know more, you can read Chapter 5.2.1 in Szeliski's book. This is one of the earliest methods. Since then, many methods have been developed, and one could broadly categorize them into two kinds: region splitting and region merging.
In region splitting methods, as the name suggests, the idea is to start with the whole image and keep splitting it into finer and finer regions. We'll see one example of this kind of method a little later in this lecture. The other kind, which is a bit more popular, are what are known as region merging methods, where you start with each pixel as a region and keep merging similar pixels to form regions; you could keep going all the way up to the complete image. Here is an example of an image that has been marked into groups of pixels; groups of pixels are also sometimes called superpixels, and this is often used as a preprocessing step for higher-level segmentation methods. One of the popular methods for image segmentation was graph-based segmentation, proposed in the early 2000s by Felzenszwalb and Huttenlocher, whose algorithm used dissimilarities between regions to decide which regions to merge in a given iteration. You consider an image as a graph G with vertices V and edges E, so you have a set of vertices and a set of edges, where the pixels form the vertices initially. Edges are defined between pixels: we define a pixel-to-pixel dissimilarity metric. For example, this could be based on the intensity values of the two pixels; you could also define it using more sophisticated mechanisms, by perhaps taking a neighborhood and computing oriented gradients, and so on and so forth. Whatever it may be, you have a pixel-to-pixel dissimilarity metric, defined as w(e), the weight of the edge e = (v1, v2) joining the two pixels on which we are defining the dissimilarity. For example, this dissimilarity metric could be the intensity differences between the 8-neighborhoods of both these pixels.
When we say 8-neighbors: you take the 3x3 neighborhood and, excluding the central pixel, you have eight different neighbors; we call these the 8-neighbors. So you take two different pixels, take their 8-neighbors, and find the intensity differences; that could be one way of computing w(e). Once w(e) is defined, the method defines a quantity called the internal difference for every region C. Initially, remember, a region C would just be a pixel, but in later iterations a region could be a collection of pixels. The internal difference Int(C) is defined as the largest edge weight in the region's minimum spanning tree. Which means, if you have a bunch of pixels, you find the minimum spanning tree of that bunch of pixels, and you take the largest edge weight in that minimum spanning tree; you define that as the internal difference. To some extent, it is simply the largest edge weight in that region; the minimum spanning tree is just a careful way of saying that. That's one quantity. The second quantity we define is MInt, the minimum internal difference. If you have two regions C1 and C2, the minimum internal difference between them is given by the minimum of two terms: the internal difference of C1 plus a manually chosen penalty, and the internal difference of C2 plus its corresponding penalty. The penalty is based on the number of pixels in the region, normalizing by the region size, for instance tau(C) = k/|C| for a constant k. So we have region C1, for which you compute the internal difference plus a penalty, and region C2, with its internal difference plus its corresponding penalty, and you take the minimum of the two: MInt(C1, C2) = min(Int(C1) + tau(C1), Int(C2) + tau(C2)). You compute this for any two adjacent regions, that is, regions with at least one edge connecting them.
With this, we define a quantity called Diff(C1, C2), given by the minimum-weight edge connecting the two regions. For example, you consider all edges (v1, v2) such that v1 comes from region C1 and v2 comes from region C2, so all edges between those two regions, and you take the minimum weight among the edges connecting them. With this quantity defined, the method defines a predicate D for any two regions C1 and C2: if Diff(C1, C2) is greater than the minimum internal difference MInt(C1, C2), the predicate is true; otherwise, it is false. This illustration should help you understand this a bit better. Diff is the minimum edge weight between the two regions, while MInt is the minimum internal difference within each of C1 and C2, whichever is smaller. If the difference between the two regions is greater than the minimum internal difference, you do not merge them; if it's the other way around, the regions are merged. When this predicate turns out to be false, you know that Diff is equal to or less than MInt, which means C2 is as close to C1 as the farthest-separated pixels within C1 itself, so you may as well merge C1 and C2. That's the main idea of this particular metric. To summarize, this algorithm merges any two regions whose difference is smaller than the minimum internal difference of the two regions. Here is an example of how this works using the 8-pixel neighborhood. You can see that the results are fairly good: the different players get separated reasonably well, and parts of the body get separated reasonably well. If you want more details on this, look at Chapter 5.2.4 in Szeliski's book. The next method we're going to look at is another popular method, probabilistic aggregation, which was proposed by Alpert et al. in 2007.
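The merge rule above can be sketched compactly in a Kruskal-style pass with union-find: process edges in increasing weight order and merge two regions whenever the predicate D is false, i.e. the joining edge weight does not exceed MInt. The edge weights, the constant k, and the toy 4-pixel image below are illustrative assumptions, not values from the lecture.

```python
class Regions:
    """Union-find over pixels, tracking Int(C) and |C| per region."""

    def __init__(self, n, k):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n   # Int(C): largest MST edge merged so far
        self.k = k

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def try_merge(self, a, b, w):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # MInt(C1, C2) = min(Int(C1) + k/|C1|, Int(C2) + k/|C2|)
        mint = min(self.internal[ra] + self.k / self.size[ra],
                   self.internal[rb] + self.k / self.size[rb])
        if w <= mint:               # predicate D is false -> merge
            self.parent[rb] = ra
            self.size[ra] += self.size[rb]
            self.internal[ra] = max(self.internal[ra], self.internal[rb], w)

def segment(n_pixels, edges, k=1.0):
    """edges: list of (weight, pixel_i, pixel_j), processed in sorted order."""
    regions = Regions(n_pixels, k)
    for w, i, j in sorted(edges):
        regions.try_merge(i, j, w)
    return [regions.find(i) for i in range(n_pixels)]

# Toy 1D "image": 4 pixels with intensities [10, 11, 50, 52];
# edge weights are the absolute intensity differences.
edges = [(1, 0, 1), (39, 1, 2), (2, 2, 3)]
labels = segment(4, edges, k=5.0)
```

With k = 5, the two low-intensity pixels merge, the two high-intensity pixels merge, and the large gap of 39 between the groups exceeds MInt, so two segments remain.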
This method is again a region merging algorithm, and, as the name says, it is a probabilistic approach. It uses cues such as the intensity values (the gray-level values) as well as some texture content of specific regions in the image. Initially, the method considers each pixel as a region, and then you assign a merging likelihood p_ij: for every two pixels i and j, there is a likelihood p_ij of how likely the two pixels are to merge. How do you assign that p_ij? It is based on intensity and texture. We're not going to go into each of those details; I will point you to the reference, and if you're interested in knowing exactly how it's implemented, you can look at the paper. But broadly, it's based on intensity, which you know how to compute by now, and texture, which we talked about last time; so p_ij is based on the intensity and texture similarities of the two pixels. Now, the graph at a particular iteration, say iteration s-1, is given by a set of vertices V^(s-1) and edges E^(s-1). The graph of the next iteration, G^s, in which you hope to merge a few regions, is obtained as follows. You consider a subset of seed nodes C from V^(s-1), and you merge nodes, or regions, if they are strongly coupled to regions in C. What we're doing here is: you take the subset of seed nodes C from V^(s-1), and then consider the other nodes, which are in V^(s-1) minus C; those are the remaining pixels at iteration s-1. For each such pixel i, you compute the ratio of the sum of p_ij over j belonging to C, divided by the sum of p_ij over j belonging to V^(s-1) minus C. When this ratio is greater than a threshold, you know that there are certain pixels
that are quite likely to merge with the pixels in the seed region C. When this ratio is greater than the threshold, which the paper recommends setting to 0.2, you merge those regions. Once you merge those regions, you propagate those assignments down to the finer-level children, which inherit the properties of whatever region they were merged into, and you repeat this process. Here is an example of how probabilistic aggregation works in practice; this is an image taken from Szeliski's book. You can see here that image (a) is the gray-level pixel grid, a 5x5 image. You can see the pixel couplings in image (b): a thick edge means a stronger coupling, a thin edge means a weaker coupling. You can see that the pixels that are bright are strongly coupled to each other because of their intensity similarity. Then you perform one level of coarsening, where you combine nodes that are strongly coupled. You would take certain seed pixels, which form C, then take all the other pixels in V^(s-1) minus C, and try to find out which of them have a high probability of merging, based on intensity similarity and texture similarity again, and you keep repeating this process. You can see that after two levels of coarsening, you get a fairly good estimate of the different regions in the image. Now we move on to another popular method known as mean shift segmentation. This method takes a different approach: it uses a mode-finding technique based on non-parametric density estimation. Non-parametric density estimation is a standard tool that people in machine learning use to estimate the density of any given set of observations. The feature vectors of each pixel in the image are assumed to be samples from some unknown probability distribution.
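The coupling test just described can be sketched as follows. The couplings p_ij in the toy graph are made-up values, and the ratio follows the lecture's formulation (sum of couplings to seeds divided by sum of couplings to non-seeds, compared against the recommended threshold of 0.2); consult the Alpert et al. paper for the exact definition used there.

```python
def strongly_coupled(i, seeds, others, p, threshold=0.2):
    """Return True if node i's coupling to the seed set dominates.

    p[(i, j)] holds the (symmetric) merging likelihood between i and j."""
    def coupling(a, b):
        return p.get((a, b), p.get((b, a), 0.0))

    to_seeds = sum(coupling(i, j) for j in seeds)
    to_rest = sum(coupling(i, j) for j in others if j != i)
    return to_rest > 0 and to_seeds / to_rest > threshold

# Toy graph: node 3 couples tightly to seed node 0, weakly to the rest,
# so it should be absorbed into node 0's region.
p = {(0, 3): 0.9, (1, 3): 0.1, (2, 3): 0.05}
merge_with_0 = strongly_coupled(3, seeds={0}, others={1, 2, 3}, p=p)
merge_with_1 = strongly_coupled(3, seeds={1}, others={0, 2, 3}, p=p)
```

Here node 3 passes the test for seed 0 (ratio 0.9 / 0.15) but fails it for seed 1 (ratio 0.1 / 0.95), matching the intuition that it should merge with the region it is most strongly coupled to.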
We are going to estimate the probability density function of these features and find its modes. In some sense, the modes are analogous to what we had in watershed segmentation, although in that case we looked for minima and now we're looking for maxima. Once you identify the modes, the image is segmented pixel-wise, by considering every set of pixels that climbs to the same mode as one consistent segment. We'll describe this over the next few slides. Consider the example image you see here, a fairly real-world image; on the right you see a representation of the image in a feature space. We talked about color spaces in week one, and one of the color spaces that is popularly used is known as L*u*v*, the Luv color space. This plot shows only the L*, u* features of the image, simply for ease of plotting. In this feature space, we ideally want to find the modes of the overall distribution of the intensities in the L*u*v* space. How do you find the modes of such a distribution? If you recall your studies in machine learning, one of the popular approaches to estimate a probability density function, or its modes, is kernel density estimation. We're going to use the same method here to estimate the PDF and the modes of the distribution of intensities in a particular feature space. While we have taken the Luv space as an example, you could do such mode finding in any other feature space, for that matter. Just for simplicity, we're going to consider a one-dimensional example; remember, once again, that this is a one-dimensional signal, and it can be extrapolated to an image as a 2D signal. The way we estimate this density function is by convolving the data: let's assume that you have observations given by these impulses.
You can see those vertical lines; those are your observations, your samples. You estimate your density function f, which is what we want, by convolving your data, your observations, with a kernel of some width h. Let's assume you have a kernel K; then you can write the estimate as f(x) = sum_i K((x - x_i)/h), where h is the width and K is your kernel function. Convolving your observations with this kernel gives you an estimate of the density function that your observations were drawn from. What do we do with this? To find the modes, mean shift segmentation uses a gradient ascent method with multiple restarts. Why ascent? Because we want to find the maxima, the modes of that probability distribution. So we start with some guess y_0 for a local maximum; it could be any point. Remember how to perform gradient ascent, which is an iterative procedure: you start with some point, and at every iteration you add some constant times the gradient to the previous estimate, and you update your estimate again and again. So you have a guess y_0, which could be any random point x. We then calculate the gradient of the density estimate f at that point; we just talked about how f(x) is computed using the kernel, and how to choose the kernel we'll talk about in a moment. So you have the gradient of the density estimate at y_0, and you take an ascent step in that direction. The gradient is going to be given by something like a sum over i of (x_i - x) g(||x - x_i||^2 / h^2), where g is nothing but -k', the negative derivative of the kernel profile; simple calculus. The gradient of the density function can actually be rewritten slightly differently; let's see the previous slide's expression again.
So this was the gradient we just wrote: sum over i of (x_i - x) g(||x - x_i||^2 / h^2), where g = -k', the negative of the derivative of the kernel profile k. We can show that this can be written slightly differently: the gradient can be expressed as a sum of the g(x - x_i) terms multiplied by a vector m(x), where m(x) is like a weighted mean of the x_i values considered in this computation: m(x) = (sum_i x_i g(x - x_i)) / (sum_i g(x - x_i)) - x. Why this is a correct rewriting of the previous slide's expression, I'll leave to you as homework; it's not that difficult to figure out, and if you work it out stepwise you'll get the answer. This vector m(x), which corresponds to the gradient direction, is known as the mean shift: it's the difference between x and the gradient-weighted mean of the neighbors around x. You can look at this expression very much like the convolutions we saw earlier; in this case we are only talking about 1D convolution for simplicity. So the vector m(x) is the mean shift, the difference between x and the weighted mean of its neighbors. Once you have your mean shift, which is also your gradient in this particular case up to a scale factor, your next iterate in gradient ascent is y_(k+1) = y_k + m(y_k). You can also write it purely in the weighted-mean form: the y_k and the -x term cancel, because you are evaluating m at y_k, and you'll be left with the weighted mean alone as the quantity that becomes y_(k+1).
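The update just derived, y replaced by the kernel-weighted mean of the samples, can be sketched in 1D. This is a minimal illustration with a Gaussian kernel; the bandwidth h and the sample data are made-up values.

```python
import math

def mean_shift_mode(samples, start, h=1.0, iters=100):
    """Climb to a mode of the kernel density estimate of `samples`.

    Each step replaces y with the kernel-weighted mean of the samples,
    i.e. y <- y + m(y) where m(y) is the mean shift vector."""
    y = start
    for _ in range(iters):
        weights = [math.exp(-((y - x) ** 2) / (2 * h ** 2)) for x in samples]
        y_next = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
        if abs(y_next - y) < 1e-8:   # converged to a mode
            break
        y = y_next
    return y

# Two clusters of samples, around 0 and around 10: restarting from
# different guesses climbs to different modes, giving two segments.
data = [-0.3, 0.0, 0.2, 0.1, 9.8, 10.0, 10.3]
mode_left = mean_shift_mode(data, start=1.0)
mode_right = mean_shift_mode(data, start=9.0)
```

Points whose climbs converge to `mode_left` would form one segment and points converging to `mode_right` the other, which is exactly the multiple-restart procedure described above.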
You keep doing this process iteratively, and you finally reach a mode of the distribution. All the neighboring points that, by climbing, lead to this mode form one segment; other points, which by climbing reach other modes, form other segments. Clearly, this method relies on selecting a suitable kernel; in addition, the width (bandwidth) you choose is also important and impacts the final result. And while in the above description the approach seemed to be color-based (we took it in the Luv space), you can also consider other kinds of features, or other feature spaces, in which to find the modes. For more details, please look at Chapter 5.3.2 in Szeliski's book. One more method, which is not a region merging method but a region splitting method, is normalized cut segmentation. Once again, this is a very popular method that was used for many years. It is again a graph-based method, but with a different approach: in this case, a graph representing the pixels in an image is successively split into parts. It's not about merging this time; it's about splitting. How do you split? You assign edge weights between pixels based on their similarity: if two pixels are very similar, maybe they have similar intensities, you give that edge a high weight, and if two pixels are far apart in similarity, you give it a low weight. Here's an example of such a graph that is constructed. The goal would then be to find the min-cut. We're not going to work out the entire method here; if you're interested, you can look at the references and learn more about it. But you're trying to find the min-cut, which is going to give you two different regions, as shown here; when you remove those cut edges, you get the corresponding two segments, which is your ultimate objective. The min-cut is defined as the sum of all the weights being cut. So you consider two sets A and B, two sets of vertices.
For all i belonging to A and j belonging to B, you add up those edge weights w(i, j). Whichever choice of A and B gives you the minimum cut value, that is the cut you choose to divide the graph into two segments. Clearly, as you can observe here, this method only divides the graph into two segments at a time, so if you want more segments, you'll have to do this more times, or perhaps use other heuristics if you want to combine them later into fewer segments. But one of the problems with using the min-cut is that it can result in trivial solutions. Can you think about why this min-cut problem results in trivial solutions? The answer is very simple: when A consists of, say, only one pixel, and B consists of another region of pixels, the summation may be restricted to a smaller set of edges and hence may have a smaller value when you sum them all up. Which means min-cut could favor partitions where there's only one pixel on one side, which may simply be an outlier and may not really form a good segmentation of the image. How do you overcome this problem? It is overcome by a method called normalized cut, which improves on the min-cut problem. The normalized cut is defined as Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V). Let's see what these are. The association is again defined as a sum of weights: assoc(A, V) is given by assoc(A, A) + cut(A, B), the sum of the weights of all edges touching A. What you've done by dividing by assoc(A, V) and assoc(B, V) is to ensure that the denominator accounts for the total edge weight attached to each of these regions. So if there was only one pixel in one region, then one of these quantities will become very high, as against another partition with many more pixels, where the denominator will pull the value down.
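The cut, association, and normalized-cut quantities just defined can be computed directly for a tiny weighted graph. The edge weights below are made-up numbers chosen so that node 4 is a weakly connected outlier: plain min-cut prefers slicing it off alone, while the normalized cut prefers the balanced split.

```python
def cut(w, A, B):
    """Sum of weights of edges crossing from set A to set B."""
    return sum(w.get((i, j), w.get((j, i), 0.0)) for i in A for j in B)

def assoc(w, A, V):
    """assoc(A, V) = assoc(A, A) + cut(A, V \\ A): all weight touching A."""
    return sum(w.get((i, j), w.get((j, i), 0.0)) for i in A for j in V)

def ncut(w, A, B):
    V = A | B
    return cut(w, A, B) / assoc(w, A, V) + cut(w, A, B) / assoc(w, B, V)

# Two tight clusters {0,1} and {2,3}, plus an outlier node 4 hanging
# off node 3 by a very weak edge.
w = {(0, 1): 1.0, (2, 3): 1.0, (1, 2): 0.8, (3, 4): 0.05}
balanced = ncut(w, {0, 1}, {2, 3, 4})
lone_pixel = ncut(w, {4}, {0, 1, 2, 3})
```

The raw cut value of isolating node 4 is the smallest (0.05 versus 0.8), so min-cut favors the trivial single-pixel partition; the normalized cut of that partition is close to 1 because cut(A, B) nearly equals assoc(A, V) for the lone pixel, so Ncut favors the balanced split instead.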
This ensures that you don't get trivial solutions such as what you saw with min-cut, but more useful solutions in practice. The normalized cut happens to be an NP-complete problem, and the approximate solutions used to solve it are slightly more involved; you can look at the paper "Normalized Cuts and Image Segmentation" if you're interested in how the method is actually implemented. That was the overall formulation; here is an example of how normalized cuts works. This is the input image, and you can see the various regions that get segmented using normalized cuts over the iterations of the method. There have been many other methods for image segmentation too, such as simply using k-means clustering; there have also been methods based on Markov random fields and conditional random fields, and many more. If you want a detailed understanding, please read Chapter 5 of Szeliski's book; there are more references at the end of this lecture too. But the question one may ask now is: do we really need all of these segmentation methods? Since the entire course is based on deep learning for computer vision, aren't there deep learning methods that can segment the image? Do we really need all of this? It is a valid question; for many tasks today, you do not need these methods. The purpose of covering these topics is to give you the context of computer vision in the pre-deep-learning setup. But some of these methods have also been used in the deep learning era. For those of you who are a little more aware of these areas: consider the earlier deep learning methods that were used for detection of objects, which means, given an image, you try to find an object and draw a bounding box around where that object lies. That is the task of detection, and there could be multiple instances of objects, or multiple objects, in an image. In the initial deep learning methods for object detection,
it was these kinds of segmentation methods that were actually used to generate what are known as region proposals, candidate regions where the objects could lie. As I said, early region-based CNN (R-CNN) style detectors relied on such region proposal methods; one such method, CPMC (Constrained Parametric Min-Cuts), generates region proposals using min-cut segmentation. We will talk about those in detail when we come to that section of the course, but that's the reason why image segmentation is useful for you to know at this point. Segmentation can also go beyond images into videos.
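As mentioned above, a simple baseline among the "many other methods" is k-means clustering of pixel features. Here is a minimal 1D sketch over raw intensities (Lloyd's algorithm); the cluster count, intensities, and initial centers are illustrative assumptions.

```python
def kmeans_1d(values, centers, iters=20):
    """Cluster scalar intensities around k centers (plain Lloyd's algorithm)."""
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Update step: move each center to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    labels = [min(range(len(centers)), key=lambda i: abs(v - centers[i]))
              for v in values]
    return centers, labels

# Toy "image" of intensities with two obvious groups: dark and bright
# pixels end up in separate clusters, i.e. separate segments.
pixels = [12, 10, 11, 200, 205, 198, 13, 202]
centers, labels = kmeans_1d(pixels, centers=[0.0, 255.0])
```

Unlike the graph-based methods above, this ignores spatial layout entirely; in practice one clusters richer per-pixel features (color, position, texture) rather than raw 1D intensity.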
Log in to save your progress and obtain a certificate in Alison’s free Computer Vision - Visual Features and Representations online course