We are about to complete this week's lecture, so let us summarize some of the things that you have seen so far, as we will be transitioning to deep learning from next week. What we have seen so far is a breezy summary of work in computer vision that took two to three decades. So we have covered a few topics, but we have not covered several more; an important topic that we have probably missed is part-based approaches, and so on. Hopefully we will be able to cover that in a future course, but we have tried to summarize the learnings that we have had so far, which will perhaps help us in transitioning to deep learning for computer vision.

One of the things we learned is that convolution is a very unique operation. It is linear and shift-invariant, and it has useful properties such as commutativity, associativity, distributivity over addition, and so on. So it is very unique in its processing of signals. It forms the basis of image operations, and it also forms the basis of the neural networks most commonly used in computer vision, known as convolutional neural networks. So convolution still remains in use to this day, even as part of the plumbing.

We have also seen the common pipeline in traditional vision tasks: we typically extract keypoints, or interest points, in images, which could be edges or could be points that have significant change in more than one direction, and we then extract descriptors out of these keypoints. This was a common theme you saw over the last week of lectures, at least. We also saw the idea of using banks of filters, such as steerable filters or Gabor filters, to get multiple responses from a single image, and then concatenating them for any further task or processing. We also saw that these descriptors are useful for tasks such as retrieval, matching, or classification.

If you had to abstract out the understanding we have so far, it is that in each of these methods we went from low-level image understanding to aggregation of descriptors at a higher level. We used banks of filters to capture responses at different scales and orientations, such as pyramids, steerable filters, Gabor filters, and so on; and then there were histograms, which could be considered as doing some form of coding, because you are trying to quantize different keypoints onto a common scale, or doing some kind of pooling of features to a common cluster centroid or a common codebook element.

So one could see that there are some similarities between how this processing happens and how processing happens in the human visual system. We at least briefly talked about the various levels of the human visual system, which also bear a similarity in trying to get different kinds of responses at different orientations and scales of the visual input, and then trying to assimilate and aggregate them over different levels. So there is a similarity here, although it was not by design; it was perhaps about solving tasks for computer vision. But there is a similarity in trying to get low-level features, probably features of different kinds at different scales and orientations, because choosing only one feature can be limiting for certain applications. So you want to use a bank of different responses, and then combine and assimilate them for further processing.
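To make these properties of convolution concrete, here is a minimal sketch, assuming NumPy and SciPy are available, that checks commutativity, associativity, distributivity over addition, and linearity on small random arrays; the sizes are illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
f = rng.random((16, 16))   # a toy "image"
g = rng.random((3, 3))     # one small filter
h = rng.random((3, 3))     # another small filter

# Commutativity: f * g == g * f
assert np.allclose(convolve2d(f, g), convolve2d(g, f))

# Associativity: (f * g) * h == f * (g * h)
assert np.allclose(convolve2d(convolve2d(f, g), h),
                   convolve2d(f, convolve2d(g, h)))

# Distributivity over addition: f * (g + h) == f * g + f * h
assert np.allclose(convolve2d(f, g + h),
                   convolve2d(f, g) + convolve2d(f, h))

# Linearity in the filter: f * (a * g) == a * (f * g)
assert np.allclose(convolve2d(f, 2.5 * g), 2.5 * convolve2d(f, g))

print("convolution properties verified")
```

Shift-invariance also holds, up to boundary effects: convolving a shifted image gives the shifted result of convolving the original.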
Another important thing we learned over the last few weeks is that there are applications for which local features are more important and the entire image may not matter. The entire image may be important for tasks such as image-level matching, for example an image-level search on one of your search engines, and there are tasks for which only local features are important, for example matching a certain keypoint, or finding correspondences between partially matching images, and so on. So it depends on the task. Stereopsis is about detecting depth in images; whether you want structure from motion, or whether you want to recognize an instance of an object rather than just recognize a class in an image, it depends on whether a local region matters or the full image matters.

We also saw that encoding, using methods such as bag of words, can make your image representation sparse. For example, if you had, say, ten cluster centers in your k-means for bag of words, it is possible that one of the images in your dataset had features belonging to only three of those cluster centers, and the remaining seven cluster centers had no occurrence in that particular image. This means your image would have a histogram where three of the bins would have some frequency counts, but the remaining seven bins would have a zero count. That leads to a sparse representation, with lots of zeros for that particular image. So encoding can result in that kind of representation for an image.

An important takeaway here is that many operators that detect local features, or even global representations of images for that matter, can be viewed as performing convolution against some template of features, because to detect keypoints, convolution is the key operation you are relying on, and it is followed by some kind of competition. For example, each of the cluster centers tries to win the votes of the features that correspond to it, and one of them wins. So there is some kind of competition, or pooling, of the results of the convolution operation, which leads to the next step, a higher-level understanding or description of the image.

We also find that the goal so far has been to learn descriptors and representations that make it easy for us to match. You do not want to spend too much time on matching; of course, some intelligence went into coming up with matching kernels and so on, but the key idea is to describe keypoints, and describe images, in such a way that a simple dot product or a simple matching kernel can be used to match images, or parts of images, or regions in images. These descriptors have some invariance to geometric transformations, to a certain scale, a certain rotation, a certain translation; in certain cases that is designed into the algorithm, and in certain other cases it may have to be learned.

This is a brief summary of many of the topics you have seen so far, put in an abstract, concise, succinct manner. What we are going to conclude with here is that we are going to move to deep learning. As I just mentioned, although not by design, deep learning seems to build upon some of these principles. Some of these will become clearer when we start discussing deep learning approaches.
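As a minimal illustration of this kind of encoding, the following sketch (codebook size, descriptor dimension, and the data are illustrative assumptions, with the codebook standing in for pre-computed k-means centers) quantizes the local descriptors of one image against a codebook and pools them into a histogram, which typically turns out sparse when the image's features occupy only a small part of feature space.

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.random((10, 128))                        # 10 cluster centers ("visual words")
descriptors = 0.5 + 0.02 * rng.standard_normal((40, 128))  # descriptors from ONE image, tightly clustered

# Assign each descriptor to its nearest codebook element (vector quantization).
dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

# Pool the assignments into a frequency histogram over the visual words.
hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
hist /= hist.sum()   # normalize so images with different numbers of keypoints are comparable

print(hist)          # most bins are zero here -> a sparse image representation
```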
But we see that this idea, detecting low-level responses of images to different kinds of filters, then aggregating them and building higher-level abstractions, and then reaching a point where the final representation becomes very simple for a task, seems to be very similar to the idea that deep neural networks use for solving vision tasks. Although this may not have been by design, it is similar in overall structure. The key difference between all the methods we have seen so far and what we are going to see in deep learning over the remaining weeks of this course is that in deep learning, all of this is done in a learnable manner, rather than us having to design it: which keypoints should I use, which descriptors should I use, should I use histograms of oriented gradients, or should I use local binary patterns? All of these become design decisions that are sometimes difficult, because they may depend on the task, and there was no complete knowledge of which kind of descriptor should be used for which kind of task. For example, for face recognition, would local binary patterns always be the feature of choice, or would something else be? This kind of complete understanding of which method to use for which task was not very well established, and deep neural networks have, in some sense, changed the game there by simulating a similar pipeline, but with the entire pipeline learned purely from data for a given task.
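To see the structural parallel, here is a minimal sketch of a small convolutional network, assuming PyTorch is available; the layer sizes and the ten output classes are illustrative assumptions. It has the same overall shape as the hand-designed pipeline, filter banks, pooling and aggregation, and a simple final representation for the task, except that every filter here is learned from data.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # learned "filter bank" applied to the input image
    nn.ReLU(),
    nn.MaxPool2d(2),                             # local pooling / competition among responses
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # higher-level filters over the aggregated responses
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 10),                   # simple final representation for the task (e.g. 10 classes)
)

x = torch.randn(1, 1, 32, 32)   # a toy 32x32 grayscale image
print(model(x).shape)           # torch.Size([1, 10])
```

The analogy is structural only: instead of choosing steerable or Gabor filters and a codebook by hand, the convolution weights and the final classifier are all trained jointly for the task, which is the theme of the coming weeks.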