Data Mining Technique

Hello, everybody, welcome to the Marketing Analytics course. We are in week 8, session 4, and we are discussing Market Basket Analysis. Till now, we have discussed why Market Basket Analysis is needed and the basic, simpler version of the algorithm. Now let us talk about the Apriori algorithm, the more advanced version, which can be used for larger data sets.

I was here in the presentation, yes. I was talking about finding the rules in two stages. First, find all itemsets with a specified minimal support. So you do not do it for every combination; you only find those which have the minimal support.

An itemset is just a specific set of items. Remember the rule about apples and cheese: when apples and cheese occur, honey also occurs. There, apples and cheese is the itemset. So first I choose the combinations which meet a minimum cutoff, the minimum support. What is that? Let us say they should occur in at least 1 percent of the whole data set; combinations occurring in less than 1 percent we will not even consider. That is stage one.

Second, use these itemsets to help generate the rules. So you do not create every rule; you filter based on certain conditions. Having done stage one, we have considerably narrowed down the possibilities and can do reasonably fast processing of the large itemsets to generate candidate rules. So this is what we do: we find the proportion of market baskets containing these items, and then we work from there.

Now some terminology. A k-itemset is a set of k items. So a three-itemset is a set of three items, a one-itemset is a set of one item, and a two-itemset is a set of two items.
An itemset has s percent support if s percent of the transactions in the data set contain all of its items together. And what is the minimum support? The Apriori algorithm starts with the specification of a minimum support. So we do not use all itemsets; we only use those itemsets which have a minimum number of occurrences. Let us say my database has 20 million transactions, as I told you, and I say an itemset should occur in 1 percent of them. What does that mean? It means the itemset should occur 200,000 times. 1 percent is actually a very big number, so we do not even say 1 percent; we say 0.01 percent. 0.01 percent means 1 in 10,000, that is 10 to the power minus 4. So 0.01 percent of 20 million is how much? 2,000. So an itemset should occur 2,000 times; only then will we go ahead and create the rule, otherwise I will not create the rule. That is the first job.

What is a large itemset? It does not mean an itemset with many items. It means one whose support is at least the minimum support; that is how we use the definition. So L k is the set of all large k-itemsets in the database, and C k is a set of candidate large k-itemsets. The algorithm generates C k so that it contains all the k-itemsets which might be large, and then eventually checks which of them really are. So C k contains L k, and L k is the part of C k whose support actually meets the cutoff. And what is large? Large means an itemset whose support is at least the cutoff. We will use this terminology to define the algorithm.

Now, look at the three-itemsets here. With these products, one of the three-itemsets is a, b, h, and you can see a, b, h occurring here, here, and here.
So in ID numbers 1, 18 and 19, in these three cases, a, b and h occur together. So a, b, h is a three-itemset, because three items are there. Its support is 3 by 20, which is 15 percent. Similarly, a, i is a two-itemset whose support is 0 percent; it occurs nowhere. And so on. Now, if I take the cutoff as 10 percent, then a, b, h will be a large itemset.

So if the minimum support is 10 percent, then b is a large itemset, but b, c, d, h is a small itemset, because it occurs in only 5 percent of the transactions. b on its own, count 1, 2, 3, 4, 5, 6, is a one-itemset which occurs 6 times out of 20, that means 30 percent, so it is a large itemset when the minimum support is 10 percent.

So, the Apriori algorithm is for finding large itemsets efficiently in big databases. This is what the algorithm says. First, find all large one-itemsets; that means, find out which items occur in at least 10 percent of the transactions, if that is your cutoff. Then, for k equal to 2, while L k minus 1 is non-empty, k plus plus. So as long as L k minus 1 is not blank, you build the next level and increase k; otherwise you stop.

So if two items do not occur together 10 percent of the time, then three items including them will not occur 10 percent of the time. That is, if a and b together do not occur 10 percent of the time, can a, b, h occur 10 percent of the time? No. So if a, b is not a large itemset, then a, b, h cannot be a large itemset.

So you look for further itemsets only among those whose smaller itemsets were large at the earlier level. That is, when I analyse two-itemsets, I will only use those products whose one-itemset was in the large category.
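To make the support idea concrete, here is a minimal Python sketch. The ten baskets are made up for illustration (they are not the 20 transactions on the slide), and the names `support` and `is_large` are just names chosen here. It computes support as a fraction of baskets and checks that no subset of a, b, h has lower support than a, b, h itself.

```python
from itertools import combinations

# Hypothetical toy baskets, NOT the 20 transactions from the slide.
transactions = [
    {"a", "b", "h"},
    {"a", "b"},
    {"b", "c"},
    {"a", "b", "h"},
    {"c", "d"},
    {"b", "h"},
    {"a", "c"},
    {"b", "c", "d", "h"},
    {"a", "b", "h"},
    {"d"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def is_large(itemset, baskets, min_support):
    """An itemset is 'large' when its support reaches the cutoff."""
    return support(itemset, baskets) >= min_support


triple = ("a", "b", "h")
print("support of", triple, "=", support(triple, transactions))

# Every 2-item subset occurs at least as often as the full triple,
# which is why a small pair rules out every triple that contains it.
for pair in combinations(triple, 2):
    print("support of", pair, "=", support(pair, transactions))
```

On this toy data a, b, h is large at a 10 percent cutoff, while a, i (which occurs nowhere) is not.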
Say I have item 1, item 2, item 3, item 4 and item 5. Item 1 has 30 percent support, item 2 has 5 percent, item 3 has 25 percent, item 4 has 9 percent and item 5 has 12 percent. Fair enough.

Now I want to find the two-item combinations. If item 2 occurs only 5 percent of the time, can 2, 4 occur more than 5 percent of the time? No, because the cases where 2 and 4 occur together are a subset of the cases where 2 occurs. So the percentage for the pair can never be higher than the percentage for item 2 alone. So I will not look at any combination which contains item 2. I will also not look at any combination which contains item 4. I will look only at combinations of items 1, 3 and 5. That is important to understand. So items 1, 3 and 5 are the ones I will consider, knowing that 1 and 3 occurring together can never be more frequent than either of them individually.

So from here, which two-itemsets will I check? I will check 1, 3; I will check 1, 5; and I will check 3, 5. Fair enough. Now let us say 1, 3 is around 20 percent, 1, 5 is around 9 percent and 3, 5 is around 11 percent. Now, what is the support of 1, 3, 5 occurring together? See, one subset of this is 1, 3 and another subset is 1, 5. 1, 3 occurs 20 percent of the time and 1, 5 occurs 9 percent, so 1, 3, 5 cannot be higher than 9 percent. It cannot, because out of the cases where 1, 5 occurs, only in some does 3 also occur.

So I will not consider this combination, because it is not a large one; 10 percent was my cutoff. And the other pairs would give me this same combination of three items.
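The numbers in this five-item example can be traced with a short Python sketch. The supports are the ones quoted in the lecture; the variable names are just chosen here, and the bound on the triple is only an upper limit taken from its two-item subsets, not a measured support.

```python
from itertools import combinations

MIN_SUPPORT = 0.10  # the 10 percent cutoff used in the lecture

# Single-item supports as quoted in the lecture example.
item_support = {1: 0.30, 2: 0.05, 3: 0.25, 4: 0.09, 5: 0.12}

# L1: keep only the items that reach the cutoff.
L1 = sorted(i for i, s in item_support.items() if s >= MIN_SUPPORT)

# C2: candidate pairs are built only from items in L1.
C2 = list(combinations(L1, 2))

# Pair supports as quoted in the lecture example.
pair_support = {(1, 3): 0.20, (1, 5): 0.09, (3, 5): 0.11}
L2 = [pair for pair in C2 if pair_support[pair] >= MIN_SUPPORT]

# The support of {1, 3, 5} can be no higher than that of its
# weakest two-item subset, so this minimum is an upper bound.
triple = (1, 3, 5)
upper_bound = min(pair_support[p] for p in combinations(triple, 2))
triple_can_be_large = upper_bound >= MIN_SUPPORT

print("L1:", L1)
print("C2:", C2)
print("L2:", L2)
print("upper bound for", triple, "=", upper_bound)
print("can it be large?", triple_can_be_large)
```

Items 2 and 4 never appear in a candidate pair, and the triple is ruled out without counting it, because one of its subsets, 1, 5, is already below the cutoff.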
So I do not consider any three-itemset. The only itemsets I have are these: these are my L 1, these are my L 2, and L 3 is empty.

So the algorithm says, whenever L k minus 1 is non-empty, only then do you go ahead; otherwise you do not increase k. And then you do the same thing at each level: you count how many times each candidate set occurs, as I showed in the basic version, and that reduces the number of candidate sets. So you can run this algorithm on the database and you will find that these are the three-itemsets which have 10 percent support.

How do we generate the four-itemsets which have 10 percent support? You form combinations from these, and if they have 10 percent support, you consider them as large four-itemsets. One possibility is to note all the items involved, a, b, c, e, f, g, h to r, generate all possible combinations of four, and then find out which of them have enough support. But that will again take a lot of time. But hold on, we can easily see that a, b, c, e could not have 10 percent support, because a, b, e is not one of our three-itemsets. a, b, e was not in our list of three-itemsets at all, so a, b, e cannot have 10 percent support, and then a, b, c, e also cannot have 10 percent support.

So I have to create combinations in such a way that all the subsets of the new combination are already here in the list. So a, b, c has to be there, and every other three-item subset of the four-item combination has to be there too; only then do we go ahead and find out that combination's support, otherwise we do not.

The same goes for several other of these subsets. So, enforce that itemsets are always arranged lexicographically, as they already are on the left.
Only generate k plus 1 itemsets from k-itemsets which differ in the last item; that is the algorithm they are giving. And then the only ones generated will be a, e, g or w and n, q, r, t, f; you can check that. So only the last item changes: I keep the first three items as they are, in lexicographic order, and change the last item.

So, this trick is guaranteed to capture the itemsets that have enough support. It will still generate some candidates that do not have enough support, so we still have to check them in the pruning step. And it is particularly convenient to implement on standard relational-style transactional data. So that is what happens in the background of the Apriori algorithm.

There is an example given here; I will not spend time on it, you can check it out. And then we try to find out what kind of rules we get. What can you say about the coverage of apples and milk? We can invent several potential rules: if a basket contains apples and bananas, it also contains milk. So the support of a, b is 40 percent; what is the confidence of this rule? You have to find that out.

This material has been taken from these two links; you can go and read more details there. Some appendix material is also in the presentation; you can look at it and try it out on your own. So, thank you for being in this video. We have discussed the algorithm and we have discussed the usage of Market Basket Analysis quite a bit. In the next video we will actually see how to do it in a hands-on way. Thank you very much.
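The lexicographic trick described earlier, joining k-itemsets that differ only in the last item and then checking that every k-item subset is already large, can be sketched in Python as below. This is a minimal sketch under assumptions: the function name `generate_candidates` and the sample three-itemsets are hypothetical, not the ones on the slide, and the candidates it returns would still need their support counted against the data.

```python
from itertools import combinations


def generate_candidates(large_prev):
    """Build candidate (k+1)-itemsets from large k-itemsets.

    `large_prev` holds sorted tuples, all of the same length k.
    """
    prev = sorted(set(large_prev))
    prev_set = set(prev)
    candidates = []
    for i, first in enumerate(prev):
        k = len(first)
        for second in prev[i + 1:]:
            # Join step: same first k-1 items, different last item.
            if first[:-1] != second[:-1]:
                break  # sorted order: no later tuple shares this prefix
            candidate = first + (second[-1],)
            # Prune step: every k-item subset must itself be large.
            if all(sub in prev_set for sub in combinations(candidate, k)):
                candidates.append(candidate)
    return candidates


# Hypothetical large three-itemsets (illustration only).
L3 = [
    ("a", "b", "c"),
    ("a", "b", "h"),
    ("a", "c", "h"),
    ("b", "c", "h"),
    ("a", "e", "g"),
]
print(generate_candidates(L3))

# Here a, c, e and a, c, h join to a, c, e, h, but a, e, h is missing
# from the list, so the candidate is pruned before any counting.
L3_missing = [("a", "c", "e"), ("a", "c", "h"), ("c", "e", "h")]
print(generate_candidates(L3_missing))
```

Keeping the tuples sorted is what lets the inner loop stop at the first prefix mismatch, which is the reason for enforcing lexicographic order in the first place.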