Hands-On Item Collaborative Filtering

Hello everybody. Welcome to the Marketing Analytics course. This is Dr. Swagato Chatterjee from VGSOM, IIT Kharagpur, who is taking this course. We are in week 7, session 3, and we will discuss recommendation engines. In this particular video, I will do a hands-on session on item-to-item collaborative filtering.

To do that, I have opened a file called easy.R. It is "easy" because there is another file, taken from the Internet, whose code was very difficult for a newcomer to understand, so I have changed that code into an easier form.

The data looks like this. If you look at it carefully, the rows are users and the columns are various, I would say, FM radio channels or radio songs. A 0 or a 1 says whether a user has listened to a particular item or not. This is the table that you have to create: the users on the left side, the items in the columns, and 0 and 1 saying whether an item has been seen by a user or not.

The first thing is to read the data. So we will set the working directory to the source file location, and I will read the data. This is the smaller version of the data: I have taken only 19 users and only 13 variables. It comes from the sample.csv file, which is a small file of about 1 KB, but you can also do this with the bigger file. You can change the file name to, let us say, the lastfm matrix Germany file, and you will get the bigger data and can do all your calculations on that. But I have a time issue: the calculations take probably 3 to 5 minutes if I do them on the bigger file.

Just to give an example, if I read the bigger file, you see there are 1,257 observations of 286 variables, that is, 1,257 users in the rows and 286 columns.

The calculation will take some 1 or 2 minutes, and I cannot sit idle in a video for that long. That is why I am using the small file; you can use the bigger one as well if you like. So, given this data of 19 observations of 13 variables, I will start.

The first thing that we will do, among the various steps of this calculation, is to create a holder. Let us draw it. What is a holder? Here we have m customers and n products: products 1, 2, 3 up to n, and customers 1, 2, 3 up to m. These are my users and these are my products. In item-to-item filtering, what do we do? I take this item and, let us say, this item, and ask what the similarity between these two items is. Let us say items 3 and 10. The similarity is calculated in this way.

Let us say item 3 is professor X and item 10 is professor Y, and these are the students. The students have rated the professors or, in this case, purchased: various students have taken the courses of professor X and professor Y. For professor 3 the column is 1100110. For professor 10 the column is 01010110, and for another professor, professor 11, it is 01001001. Can you tell me, by looking at this, which two professors are closer?

Look at the students who took professor 3's class: how many of them also took professor 10's class? Comparing professor 3 and professor 10 student by student: this is a tick, this is a tick, this is a cross, this is a cross, and then matching, matching, matching, matching. So, counting 1, 2, 3, 4, 5, 6, 7, 8 students, the preferences for professor 3 and professor 10 match 6 times out of 8. Whoever has taken the classes of 3 has also taken the classes of 10, and whoever has not taken the classes of 3 has also not taken the classes of 10. That kind of matching has happened 6 times.
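
The tick-and-cross counting just described can be sketched in a few lines. This is a Python illustration rather than the lecture's R file, and the three 0/1 vectors are hypothetical, chosen only so that the counts come out as 6 of 8 and 3 of 8; they are not the columns written on the board. The helper name `simple_matching` is also an assumption.

```python
def simple_matching(x, y):
    """Share of positions where two 0/1 vectors agree (both 1 or both 0)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

# Hypothetical enrolment columns for eight students (1 = took the class).
prof_3 = [1, 1, 0, 0, 1, 1, 0, 0]
prof_10 = [1, 1, 1, 1, 1, 1, 0, 0]   # agrees with prof_3 in 6 of 8 positions
prof_11 = [1, 1, 1, 0, 0, 0, 1, 1]   # agrees with prof_3 in 3 of 8 positions

sim_3_10 = simple_matching(prof_3, prof_10)
sim_3_11 = simple_matching(prof_3, prof_11)
print(sim_3_10, sim_3_11)
```

With these made-up columns the pair that agrees more often gets the larger value, which is all the similarity matrix needs to record.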

But for professor 3 and professor 11: this is a match, this is a match, this is not a match, this is a match, and the remaining four are not matches. So they match 3 times out of 8. That gives me an idea that 3 and 10 are closer than 3 and 11. That is how we are going to create a similarity matrix.

Now I will use the cosine measure to do my calculation. The cosine of items 3 and 10 is the sum of x times y over the users, divided by the square root of the sum of x squared times the square root of the sum of y squared, where x and y are the two item columns. That is what I am trying to create as my cosine matrix. If I find the cosine of 3 with 10, what will the cosine of 10 with 3 be? The same thing, because the formula treats x and y in the same way.

If I convert these values into a matrix, how will it look? Each item against each item: n products here and n products here, so an n by n similarity matrix gets created. Fair enough. Now what is the similarity of the diagonal elements? It will be exactly 1, because I am fully similar to myself. And the matrix is symmetric: whatever value is in the upper triangle also appears in the lower triangle. Whatever value is here comes here, and so on. So you do one set of calculations and you need not do the other set.

That is how we reduce the number of calculations. This is what I will try first: find out how close each item is to every other item.
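
As a small sketch of the cosine measure and of the symmetry shortcut, the Python below computes only the lower triangle of the item-by-item matrix and mirrors it. It assumes the standard cosine formula (dot product over the product of the two vector lengths); the toy item columns and the function names are hypothetical, and returning 0 for an item nobody has used is an assumption of this sketch.

```python
import math

def cosine(x, y):
    """Cosine similarity: sum(x*y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an item with no users is similar to nothing
    return dot / (norm_x * norm_y)

def item_similarity(columns):
    """n-by-n similarity matrix; each pair is computed once and mirrored."""
    n = len(columns)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):          # lower triangle, diagonal included
            value = cosine(columns[i], columns[j])
            sim[i][j] = value
            sim[j][i] = value           # symmetric counterpart
    return sim

# Hypothetical data: three items, each a 0/1 column over four users.
items = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]]
sim = item_similarity(items)
```

The inner loop runs to `i` rather than to `n`, which is the "do one set of calculations, skip the other" idea from the passage above.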

To populate this matrix, I have to create the matrix holder. We are naming it A. A is a matrix where all the entries are 0, with 12 rows and 12 columns. Why? Because there are 13 variables, and the first variable is the user name, which we are not using. The remaining 12 are items, and that is why I am creating a 12 by 12 matrix. Right now A is a blank matrix; nothing is written in it. Fair enough.

Now I change it into a data frame. What will the names be? Remember what this matrix looks like: the row names and the column names will both be the item names, that is, the column names of the original raw dataset.

So the column names of A are the column names of data.germany with the first one dropped. Why minus 1? Because the first entry is user, which I will not take. If I copy just the column-names part and run it, it gives me 13 entries with user as the first name. That one I will not take, which is why I write minus 1: I do not get user, and I get the remaining 12. I put them as the column names of A, and I also say that the row names of A are the same as the column names of A. How does A look at this moment? All the entries are 0, because we have yet to do the calculations, but the rows and columns carry the product names. Fair enough.

Now I will calculate the cosine; or rather, here I have calculated the correlation. Correlation is also fine, I think. The higher the correlation, the higher the similarity.

So for i from 1 to 12 and for j from 1 to 12, A[i, j], that is, the i-th row and j-th column, is the correlation of what? The correlation of data.germany's (i + 1)-th column and data.germany's (j + 1)-th column. Why plus 1? Because the first column is the user column.

To give an idea, let us say I want to populate this value in A, the one for abba and ac.dc. It is in the second row and the third column, so i is equal to 2 and j is equal to 3. Fair enough. Then what is column i + 1? It is something like 0 0 0 1 0 0 0 0 1 0 0 0 1, which is the third column of the data. If you check carefully: 1, 2, 3, this column. And with j equal to 3, I take the fourth column. What is the fourth column? In data.germany the fourth column is ac.dc. I find the correlation of these two columns. That is why it is i + 1 and j + 1, and after finding the correlation I put the value into the (i, j)-th cell of A.

Actually, I should not have run both loops from 1 to 12. It would have been better to run i from 1 to 12 and j from 1 to i, because the matrix is symmetric: the lower triangle and the upper triangle hold the same values. I could have calculated only the lower triangle and filled in the rest by symmetry, but I have not done that because this is a small dataset and it does not take time.
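
The holder-and-loop step can be sketched as follows. This is a Python analogue under stated assumptions, not the lecture's R code: the four-row table is hypothetical (only the column names are taken from the lecture), `pearson` is a hand-written helper, and the loop uses the lower-triangle shortcut that the passage says would have been better. Python counts from 0, but the `+ 1` offset that skips the user column is the same.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy table; the first column is the user id.
header = ["user", "a.perfect.circle", "abba", "ac.dc"]
rows = [
    [1, 1, 0, 1],
    [33, 0, 1, 0],
    [42, 1, 0, 0],
    [51, 0, 1, 1],
]

names = header[1:]                                   # drop "user"
A = {r: {c: 0.0 for c in names} for r in names}      # zero-filled holder

for i in range(len(names)):
    for j in range(i + 1):                           # lower triangle only
        col_i = [row[i + 1] for row in rows]         # + 1 skips the user column
        col_j = [row[j + 1] for row in rows]
        value = pearson(col_i, col_j)
        A[names[i]][names[j]] = value
        A[names[j]][names[i]] = value                # mirror by symmetry
```

Keying the holder by item name plays the role of setting the same row names and column names on A.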

So A[i, j] is equal to the correlation of this column and this column. If I run the three lines together, I get all the correlation values. See, it is 0.13043 here and 0.13043 here: it is symmetric about the diagonal. All the diagonal elements are 1, and the other correlation values are written here. Fair enough. This is something that I save, because I will use it later. If it is a huge dataset, it is better to save it.

Now what will I do next? Remember that correlation values are between minus 1 and plus 1. I want to scale all of them so that they lie between 0 and 1. This is called normalization; we have done this before. If the old value is x, the new value is x minus the minimum of x, divided by the maximum of x minus the minimum of x.

Let us say the minimum of x is minus 0.73, and the maximum I know is 1, because we are getting 1 on the diagonal. Then the new value is x minus minus 0.73, divided by 1 minus minus 0.73; in other words, x plus 0.73, divided by 1.73. Fair enough. When x is 1, the new value is 1.73 by 1.73, which is 1. When x is minus 0.73, the numerator becomes 0 and the denominator stays, so the new value is 0; the minimum maps to 0. When x is 0, something in between, the new value is 0.73 by 1.73. So everything now lies between 0 and 1.

Now what next? I create a matrix which will hold the recommendation scores, called r.
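
The min-max normalization described above can be checked with a tiny Python sketch. The three input values are the ones used in the worked example (minimum minus 0.73, a value of 0, maximum 1); the function name `min_max` and the choice to return zeros when all values are equal are assumptions of this sketch. For a full similarity matrix, the minimum and maximum would be taken over all its entries.

```python
def min_max(values):
    """Rescale values to the 0-to-1 range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # assumption: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max([-0.73, 0.0, 1.0])
print(scaled)   # the minimum maps to 0, the maximum to 1, and 0 to 0.73 / 1.73
```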

So I wrote r is equal to data.germany and then r is equal to r minus r. Okay, sorry, this is something that I have probably done wrong. r is equal to data.germany, and then, since there are 13 variables, I set columns 2 to 13 of r equal to 0. That is the right thing to do. I am sorry. If I run this, what do I get? The 19 users, the items, and currently all the item values are 0. I will fill in the recommendation scores for these items.

What did I do till now? I created only a similarity matrix, in which the higher the value, the more similar the two items, and the lower the value, the less similar. I started with m users and n items, and from there I created the n by n similarity matrix. Now I have created the recommendation matrix, which is currently blank, with m users and n items.

Now choose any one cell. Let us say the user number is 3 and the item number is 5. The recommendation score R(3, 5) will always be 0, irrespective of anything else, if in the purchase history the purchase (3, 5) is 1. That is, if the user has already purchased the item, I will never recommend it. So the first thing I check is whether this value is 1 in the purchase data. If it is 1 there, it will always be 0 here. That is what I have written here.

Look carefully: for i from 1 to 19 and for j from 1 to 12, where 1 to 19 covers the observations, the customers, and 1 to 12 covers the items, the calculation happens only if the value in data.germany, the original purchase data, is not equal to 1. Otherwise the calculation does not happen, and the value in the recommendation matrix remains 0.
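
A minimal Python sketch of this step, assuming a hypothetical two-user purchase table: it builds the blank recommendation matrix (user ids kept, every item score set to 0) and lists the cells that would go on to be scored, namely those where the purchase value is not 1. The names `purchases`, `r` and `to_score` are illustrative only.

```python
# Hypothetical 0/1 purchase table; the first column is the user id.
header = ["user", "item_a", "item_b", "item_c"]
purchases = [
    [1, 1, 0, 0],
    [33, 0, 1, 0],
]

# Same shape as the purchase data: keep the user id, zero every item score.
r = [[row[0]] + [0.0] * (len(row) - 1) for row in purchases]

# Only unseen items are scored; already-purchased cells stay at 0.
to_score = [
    (i, j)
    for i, row in enumerate(purchases)
    for j in range(1, len(row))        # start at 1 to skip the user column
    if row[j] != 1
]
print(to_score)
```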

If by chance this value is equal to 1, meaning that in the purchase data the third customer has seen the fifth product, then I will not recommend that product anymore. If the third customer has not seen the fifth product, then I may recommend it, based on the calculations that we are going to do; but if he has already seen it, I will never recommend it. That is why this check is there.

Now, if he has not seen it, then what? I simply find out, for this item, which are the top similar items. From the similarity matrix I get this information. Let us say that items 6, 9, 11, 13 and 17 are the five items most similar to item number 5. Should I recommend item number 5 to user 3? Listen to this carefully.

I will recommend Main Hoon Na to my friend. Here we are doing item-to-item collaborative filtering, so all the similarity is measured on items. I will recommend Main Hoon Na to one of my friends only if I have seen that my friend has also seen movies similar to Main Hoon Na. Listen to this carefully once more: I will recommend a movie, Main Hoon Na, to my friend or to anybody only if I know that that person has also seen movies which are similar to Main Hoon Na. Fair enough. So which movies are similar to Main Hoon Na? If Main Hoon Na is item 5, then items 6, 9, 11, 13 and 17 are similar to it.

Now, whether he has seen items 6, 9, 11, 13 and 17, from which matrix will I get that? From the purchase matrix. Whether user number 3 has seen the sixth item is in the cell where the item is 6 and the user is 3; for the ninth item, the cell where the item is 9 and the user is 3; for the eleventh item, the cell where the item is 11 and the user is 3.

That is how I find out whether this guy has seen these items. Let us say I see that he has seen items 6, 11 and 13 and has not seen items 9 and 17. Fair enough. Then his net similarity score is the similarity of 6 plus the similarity of 11 plus the similarity of 13, each measured against item 5. Let us say the similarity of item number 6 with item number 5 is 0.73, item number 11 with item number 5 is 0.42, and item number 13 with item number 5 is 0.33. Their sum is his net similarity score for item 5.

On the other hand, let us say in another situation I see that this guy has not seen the closest item but has seen the other four, and the corresponding values are, let us say, 0.62, 0.42, 0.32 and 0.29. In the first case the total is 0.73 plus 0.42, that is 1.15, plus 0.33, that is 1.48. In the second case the total is 1.65, which is more than 1.48. So in the second case the chances of recommending item 5 will be higher, even though he has not seen the closest movie. That is what we are doing.

Now look carefully at what I do. I set k from the sort of a[, j]. What is a[, j]? j is the product. Let me quickly check this with i equal to 3 and j equal to 5. What is the value of data.germany[i, (j + 1)]? It is 0, so I can go ahead: he has not seen the item before. Now a[, j] is the j-th column of the similarity matrix a. These are the similarities of the j-th product, here the fifth product, with the other products. You see that the similarity of the fifth product with itself is 1, and these are the rest. If I want to find out which product is the most similar, I have to sort this column. Fair enough. So that is what I am doing, sorting. After sorting, the first item comes out as 1, because I am most similar to myself, but that has no meaning.

The next 5 similarities are like this. Fair enough. That is why I take positions 2 to 6: k is the sorted a[, j] at positions 2 to 6. So k holds the similarity scores of the 5 items that are closest to item number 5.

If I want the item numbers, I use order. Order means the serial numbers: the order of minus a[, j] gives me the positions in decreasing order of similarity. The fifth item is most similar to the fifth item, then comes the fourth item, then the ninth item, then the first item, then the third item, and so on. These are not similarity scores; this is the ordering of the similarity scores. If I put positions 2 to 6 of that into l, it gives me the identities. So the fourth, ninth, first, third and sixth items are the items most similar to item number 5, and the corresponding similarity scores are these. So k stores the similarity scores and l stores the similar items.

Now I have to check, of the items stored in l, which have been seen by the i-th customer; in this case i is equal to 3, the third customer. How will I know? From data.germany[i, l + 1]; plus 1 because the first column in the data.germany dataset is the user. This is my history, h.

Here h is 0 0 0 0 0, so he has not seen any of these movies. Then what will the recommendation score be? 0, because the score is the sum of the history times the similarity scores, divided by the sum of the similarity scores. That is what we calculate and put in r[i, j + 1]. This particular value stays 0 because all my h entries are 0. If any one of them were 1, the corresponding value would have been calculated.

This is the operation that I do for all my products and all my users. If you did not understand this, please stop the video, go back and understand it properly. So I run that. Now I have created the r matrix, and some of the values come out as 0.
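
The scoring of a single cell can be sketched in Python as below. It follows the steps in the passage (top 5 neighbours, their similarities k, the user's history h, then sum of h times k over sum of k), with one labelled deviation: instead of sorting and dropping the first position, it removes item j explicitly, so a tie with the self-similarity of 1 cannot push the item itself into the neighbour list. The similarity column and the three users are hypothetical.

```python
def score_item(purchases, similarities, j, top_n=5):
    """Recommendation score of item j for one user.

    purchases    -- the user's 0/1 history over the items (no user column)
    similarities -- similarity of item j to every item, on a 0-to-1 scale
    """
    if purchases[j] == 1:
        return 0.0                                   # already seen: never recommend
    # Item indices by decreasing similarity, leaving out item j itself.
    ranked = sorted(range(len(similarities)), key=lambda t: -similarities[t])
    neighbours = [t for t in ranked if t != j][:top_n]   # the lecture's l
    k = [similarities[t] for t in neighbours]            # the lecture's k
    h = [purchases[t] for t in neighbours]               # the lecture's h
    total = sum(k)
    if total == 0:
        return 0.0
    return sum(hv * kv for hv, kv in zip(h, k)) / total

# Hypothetical similarities of the fifth item (index 4) to seven items.
sims_to_item_5 = [0.40, 0.10, 0.35, 0.60, 1.00, 0.30, 0.20]
user_a = [1, 0, 0, 0, 0, 1, 0]   # has seen two of the top-5 neighbours
user_b = [0, 1, 0, 0, 0, 0, 0]   # has seen only an item outside the top 5
user_c = [0, 0, 0, 0, 1, 0, 0]   # has already seen the fifth item

score_a = score_item(user_a, sims_to_item_5, j=4)
score_b = score_item(user_b, sims_to_item_5, j=4)
score_c = score_item(user_c, sims_to_item_5, j=4)
```

Here `score_a` is (0.40 + 0.30) divided by the sum of the five neighbour similarities, while `score_b` and `score_c` are both 0, for two different reasons.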

A 0 can come for two reasons: either this guy has already seen the movie, or he has seen none of the top five similar movies. These are the two reasons why a value can be 0. Whenever he has seen at least one similar movie, some positive value comes here.

Next, I create the recommendations. These are the recommendation scores, but I need the actual names of the movies. So I create a recommendation matrix which has 19 rows and 5 columns. Why 5 columns? Because the top 5 recommendations will go there. In the normal case, on Netflix and so on, 5 recommendations come, so I take the top 5. The row names will be the user names; that is how the user names come here.

These are the user names, and the columns would be V1, V2, V3, V4, V5, so I name them Reco1, Reco2, Reco3, Reco4 and Reco5, the 5 recommendations for each user.

What will the recommendations be? For i from 1 to 19, look carefully at what I do. I first find the corresponding row of r. Let us say I am trying to find the first guy's recommendations. This is the recommendation matrix that I created before. If I look at the first row, the first entry is the user, forget about it, and the rest are the scores. They are all 0, so this will not help me.

But let us say i is equal to 2. Then what is the i-th row of r? The user name is 33, and he has a value of 0.16 here, 0.38 here and 0.59 here. So his chance of seeing alexisonfire is the highest, then comes ac.dc, and then a perfect circle. So these should be the recommendations for him. Fair enough. How will I do that? I order the row.

What I am doing here is ordering the i-th row over columns 2 to 13, because the first entry is the user, and I am ordering it in reverse, that is, in decreasing order.

When i is equal to 2, this is how it looks: 9 comes first, because alexisonfire was the ninth item, then the third entry, then the first entry, and so on. Then I add 1. Why plus 1? Because I am finding the names of these movies from the original dataset, and the original dataset's first column was user. So it is the tenth column, fourth column, second column, third column and fifth column, and the names of those columns give me the corresponding movie names.

So I take the column names in this order: alexisonfire, ac.dc, a perfect circle, and then abba and adam.green. For the last two the scores are all 0, so only the first three really matter. Then I take only the top 5, entries 1 to 5, because I am giving only five suggestions. So in this case the first three suggestions come from positive scores, and the last two suggestions are effectively arbitrary: they are just the first zero-score items in column order.

These are my suggestions. If I run it for all the users, it populates the suggestions for each of them: for 33 these will be the suggestions, for 42 these, for 62 these, and so on. That is how I create item-to-item collaborative filtering.

So once more, what are the steps? First, I had m users and n items, with 0/1 purchase data. From there I calculated the n by n similarity matrix. Then, from the purchase data, I calculated the recommendation matrix of m users and n items. Here the values are the recommendation scores.

The scores give the strength of the recommendation, and based on them, for each of the m users, we suggested the top 5 recommendations, R1 to R5: whichever score is highest is the first recommendation, the next highest is the second recommendation, and so on.

These are the steps that I have followed in item-to-item collaborative filtering. We have done it for a smaller dataset. There is a bigger dataset, the lastfm matrix Germany file, which is publicly available. You can use it and find out whether you can create a recommendation engine with the full data of 1,257 observations of 286 variables. It will take some time, but you will get a result. So thank you very much. I will come back with user-to-user, or user-based, collaborative filtering. Thank you.
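
As a closing sketch of the last step, the Python below picks the top 5 item names from one user's row of scores. The three positive scores and the five named items follow the lecture's example for user 33; the remaining item names (`item_5` and so on) are hypothetical placeholders, and the function name is an assumption. Python's `sorted` is stable, so zero-score ties stay in column order, which matches the order of suggestions read out in the lecture.

```python
def top_recommendations(item_names, scores, n=5):
    """Names of the n highest-scoring items, ties kept in column order."""
    ranked = sorted(range(len(scores)), key=lambda t: -scores[t])
    return [item_names[t] for t in ranked[:n]]

# Twelve items; only the names mentioned in the lecture are real.
item_names = [
    "a.perfect.circle", "abba", "ac.dc", "adam.green",
    "item_5", "item_6", "item_7", "item_8",
    "alexisonfire", "item_10", "item_11", "item_12",
]
scores_user_33 = [0.16, 0.0, 0.38, 0.0, 0.0, 0.0, 0.0, 0.0, 0.59, 0.0, 0.0, 0.0]

recos = top_recommendations(item_names, scores_user_33)
print(recos)
```

Only the first three suggestions are backed by positive scores; a stricter variant could stop at the last positive score instead of padding the list to five.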