User Collaborative Filtering

Hello everybody, welcome to the Marketing Analytics course. This is Dr. Swagato Chatterjee from VGSOM, IIT Kharagpur, who is taking this course. We are in week seven and we are discussing recommendation engines. Till now we have talked about item-to-item collaborative filtering. In this video I will talk about user-to-user collaborative filtering: how can I find out whether two users are similar or not, and how do we deal with that?

We will use the same dataset, but there is a separate file for this, called user base cf.R, which I am opening now. Here we will work out how similar two users are to each other and, based on that, try to find out whether certain movies, or in this case songs, can be recommended to a particular customer or not.

The dataset is the same. The first job is Session, Set Working Directory, To Source File Location. Then I read the dataset, which is the same sample.csv file. It is a smaller version of the bigger dataset and has 19 observations and 13 variables: 19 observations of various users, and 13 variables, which are the 13 columns. The first column is the user column itself. Every column from the second to the last records whether a particular movie, or in this case a particular song, has been consumed (listened to) by a user or not.

So in these 0's and 1's, 0 means that the user has not seen the video and 1 means that he has seen it, or in this case that he has listened to the song.

First things first: we keep only the cases where rowSums of data.germany, columns 2 to 13, is greater than 0, that is, rowSums(data.germany[, 2:13]) > 0. So, what is rowSums?
rowSums is the row-wise summation: you take each row, one at a time, and sum the values in it.

If all the values in a row are zero, then this person has not seen any of the movies in this list. If a user has not listened to any song or watched any movie in the list, then whatever similarity matrix you build, you have no historical data for that person, and in this context we do not have the person's demographic data either. If I have no data about that person, I cannot recommend him anything.

So that is the first small job we are doing here, which we did not do in the previous video. There we were working column-wise, so we could have checked the column sums instead: if a column sum is zero, that movie has not been seen by anybody, so it has no meaningful similarity with any other movie, and we could have dropped it. That is one way.

This is the problem with collaborative filtering that we discussed in the first video of this week. Collaborative filtering does not work when the user is new or when the product is new, because you have no historical data for them, and both user-based and item-based collaborative filtering rely on historical data. If historical data is absent, you cannot use collaborative filtering.

Now, for whom is rowSums greater than zero? If I copy the part I have highlighted, paste it in the console and run it, you see that some entries are FALSE: for example the first row, and some rows in the 13 to 16 range. So there are some people in this dataset who have not seen any movie, or in this case have not listened to any music.
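The row-sum check described above, together with the row subsetting it feeds into, can be sketched in R as follows. This is a minimal sketch on a made-up toy table, not the course's sample.csv; the object name toy and the column names song_a to song_c are hypothetical.

```r
# Toy listening table (hypothetical data): the first column is the user ID,
# the remaining columns are 0/1 flags for whether the user listened to a song.
toy <- data.frame(
  user   = c(1, 33, 42, 51),
  song_a = c(0, 1, 0, 0),
  song_b = c(0, 0, 1, 0),
  song_c = c(0, 1, 1, 0)
)

# Row-wise sum over the song columns only (column 1 is the user ID).
# The comparison yields one TRUE/FALSE per user.
has_history <- rowSums(toy[, 2:4]) > 0

# Keep only the rows where the flag is TRUE; leaving the slot after the
# comma empty means "all columns".
toy_active <- toy[has_history, ]
```

Users with an all-zero row drop out, so every remaining user has at least one listen to compare against other users.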
That means I have to drop them before I go ahead with any kind of modelling. So I am saying: take only the rows where the value is TRUE. I am doing subsetting here. What did I write? data.germany, and then the square bracket (the "third bracket"). The square bracket means a subset of data.germany. Inside it I have written a comma: what comes before the comma selects rows and what comes after it selects columns. Nothing is written after the comma, which means all the columns. Before the comma, in the row position, is the highlighted condition. Wherever the highlighted value is TRUE the row is kept, and wherever it is FALSE that person is dropped.

If I run this now, the data drops from 19 observations to 15 observations. That is fine, and after that I can do my mathematics.

Next I am finding the correlation. You can also use cosine or whatever measure you prefer. Correlation is the easiest one and we understand it well in marketing, but in the real world cosine is the most prominently used metric for the similarity between two groups, or in this case two vectors. So I take cor of t of data.germany, the transpose. Why have I taken the transpose? Because now I am working row-wise.

If you remember, this is my dataset: n users and m products, filled with 0's and 1's. Earlier I was comparing the columns with each other.
Now I compare rows of 0's and 1's: say 0, 0, 1, 1, 0, 0, 1 for one user and 0, 1, 1, 0, 1, 1, 0 for another, and so on. User one and user two are similar if their purchase behaviour is similar. Earlier the 0/1 vector was a column; now each row is the 0/1 vector.

The similarity can be measured in various ways: by correlation, by cosine, or by, let us say, Euclidean distance or standardized Euclidean distance. All of these are applicable, so whichever you want to use, you can. Here I am using correlation.

I am taking the transpose of data.germany[, -1]. The minus 1 means the first column, the user column, gets dropped, and I take the transpose of whatever remains. If I copy this and paste it in the console, you see what it looks like: the movie names have come into the rows, and the customers are the columns. This column is one customer's purchase behaviour, this is the next customer's, then the third customer's, and so on. For two of them I have vectors like 0, 0, 1, 1, 0, ... and 0, 0, 0, 1, 1, 0, ..., and then 0, 0, 0, 0, 0, 0, ... for the customer labelled 3, who is the second person in this case.

If I take the correlation between these columns, that gives me the similarity between these people. I find the correlation using the function cor (C, O, R), which comes with base R. The help for cor says that x is a numeric vector, matrix or data frame. If x is a numeric matrix or data frame, it takes every pair of columns and computes their correlation. If x is a vector, then you have to give another vector.
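The two ways of calling cor described above can be sketched as follows. This is a minimal sketch on a made-up 0/1 matrix; the object names listens and user_sim and the row labels u1 to u3 are hypothetical and do not come from the lecture's script.

```r
# Toy 0/1 matrix (hypothetical data): rows are users, columns are songs.
listens <- rbind(
  u1 = c(0, 0, 1, 1, 0, 0, 1),
  u2 = c(0, 1, 1, 0, 1, 1, 0),
  u3 = c(0, 0, 1, 1, 0, 1, 1)
)

# cor() correlates the COLUMNS of a matrix, so transposing first makes each
# user a column and yields a user-by-user similarity matrix.
user_sim <- cor(t(listens))

# The vector form gives the same number for a single pair of users.
pair_13 <- cor(listens["u1", ], listens["u3", ])
```

The result is symmetric with ones on the diagonal, which is the mirror-image pattern the lecture points out when the matrix is printed.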
So I run this and save the result in a. If I print a, this is what it looks like.

The values you are seeing are correlation values. The correlation between 2 and 2 is 1, between 3 and 3 is 1, 4 and 4 is 1, 5 and 5 is 1, and so on. All the off-diagonal elements are mirror images across the diagonal: if the entry on one side is -0.183483997, the entry on the other side is also -0.183483997, and if one is 0.4 the other is also 0.4. Those are the kind of values you get.

Now I convert a to a data frame, because that helps with some further analysis; a data frame has some extra advantages over a matrix, which we have discussed in a different class. After changing it to a data frame, I also set the column names. See, the user names are not coming properly here: 2, 3 and so on are not the user IDs, they are just serial numbers. The first one got dropped, so what has come is 2, 3, 4 and so on.

So I set the column names of a to data.germany$user. data.germany$user holds the actual user IDs of those people: 33, 42, 51, 62 and so on. Those should appear in the column names and also in the row names. If I run these two lines, first the column names change, and then, when I set the row names equal to the column names, the row names change as well.

Now you see the user IDs, 33, 42 and so on, along the columns and the same IDs along the rows. This is just to make sure that I can see the matrix properly and understand it. So, what did I get till now?
Previously, while doing item-to-item collaborative filtering, I was getting a similarity matrix of one item against another item. Here I am getting a similarity matrix of one user against the other users. And then I will have to find out, for each user, who the top five most similar users are.

First I apply a normalization. If you remember, the normalization we did before is: normalized x = (x - min(x)) / (max(x) - min(x)). What does it do? It bounds the normalized x between 0 and 1, so the minimum value becomes 0 and the maximum value becomes 1. The same thing is happening here: a = (a - min(a)) / (max(a) - min(a)). Since a is a whole table, the minimum and the maximum are taken over all of its entries.

So now the minimum entry is 0 and the maximum entry is 1, and that is how the whole of a has changed. If you look at it, all the values have changed and there is no negative value any more, so we can treat all of them in the same way. Fair enough.

The next step is to create, I would say, the movie importance values. I am creating a recommendation matrix called r. r is equal to data.germany. Why? Because then the dimensions remain the same. Then every entry that was in r is wiped out, all of them, and r$user is set to data.germany$user. So it looks like this: all the values are zero, the column names are the same as in data.germany, and the first column, user, is exactly the same as in data.germany.

Now, what will I do? For each user, let us say user number 33, how much is his probability, or propensity, or attraction, towards A Perfect Circle?
Then how much is it for ABBA, how much for AC/DC, and so on? I will populate all of those values and then sort them, one row at a time.

So first I populate. Suppose user 33 should see 10 movies, at various levels of attractiveness: those are the movies for which some value will appear in his row of this matrix. Then, when I pick up only the row of user 33 and sort it from highest to lowest, I get which movie has the highest score and which has the lowest. This part is similar to the previous video, but remember there are small tweaks.

Look carefully at what I am doing. For i equal to 1 to 15: why 1 to 15? Because there are 15 users at this moment. And for j equal to 1 to 12: why 12? Because there are 12 movies at this moment. In r also, if I remove the first column, you get a 15 by 12 matrix, and that is what I will populate.

Inside the loops I check data.germany[i, j + 1]. That is the historical purchase data: whether the ith customer has seen the jth movie. Why j plus 1? Because the first column in data.germany is the user column, so the (j + 1)th column is the one about the jth movie. So i comma j plus 1 means ith row, (j + 1)th column. What is the value there in data.germany?
If that value is 1, the ith person has already seen the jth movie, so I will not recommend it to him: the corresponding value stays 0 in my r matrix, or rather r data frame, exactly as in the previous video. But if data.germany[i, j + 1] is not equal to 1, then the rest of the code comes into the picture. What do I do?

I do not look at the historical purchase data of the ith person himself. Instead I look at the purchase data of the people who are closest to the ith person. Let us say i is equal to 1 and j is equal to 1. Then data.germany[1, 2] is 0, so the condition "not equal to 1" is true, and we go inside the if block with i equal to 1 and j equal to 1.

With i equal to 1, what is a[, i]? It is the ith column of a. These are the correlation values for user i. His correlation with himself is the highest, no issue with that, and the rest are his correlations with the other users. Fair enough.

If I sort this in decreasing order, from highest to lowest, the 1 comes at the top and then the other values follow. I will not take the 1, because that corresponds to himself. I take entries 2 to 6, the next five best. These are the top five similarity scores of the ith customer with his neighbours, that is, with the customers most similar to him.
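The neighbour selection described above can be sketched as follows. This is a minimal sketch with made-up similarity values; the vector sim_col is a hypothetical stand-in for one column of the similarity table, and the names k and l follow the lecture's usage only as far as the transcript shows.

```r
# Hypothetical similarity scores of one user (position 1) with eight users,
# standing in for one column of the normalized similarity table.
sim_col <- c(1.00, 0.24, 0.60, 0.10, 0.75, 0.05, 0.30, 0.78)

# Sort from highest to lowest; entry 1 is the user's similarity with
# himself, so entries 2 to 6 are the five nearest neighbours' scores.
k <- sort(sim_col, decreasing = TRUE)[2:6]

# order() returns positions instead of values, i.e. WHICH users those are.
l <- order(sim_col, decreasing = TRUE)[2:6]
```

Skipping the first entry assumes the user's own score is the unique maximum; if another user has exactly the same similarity of 1, the self-entry is not guaranteed to be the one that is dropped.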
Let us say there are five professors who are most similar to me in terms of publishing papers in certain journals: wherever I publish, they also publish. These five professors match my interests, my journal publication history and so on. They are the closest to me, and these scores are their closeness scores.

What is next? I also find their order, that is, who they are. This is the order of a[, i], again taking entries 2 to 6, and it gives their positions. The closest is 10. If I check the ith column of the original a, the 10th entry is 0.782779, which is the highest; that value appears in the scores and 10 appears as the corresponding position. Similarly the second best is the 5th entry, 0.7549: that value appears in the scores and 5 appears in the positions. So these are the positions and these are the similarity scores. l is the positions.

And then what is h? This is something you have to understand. h is the historical purchase data of these l persons. If I copy this part and run it, h tells me whether these five people have seen the movie: data.germany[l, j + 1] is whether these people have seen the jth movie or not.

So understand carefully what I am trying to do. First step: who is close to me? That comes from the a data frame. Then I check whether the ith customer has already seen the jth movie. If he has already seen it, no recommendation. If he has not seen it, then there is further analysis. What is the further analysis?
Check whether the customers close to i have seen movie j or not. That is my next question: have the customers who are close to me seen this particular movie? If they have seen it, then I will probably see it too. If none of them has seen it, then I will not.

To answer this question I first have to find out who the customers close to me are, let us say customers i'1, i'2, i'3, i'4 and i'5. In the case of customer number one, the closest customers, if you remember, are customers ten, five, three, two and four.

So when I am checking for customer number one, customers ten, five, three, two and four are the closest. Now I have to check whether they have seen the jth movie, here the first movie. If some of them have seen it, I will do something; if none of them has seen it, there is no recommendation. Say I find that the 10th person has not seen it, this person has not, this person has, this person has not, and this person has not. How do I get this data?

I get it from data.germany[l, j + 1]. l holds the row numbers of those customers, and j + 1 is the column number. Why j plus 1? Because the first column holds the user IDs. So l comma j plus 1 gives me the cell values where the column number is j + 1 and the row number is in l. l has five entries, so all five cell values are returned, and I convert them to numeric.

What is the numeric value? It is 0, 0, 1, 0, 0.
That means the 10th customer has not seen it, the 5th has not seen it, the 3rd has seen it, and the 2nd and the 4th have not. And then what am I trying to calculate? The entry of the recommendation matrix: how strongly should I recommend? The similarity scores are 0.78, 0.75, 0.6 and then 0.24 and 0.24. Whether this particular movie should be recommended to me depends on the product of this historical data, h, and k, which is the weightages. h is my actual observation and k is the weights, so I create a weighted average.

So it is this times this, plus this times this, and so on, divided by the sum of all the weights. Here I get 1 times 0.6 in the numerator and the sum of all the similarity scores in the denominator.

Now think about a different situation: what if two persons, instead of one, had seen the movie? If your closest user had also seen it, the numerator would gain 1 times 0.78, and obviously the score should increase. On the other hand, what if the closest person had not seen it, but these three had: this was 1, this was 1 and this was 1? Then your closest person has not seen it, but a larger number of people who are similar to you have, so the numerator would be 0.6 plus 0.24 plus 0.24.

So the numerator measures how strongly we should recommend. r[i, j + 1] is basically the sum of h times k. h stands for the purchase history of the users who are close to you.
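The weighted average described above can be sketched with the rounded numbers quoted in the lecture. This is a minimal sketch: the vectors are typed in by hand rather than read from the dataset, and the names h_alt, score and score_alt are hypothetical.

```r
# Similarity scores of the five nearest neighbours (values quoted in the
# lecture, rounded) and whether each of them has seen item j.
k <- c(0.78, 0.75, 0.60, 0.24, 0.24)
h <- c(0, 0, 1, 0, 0)

# Similarity-weighted average of the neighbours' 0/1 history:
# numerator 1 * 0.60, denominator the sum of all five weights.
score <- sum(h * k) / sum(k)

# Variant from the lecture: the two closest neighbours have not seen the
# item, but the other three have.
h_alt <- c(0, 0, 1, 1, 1)
score_alt <- sum(h_alt * k) / sum(k)
```

Because the denominator is always the sum of the weights, the score stays between 0 and 1 and grows both when closer neighbours have seen the item and when more neighbours have.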
k is the similarity scores, and the sum of h times k is divided by the sum of k. So I run this for the whole thing. It is quick, and it has given me some scores. A zero means either that nobody who is close to me has seen that movie, or that I have already seen it.

Remember the item-to-item case. Say you are deciding whether to recommend Main Hoon Na to me. You check whether I have seen Main Hoon Na; if I have, no recommendation. If I have not, you check whether I have seen other movies, say Shahrukh Khan's other movies, which are close to Main Hoon Na. Here the story is different. Here, if I have not seen Main Hoon Na, you check whether other users who are close to me have seen Main Hoon Na. So a zero means either that no users who are similar to me have seen Main Hoon Na, or that I have already seen it. Otherwise some score comes up. And what will I do with those scores?

Simple, the rest is as usual. I create a Reco matrix, which has all my user names and just five columns: Reco 1, Reco 2, Reco 3, Reco 4 and Reco 5. What will I populate it with? Column names. Let us say I am doing it for the first row of r.

So r[1, ] looks like this. Fair enough. If I sort it, the lowest values are zero and the highest is this number. I do not sort everything: r has 13 variables, so I take columns 2 to 13, and the user column vanishes. What I get are the recommendation scores. Which ones will I recommend? These five, the last five.
So what I have done is use order instead of sort. order gives me the positions of the values in sorted order, and I use those positions to look up the column names. When i is equal to 1, order gives me the positions for the first row.

I take the top five, the five highest scores, because I will recommend only five. With those top five positions I pick up the column names of data.germany: give me the column names of data.germany at those specific positions. For the first customer these are the suggestions, and I take the top five, that is, 1 to 5. This is the suggestion for i equal to 1.

Similarly it changes for i equal to 2, for i equal to 3, and so on, so I run this in a loop. If I run it, the recommendation table is populated. The first row is the top five based on the first row of the r matrix, the second row is the top five based on the second row of r, and so on. That is how you create a recommendation engine which is user-based. So what is the difference? Once more: in item-based collaborative filtering we check item-to-item similarity, and in user-based collaborative filtering we check user-to-user similarity.

We will do some more examples in the next videos. Thank you for being with me. I strongly suggest that you run this code once more with the bigger dataset. It might take some time, but you will discover more nuances, and sometimes some errors too. Let me know if you get stuck somewhere and we will try to solve it. Thank you very much for listening to this video. I will see you in the next video.
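For anyone re-running the ranking step from this video, it can be sketched as follows. This is a minimal sketch with made-up scores; the item names are hypothetical, and using decreasing = TRUE with positions 1 to 5 is an assumption about how the top five are picked, since the transcript does not show the exact call.

```r
# Hypothetical recommendation scores for one user; item names are made up.
scores <- c(item_a = 0.00, item_b = 0.41, item_c = 0.23, item_d = 0.00,
            item_e = 0.77, item_f = 0.10, item_g = 0.52)

# order() gives positions from highest to lowest score; keep the first five
# and use them to look up the item names.
top_pos  <- order(scores, decreasing = TRUE)[1:5]
top_five <- names(scores)[top_pos]
```

Running the same two lines for every row of the score table, inside a loop over users, fills one row of the recommendation table per user.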