Hello, everybody, welcome to the Marketing Analytics course. This is Dr. Swagato Chatterjee from VGSOM, IIT Kharagpur, who is taking this course. We are in week seven, and we are discussing recommendation engines. In this video, and probably in the next video also, we will discuss certain examples of recommendation engines and their applications with real-world data.

In this particular dataset, we have movie ratings that people have given for movies. I also have the genres of the movies, and I want to find out how I can recommend certain movies based on genre and on each user's interest in those genres. The data could have been collected from, let us say, IMDb or Netflix or something like that.

If I read the data, it looks like this. There are two datasets, which we will merge. One is the movies dataset, which holds the movie ID, the title of the movie, and the genres. The other is the ratings dataset, with around 1 lakh 5 thousand ratings; it has user ID, movie ID, rating, and a timestamp for when the user gave the rating.

How many unique users are there? If I take ratings dollar user ID and find the unique values, there are 668 unique users, numbered 1 to 668. And how many movies? There are 10,329 movies. So that is what I have.

Now, based on that, I have to find out which movies should be recommended. If somebody has already rated a movie, he will not watch that movie again, so I have to find a similar movie which he might like. The liking data, whether he will like or dislike a movie, will come from the ratings that this user has provided.

So I will load a library called recommenderlab, and ggplot2. With recommenderlab, whatever we did before, the item
based and user-based collaborative filtering, this library will do on its own, quickly. For bigger data, it helps. So recommenderlab and ggplot2 are the two libraries that I am calling. If you have not installed them, you have to install them first. How? You can write install dot packages and then the package name, or you can go to the install option here and type the package name, and that will install the package. I have installed the packages beforehand, and I am just running this here.

The first thing is processing of the data: I will break the data into genres. For that, I am creating a genres object, which holds only the genres column of my dataset, and then I am using a library called data dot table. With this library, I am splitting the genres using a string split function. This string split will split the string in genres comma 1, that means the first column of genres, based on this particular sign. Whenever this sign occurs, it will split there, with stringsAsFactors equal to false, and it will create a dataset.

So what does genres2 get? If you check genres, the first movie is adventure, animation, children, comedy and fantasy; in genres2 that has been broken into adventure, animation, children, comedy and fantasy in separate columns, and the rest of the columns are blank. If I set the column names of genres2 to 1 to 10, it is clearly seen that I am breaking the genres of each movie into separate columns.

So these are my various genres. If I find the unique values across these 10,329 observations of 10 variables, I get 18 different kinds of genres. That is the genre list that we have, and based on this genre list I will create a matching, that is, a code for each particular genre. So, if
this is the genre of the first movie, then I will get a matrix whose number of rows is the number of movies, 10,329. So, 10,329 rows and 18 columns, each column being one genre. The value will be 1 if that particular movie falls in that genre and 0 if it does not, and that data will come from this genres2 matrix. That is what I am going to create now.

What I am doing is creating a matrix of zeros with 10,330 rows, because the first row will hold the column names, and 18 columns. The first row of genre matrix is the genre list, and the column names of genre matrix are also the genre list. If you check the genre matrix right now, it is basically all zeros.

Then, for i from 1 to the number of rows of genres2, that means 10,329, and for c from 1 to the number of columns of genres2, that means 1 to 10, what will I do? If it is matching, then 1, otherwise 0. If I run this quickly and then drop the first row, what do I get in genre matrix2? I get something like this: the first movie is adventure, animation, children, comedy; the second movie is adventure, children and fantasy; and so on.

Now, this part you can do in a different way. If you have a better algorithm for that, you can do it your way. The goal is the same: I want to know which movie has which genre, and I am creating a mapping for that.

Next, for c in 1 to the number of columns of genre matrix2, which is 18, I am converting each column of the genre matrix to integer values. If I run this, it does nothing but change all the values to integers.

Now, what will I do? I will create a search for movies by genre. This is off topic,
but this is something that helps me in developing further. I am creating a data frame called years from the movie titles: years is equal to as data frame of the movie titles, with stringsAsFactors false. So years looks like this; it is basically the movie title column.

Then I am using the library data dot table, to do what? To take a substring. I define a substring-right function, which takes a substring counting from the right. Using this function, I will take only the years: this 1995, this 1995, this 1995. It is like the RIGHT function, if you have used Excel. Counting from the right, the year occupies the second to the fifth character. So I take five characters from the right and then keep positions 1 to 4 of those, which creates a subset, and years gets all the year values of the titles. Again, you can do it in Excel if you want, but I am doing it here.

Next, I will build the same thing for the movies. I am creating a search matrix by binding the first column of the movies data, that means the movie ID; next, a substring of the second column of movies, basically only the movie name; then the years; and then the genre matrix. When I put these together and set the column names, this is the search matrix that I get: movie ID, title, year, and then the genre matrix. I save this so that I can use it later.

So this is how the search matrix looks: movie ID, title, year and the genres of the movie. You can create a pivot on this matrix, or you can search like this: subset the search matrix with action equal to 1, that means it is an action movie, and year equal to 1995, and ask what the titles are. If I find out,
these are the movies which came out in 1995 under action. If you want to go further, let us say I want action equal to 1 and comedy equal to 1 and year 1995. These are the movies which are both action and comedy. So you can create those kinds of searches.

What I will do next is create a binary rating. Creating a user profile is my next objective, so I am creating binary ratings: anything 4 or 5 is high; 3, 2, 1 is low. So binary ratings is equal to ratings, and then if the value is greater than 3, put 1, otherwise put minus 1. Like means 1, dislike means minus 1.

It will run for some time; let us see how long. The loop is for i in 1 to the number of rows of binary ratings, and binary ratings has about 1 lakh rows, so it takes quite a bit of time. It is now at 28,000.

While it runs: why am I doing this? Sometimes, if there are multiple levels of ratings, let us say 1, 2, 3, 4, 5, creating similarity becomes very difficult. So 1 and minus 1 is more or less a dichotomy: if you like, 1; if you dislike, minus 1. And why did I take greater than or equal to 4, that is 4 or 5, as high? Because oftentimes you see that reviews are a little bit positively skewed nowadays. If they are positively skewed, then 1, 2, 3 is considered low and 4 and 5 is considered high. We are at 98,000 now; it runs up to about 1,05,000. Okay, it is over, so I have created the binary ratings.

How do the binary ratings look? Like this: 1 or minus 1, nothing else. This code is a loop; you could have done a simple if-else instead, which would have taken much less time. Anyway.

Then I convert this binary rating: I will convert the binary matrix to the
correct format. The format is like this: I run this and it creates binary ratings2, whose first column is the movie ID. Correspondingly, there are 669 variables, that is, the movie ID plus one column per user. All these NAs mean that the user has not rated that movie. If they have given a rating, it is either positive or negative: 1 means positive, minus 1 means negative. So, basically, I have 10,325 movies and 669 columns, holding 1 and minus 1 values. It is actually a large matrix.

Then, if there is an NA, put a 0 there. So I am putting 0 wherever the value is NA, and then I remove the movie ID column. That means I get 10,325 observations, which are the movies, and 668 variables, which are the users, and there are three values, 0, 1 and minus 1: 0 means there was no rating, 1 means he liked it, minus 1 means he disliked it. That is the binary rating matrix that I have created.

Now, remove the rows that are not rated from the movies dataset. There are certain movies which are not rated by anybody. If they are not rated by anybody, why should I use them? So we remove them, and we remove the same rows from genre matrix2 also: any movie which has not been rated has to be removed from genre matrix2 as well. That is why I am creating genre matrix3. From genre matrix2 to genre matrix3 we go from 10,329 to 10,325 rows, so there are four movies which were not rated, and we removed them. Similarly, movies2 has 10,325 rows; there were four movies which were not rated by anybody, and we removed them. Fair enough; that is how we reduce the dataset so that the calculation takes less time.

Then what? Calculate the dot product of the genre matrix and the ratings matrix and obtain the user profile. So, understand what the genre matrix is. The genre matrix is 10,325 by 18 genres, whether each movie belongs to each genre,
and then what is my rating matrix? The rating matrix is 10,325 movies by the various users giving ratings. Now, if I want to find out whether a user is fond of comedy, what will I do?

Take one of the 668 users from this binary matrix. Let us say user number 2, because his data is showing here: he has seen this movie and not liked it, he has not liked this movie either, and if you come down further, he has liked this movie. So some movies user number 2 liked, and some movies he did not like.

Take movie number 14, which he liked. If I go to the genre matrix, I see that movie number 14 is a drama movie, probably only drama. Then take the ones he did not like: movie number 3 and movie number 5. In the genre matrix, movie number 3 is a comedy, and movie number 5 is also a comedy. Movie number 3 is probably romance also; movie number 5 is not romance. So movies 3 and 5 are both comedies which user number 2 did not like.

From this information I can find his net liking, the overall liking of a genre. How would I do that? I look at which movies he has seen and which genres they belong to, and from there I find out how he rated them, plus 1 or minus 1. That is what I am going to do in the next set of calculations.

I am creating a result matrix which is 18 by 668. Why? That means 18 rows, one per genre, and 668 columns, where each column is one consumer, or one user. Each entry is basically a sum of 1s and minus 1s over the movies he has seen in a particular genre: 1 if he liked a movie, minus 1 if he disliked it. If the net sum is positive, then overall he likes that genre; if the net sum is
negative, then overall he dislikes that genre. That is what I am populating here, quickly, and I get this result matrix.

Now, if I look at this matrix carefully, user number 1 likes the first genre most, with 28, then the sixth genre, the eighth genre and probably the 16th genre. These are the genres that user number 1 likes. For user number 2, it is probably the second genre and the ninth genre.

Now, see, these two users differ. User number 1 is a movie buff; he watches more movies, so even a genre that is not his favorite can score as high as the most-liked genre of the second user, which is 9 here. So this 9 and that 9 have two different meanings. For user 2, the 9 means that this is the genre he likes most, but he does not watch many movies. For user 1, the 9 reflects that he watches a lot of movies: in comparison with 28, 27, 27, a 9 is very small, so it is not his favorite genre, even though the score is 9. To handle that situation, I have to normalize.

So I am doing that: I am converting the values to 0 and 1. If a value is smaller than 0, it becomes 0; if it is greater than 0, it becomes 1.

Now I am creating the dataset which will be used by the recommendation engine, that is, by the recommenderlab library. I will create a rating mat using a library called reshape. The rating mat, the rating matrix, is a large matrix which looks like this. Let me just put it up here; it is nothing but a reshaped version of the dataset that we have had till now. Let it come.

If you check it carefully, it maps user ID and movie ID to a value, which is the rating. In the rating matrix, the user IDs are on the x axis and the movie IDs on the y axis. It is taking a little more time and the view is not coming up properly, so I will not focus on that. And then the
method is UBCF, with the similarity calculated as cosine similarity. For nearest neighbors, remember, earlier we took five neighbors; here we are taking up to 30 neighbors. The library is recommenderlab.

We are converting the rating matrix to recommenderlab's sparse matrix. This is specific to this particular library; you have to use it, you have no choice. Then, the similarity of the users: similarity of rating mat rows 1 to 4, method cosine, which is user. So I am building a user-to-user similarity matrix, and if I find the similarity scores for the first four users, this is how it looks. I have taken the first four; you can take the whole 1 to 668 and it will give you 1 to 668. I have taken the first four for simplicity, and this is the image of the similarity matrix that I get. The more yellowish a cell is, the better; we would have to check the red colors once more.

Then I can also compute the similarity between the first four movies, so I can do movie-by-movie similarity as well. For the first four movies, this is the similarity: see, 0.9, 0.95, 0.91. So the first four movies are probably very similar to each other.

Now, explore the values of the ratings. Vector ratings is equal to as vector of rating mat at-the-rate data, not dollar data; this is a different format, for a sparse matrix. Then I take the unique of vector ratings. These are the various rating values that you get: 5, 4, 3, 4.5, 1, 2 and so on. I am creating a table of them, and it shows how many times each rating value occurs.

Now, I will only take the vector ratings which are not equal to 0. Why not equal to zero? Because zeros
are NAs, if you remember. So I take the non-zero values and make them a factor, and if I plot them, it is basically a histogram of how many of each kind of rating I am getting.

Then, if I explore the viewing behavior in a similar way, and I will not spend time on this, we will just check it, I get the viewing behavior of the top five movies.

Then I visualize the matrix as a heat map: the heat map of the first rows and columns, where the rows are users and the columns are items. The darker a cell is, the stronger the preference, based on the dataset that we have.

Now, coming back once more to the rating mat, if you remember it: this is actually the one that I will use for UBCF in the recommenderlab library. I am normalizing it first, and after normalization these are the values that you get, with item columns and user rows. The heat map shows the level, from 2 to minus 2, for the various items and users.

Then I use UBCF. What does UBCF mean? User-based collaborative filtering. The number of nearest neighbors is equal to 30, and the method is equal to cosine; you can change cosine to correlation, et cetera. That is how you get the model, and you can look at the model details. The model details say that it was built on a 668 by 10,325 rating matrix of class real rating matrix, with this many ratings, and that it has been normalized by centering on rows.

Now, suppose you want the top 10 recommendations. For the first person, I am taking rating mat 1; if you change this to 2, you get the second user, and with 1 to 10 you get the recommendations for all 10 users. So, with n equal to 10, the top 10 list for one user is something
that is saved here in recom. If you check recom, the items, the ratings and the item labels are written there.

Next, if you want to see the recommendations, this is the recom list. Recom underscore list is, for the first person, the IDs of the various recommended movies. To obtain the recommendations from this recom list, I look up the names, and recom result gives me the names of the movies for the first user. That is how we do it with UBCF; you can go through the code properly.

Now, to evaluate: you could evaluate with n equal to 1, n equal to 3, that is 1 nearest neighbor, 3 nearest neighbors, or 5, 10 and 20. Let us say 5, 10 and 20; these are the three values I will check. How will I run it? Just like this: the good rating is set to 5, and I run this. First it takes 5 nearest neighbors, then 10 nearest neighbors, and then 20 nearest neighbors, and we will see the evaluation scores right now.

If I check the evaluation results, you get them here: this is the evaluation result for 5, 10 and 20, with the corresponding true positive rates, true negative rates, et cetera.

The reference is this particular link. You can get a better discussion of this topic at that link; I have taken the code from there, and the dataset was probably first made publicly available there.

And this is how we create a recommendation engine with a ratings dataset, with a bigger dataset. In the next video, I will also show you how to create a recommendation engine with a smaller dataset, which is more manageable. Thank you very much. I will see you in the next video.
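
To recap the user-profile step from this video in code, the following is a minimal sketch, not the script used in the lecture. It is written in plain Python rather than the R used in the lecture, and the toy movies and ratings, the function names `genre_matrix`, `binarize` and `user_profiles`, and the `|` genre separator are all assumptions made for illustration. It builds the 0/1 genre matrix, converts each rating to +1 (greater than 3) or -1, sums those values per genre for each user, and then reduces the net scores to 0/1. Treating a net score of exactly 0 as 0 is also an assumption.

```python
# Toy data, made up for illustration (not the lecture's dataset).
movies = {  # movie_id -> genre string; the "|" separator is an assumption
    1: "Adventure|Animation|Children|Comedy|Fantasy",
    2: "Adventure|Children|Fantasy",
    3: "Comedy|Romance",
    5: "Comedy",
    14: "Drama",
}
ratings = [  # (user_id, movie_id, rating)
    (1, 1, 5.0), (1, 2, 4.0), (1, 3, 2.0),
    (2, 3, 2.0), (2, 5, 3.0), (2, 14, 4.5),
]


def genre_matrix(movies, sep="|"):
    """Return the sorted genre list and one 0/1 row per movie."""
    genres = sorted({g for text in movies.values() for g in text.split(sep)})
    rows = {
        movie_id: [1 if g in text.split(sep) else 0 for g in genres]
        for movie_id, text in movies.items()
    }
    return genres, rows


def binarize(rating, threshold=3.0):
    """Like (+1) if the rating is above the threshold, otherwise dislike (-1)."""
    return 1 if rating > threshold else -1


def user_profiles(movies, ratings):
    """Sum +1/-1 scores per genre for each user, then reduce them to 0/1."""
    genres, rows = genre_matrix(movies)
    net = {}
    for user_id, movie_id, rating in ratings:
        score = binarize(rating)
        profile = net.setdefault(user_id, [0] * len(genres))
        for k, in_genre in enumerate(rows[movie_id]):
            profile[k] += score * in_genre  # dot product, one term at a time
    # Net score above 0 means the user likes the genre overall.
    binary = {u: [1 if v > 0 else 0 for v in p] for u, p in net.items()}
    return genres, net, binary


genres, net, binary = user_profiles(movies, ratings)
for user_id in sorted(net):
    print(user_id, dict(zip(genres, net[user_id])), binary[user_id])
```

In this toy data, user 2 disliked two comedies and liked one drama, so the net score is negative for comedy and positive for drama, which is the kind of reasoning walked through for user number 2 in the video.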