Hello everybody, welcome to Marketing Analytics course, this is Doctor Swagato Chatterjeefrom VGSOM from IIT Kharagpur who is taking this course and we are in week 8, session2. We are discussing RFM analysis, Recency Frequency Monetary Analysis and I told inthe last video that this is a analysis technique which is used to do behavioral segmentationof customers based on their purchase data. And R stands for recency which is, how close,how recently you have purchased and F stands for frequency which means that, how many timesyou have purchased in a certain period of time within a certain period of time and thenmonetary is how much money you have generated, how much expenditure you have done withina certain period time. So, based on that, we will do segmentation, RFM score high meansis a profitable customer and RFM score low means he is a customer which you have to reactivate.So, here in week 8s 2 dotR file, if you open so, you will find out it is in the in yourin your file link in the in the in the in NPTEL portal that you have and there if youjust download it. Remember in this if you just carefully see the code here we have notused any data set data set is there in internet and I will right now show you how to connectwith Internet in terms of the data creation.So, you have to first check that the internet is connected which is there for me. And so,I will just run this line. So, read dot table and then the text. So, the when if if youcan go and check this particular link and see that there will be a file there is nothing,one particular page where lots of data has been written which is like a notepad, butever you write in the notepad similar thing is there. So, when you read that kind of adata, you use read dot table and if by chance the header names is at the top you shouldwrite header is equal to true and that is what you get.So, I have got customer ID, the date, the date is given as, this is what nine 9, 9,97 01 01 means, first January 1997. Number of purchases and dollar value so, RFM datahas been given to me. So, the first job that I have to do is to find out that how we recentthis data date is? The last purchase date is this that, how recent it is? So, I haveto convert this particular column which is the date column to our.Right now, if we just check the structure of my data set, see the date is basicallya integer, see it is an integer, so I have to somehow convert it to date. So, the firstthing that I am doing here is changing it to character.So, if I just write as character of D F 2 what will I get? Just copy this and pasteit here, and press an enter. See each of them has been converted to a character form. Now,only a character form written in a specific format can be changed to a date form usinga function called as dot date.So, as dot date watch like this, let us say I am talking about this one 1997 01 01 thatis the thing that I am asking. So, as dot date how will it work? as dot date 19970101So, you I am saying that this particular text, you change it to a date form, and how willyou read the text exactly whatever is written. So, for 1997 when the full year is written,you have to write percentage capital YSo, these are some of the details that I have probably discussed before I forgot, so I willjust write percentage capital Y is for 1997, percentage small y is for only 97, so thatkind of a. so, if it is y y y y then it is like that, if it is only y y, font is in smally small. Percentage m is for m m, month, percentage d is for date d d and percentage B and percentagesmall b if I am not wrong just check. B is for full month, let us say January or letus say December, when it is written in a full form you use this and when it is written insmall form so, Jan, Dec you will use small b and the rest of the will be exactly thesame format in which the text is written. So, I will follow that.So, here what is written? Percentage Y, Then percentage m and percentage d. So, if I justwrite then now it is reading it as a date, how will I know this is a date? Just do somedate operation, so, I am saying this minus 1. So, that will give me, see 31st December1996. So, that means one date before, so, now I know that this is how I have to writedate. So, that is I have written it like this.That as corrected df comma 2 that means df data sets, second column, this one you convertit to character and then change it to date using this format and then save it where?This date again to df’s second column. So, carefully understand once more. So, df secondcolumn that means the date column, df dollar date you could have written, df dollar dateyou change it to character using ascharacter, then you change use that character form asthe input in the as date as dot date function and you also have the formatting which ispercentage capital M, percentage y, percentageAnd when I run this, the whole thing has converted to dates. So, these are dates. So, if I justwrite str of df now, they will find this is a date type of format, fair enough. So, ifI can find out the date format now, how does my data set look like?The data set looks like this. And what is the dimension of the data? the dimension ofdata looks like 69659 comma 4 that means 4 columns and this many rows. So, this is somethingthat we have. Now, there might be certain duplicate data.So, I will remove the duplicate data first using the duplicated function and then I willagain check the dimension. Now, the dimension has stopped. So, there are lots of duplicatedata. So, that is something that was a problem in this data set. So, now I have 23,570 observationsof 4 variables, fair enough, nothing. So, then what will I do? I have the date column,I have todays current date, how would I know what is todays current date, so, the today'scurrent date can be found out by the system date.Now, here when I am actually recording this, today's date is basically, sys dot date so,this is today's date, which is 28 January 2020 on the date of recording. So, you mightget different value when you run this, when you see the values will be different obviously.So, you check your known system, whatever system date is coming that particular dateshould come now. Now, whatever be that case, a certain data you have given and currenttoday's date, that date and today's date difference is you are basically a measure, opposite ofthat is the measure of recency.So, I will say that UID dollar timedist, timedist stands for time distance, is system date minusUID dollar date. So, today's system date minus that particular and you will get differentdistance also.So, now I am getting 8427, 8416, your values will be different obviously, because you arerunning it in a different date. Now, you can say that, okay, for my this thing, I willfix it to a particular data from that date I will find out a distance, so that your datesand my dates are same does not matter, so this is something that you got us time distance.Now, remember in this data set, that time distances, higher is preferred or lower ispreferred? Yes, lower is preferred and then dollar value higher is preferred or loweris preferred? Higher is preferred very good. And then number of cds also? higher is preferred,very good. So, this is something that we have to understand. So, when I create the rankingbetween 9 to 0 or whatever, I have to keep that in mind, very well.Now, what next? Next, I am calling a library called tidyverse, so tidyverse if by chancethe package is not there, you have to install this. So, tidyverse, I will just call thislibrary first here. Okay, I am not, I am pretty sure that is not there in my system, so Iwill install it. So, if you have not installed it, you have to also install it so it willinstall all the supporting libraries as well, this kind of links will come this come, so,it will download certain libraries one by one, some are big, some are small it mighttake some time for some of these things, but otherwise it will be okay.So, it will download lots of libraries. This tidyverse will be used because the data setis big and sometimes you might want to do some data processing with this data whichis required and sudden DPLYR is a library that we have used if you remember at the earlystage. So, here there is a library called tidyverse which has some some some featureswhich can be used in creating. So, if you check the dependencies, it also has theseDPLYR or DDPLY, DDPLY is the actually the function but the DPLYR or PLYR this kind oflibraries will be there, they do basically, processing of columns.So, in a huge data set, you sometimes, instead of doing row operations one by one, you docolumn operation, one column at a time you just change it to something or find out amean or do certain operations, certain analysis with them. So, for those kind of piping andetc. kind of analysis, this tidyverse or DPLYR or PLYR packages are helpful.Now I am installing it for the first time, so that is why it is taking quite a bit oftime, but in your case, you should first see that what kind of libraries is required forfor your analysis, and then you should install them and then you can go ahead for your analysis.So, it might take one more minute to analyze this to get get all the files ready.Once I a I got this ready, what I will do is, in the next step, if we just bring thisI will just bring this below and then scroll this down you check what did I do in next,I use the mutate function, the same old DPLYR mutate function, to change what? To changeto work on the data set called UID. I will mutate, mutate means it will change the dataset and add certain columns on that. What I am doing? I am saying that recencyis equal to 10 minus ntile of time distance comma 10 .So, what is ntile? ntile is is theparticularly the function for which we are using this tidyverse, ntile is a functionwhich breaks a numeric column into deciles or quartiles or n number of tiles basically.So, ntile, so what it will do, it will actually sort it up from highest to lowest, and thenbreak it into 10 group either either highest to lowest or lowest to highest basically andthen it will highest to lowest in fact, and it will break in 10 groups, if I have writtenntile time distance comma 10, if I have written 4, then it could have broken into 4 groupsthat is four percen-tiles. Now, remember why did I write like 10 minusntile time distance 10. So, that is something that we have to understand.So, if I just now, let us say I had a data set, where I had timedist, I had recency andI had frequency, this is the data that I had. Now, I put ntile in it. I put ntile in it.,so this one is okay, this one the higher recency will get 9 and the lower recency will get0. Here the higher frequencies will get 9, lower frequencies will get 0, but here intime distance, the lower frequencies will get 9 and the higher frequencies will get0. So, sorry time distance higher will also get 9 and lower will be 0.So, how to convert this thing, this is 0. Now how to convert this thing, this is frequencysorry and this is monetary I am very sorry. So, for frequency higher frequency will get9 (monitor) lower frequency will get 0, for monetary higher monetary will get 9, lowermonetary will get 0, for time distance higher time distance will get 9, lower time distancewill get 0. Now, I have to convert it in some way recency column, but these guys will be0 0 and these guys will be basically 9 9 or something like that and then practically not9 and 0 will not also work. So, 0 to 9 is 10, so, that is fine, so 0 and 9. So, howwill I get here, so, I might want to put 9 minus 0 or something like that. So, the momentI do 9 minus that will work for me. So, let us see how the result comes and based on thatwe will do something.So here I have installed it I called the library. Okay, so, library has been called and thenyou just see ntile if I just write ntile ,ntile let say I write timedist dollar, so df dollartimedist, so UID dollar timedist comma 4, I will write 4, okay, so 4 categories it isbreaking.So, you say that the value that I am getting is 1, 2, 3 and 4, not actually 0 to 3.So, if I if I break it into timedist dollar 10 ,I will get 0 to 10 sorry 1 to 10, not0 to 9, but I want 0 to 9, so what do I have to do?The first thing that I have to do is this minus 1 that will give me 0 to 9, fair enough.So, there see there will be some values, which you are not getting here n-values which as0. So, because probably last values are very, very close to each other, you might get somevalues.So, for this one, which is 443, and then another 10, let us say 453, I will get a 0. So, 4,yeah, so see there is a 0 here. So, some zero’s will also be there, there is one 0 here, one0 here and so on, so, some zero’s will be there.So, that is what I am doing exactly for frequency, I am doing number of cds, I am breaking itto 10 groups and then minus 1, for monetary also number of groups comma 10 minus 1, sodollar value comma 10 minus 1, minus 1 means instead of 1 to 10, 10 deciles it will give0 to 9, 10 deciles, that is why. And here it is also gave him 1 to 10, so the momentI do 10 minus that it will it is automatically giving me 0 to 9, right? 1 to 10, 10 minus10 becomes 0, and 10 minus 1 becomes 9. So, the moment I write 10 minus this it is givingme the highest values, the highest time distance values will get the recency score as 0 andthe lowest time time distance values will get a recency score of 9, so that is whatI am making sure.And mutate means, make these things these changes and save it in the original data set.So, if I just run this one, it will save it in the original data set, this is my originaldata set, frequency, recency, frequency, monetary, three columns gets created. Each value willbe between 0 to 1, so this is something.So, if I just find out the the probably the summary of my UID you will see for recencyfrequency monetary, the lowest value is 0, the highest value is 9. Here also the lowestvalue is 0, highest value is 9, here also the lowest value is 0, highest value is 9.Now, there is some problem when we do that, but anyway.So, if I just draw the histogram of recency, you get the histogram as this, this is thisis the current histogram.So, if I do the same thing for monetary let us say, for monetary or frequency, this isthe histogram and monetary this is the histogram that is basically the problem that this ismaking making them into same number of people in each group, which might not be the case.For example, let us say, hist dollar if I talk about time. Okay, time distance is notnumeric, so just change it to numeric, yeah, so see the distribution looks like this, whichis not exactly not exactly straight, so it is fairly distributed. In some months, thereare probably more number of observations and some other months, so, this is something thatwe have to take keep into we should have a look on that.On the other hand, if I just do the same thing for let us say, dollar value, dollar valuethat is absolutely highly skewed, it is not at all distributed properly. So, often sometimesit might not be a good idea to break them into 10 equal deciles, but that is your subjectto your call you have to decide that how much so, I if I were you I would have given thatokay for this is 2, for this is 5 for this kind of data this is 100 probably the pointsthat I am assigning and in that case the last one the regression model based RFM analysiswould have worked better, but here that will not work that much better, so, that is somethingthat is also a very important decision model. So, if your variables are fairly distributed,you can do this kind of thing if it is not fairly distributed, you should do the otherone.So, anyway, so mine is fairly distributed and then what do I do I create RFM score whichis 100 into recency, 10 into frequency and then plus monetary, so that is the basic scoreof RFM that I am generating.So, if I run this then I get another column in UID which is nothing but a combinationof this score, so, 0 0 0 you are getting 0, 1 0 1 you are getting 101, 0 5 4 it , so themoment 54 is written, I know the recency score is 0, that is why it is coming 54, otherwisethat come 154 254 or something. So, 54 means, recency score is 0, the frequency score is5 and the monetary score is 4, so automatically I am getting RFM.And if I want to, let us say if I want to find out the top, so these are my top guys,a 999 guys, and then comes the 998 guys, we scroll down, so quite a few 999 guys becausemany people might have. Okay, so I I okay after 999 there will be 953 and so on. So,there will be you can just scroll it down there will be so many people. So, first thingwill be based on the last one and then the second one and the third one you will youcan add on them.Where is the histogram? How does the histogram looks like it is similar. So, it is similarto the profitability that you can generate from them, so, which is good this is the histogramof RFM. And you should also check as I told that the RFM you should find out that howthe histograms of the original variables are.Now, the another thing that I am doing right now is that if the distributions are not veryfairly distributed. For example, the histogram of cds, if I just check that is somethingvery skewed.And then histogram of dollar value that is also very skewed.But histogram of time distance is more distributed than them. So, what I am doing here that insteadof creating ntile, what I am doing is I am fine breaking them into intervals. So, findinterval dollar value, and what are the intervals? Below 10 it is 1, 10 to 30 it is 2, 30 to50 it is 3, 50 to 100 is 4, 100 to 125 it is 5. So, I am breaking based on this picturethat I got.So, if you see this is my dollar value, 0 to 50, all this huge line is 1, sorry 0 to10 the huge line is 1, 10 to 50 is 2. So, I am breaking here, these guys this big line,I am breaking into three groups, 1, 2 and 3. So, up to 50 there are serial number 1,0 to 10, serial number 2, 10 to 30, and serial number 3, 30 to 50, so I am breaking thisthing in three groups. And then breaking this one into two groups, 100 to 50 to 100, 100to 125, 125 to 150, same groups again and so on. So, when I do that, I get a betterdistribution of monetary.So, if I just run these now, you will see that monetary is more better distributed,this is a monetary score that I created now, and this monetary score is more distributedlike the previous one. So, this distribution and this distribution, more or less lookssimilar, at least at least in terms of the shape they are similar. So, now you can usethat kind of a score to create your RFM analysis, which is histogram.And the RFM score, the new RFM scores are this one the similar kind of RFM that yougot before similar RFM score, but remember this guy was 51 before now I am getting 54before now getting 51, this guy was 0 before not getting 1. So, I am changing this coursesomewhere other and I am getting certain values here. So, this is how we can also do RFM analysiswhich is more and more in line with the original data set that we have collected.So, that is all about it is very easy to implement that is one of the major reasons it is used.Another major reason of using it is probably that you do not need a lot of and you cando it in Excel as well, because the data set is big, and because there is not unique datasome data was duplicated and we are doing it in but in maximum cases it is not donein R or Python or something like that. In maximum case the RFM analysis done by Excelonly, so, they or probably SQL if it is a database that they have certain values, theyhave certain certain macros and etc. already written, which will convert this time, dateto time distance, time distance to monetary values, time distance to recency values andsome other to monetary values these kind of changes are already written and the RFM scorealgorithm is already written and this is a very very, very I would say commonly usedmethod in in the context of segmentation through purchasing behaviour.So, that is all for RFM analysis, we will continue with Market Basket Analysis in thenext video and thank you very much I will come back with next video.
Nous enverrons les instructions pour reinitialiser votre mot de passe à votre adresse e-mail associée. Veuillez marquer votre adresse e-mail actuelle.