Solving a Problem with Clustering Methods

Hello everybody, welcome to the Marketing Analytics course. This is week 3, session 4, and I will be discussing Segmentation, Targeting and Positioning. This is Dr. Swagato Chatterjee from VGSOM, IIT Kharagpur, who will be taking this course for you.

Up to the last presentation, we discussed how to do segmentation, targeting and positioning using clustering methods. Here we will actually work on a particular problem.

In your files there is a customer.csv data set. It has a serial number, the age of the customer, and male, which is a dummy variable where 1 means male and 0 means female. Then there is the income of the customer and the distance of the retail store from the customer's address; let us say it is retail data.

How do I know the address of the customer? I come to know it when they fill it in. This customer data has been tracked through the loyalty card they have. Whenever you buy something, you swipe your loyalty card to gain points, and when you swipe it I come to know about your data. When you registered for that loyalty card, you gave your address. I know the zip code of your address, or sometimes the Google location of the address as well, at least the location of the lane. From my retail store I can then find the distance using Google Maps.

This part will be done by a coder; it is not something that you will probably have to do yourself. You might have business analysts for it. If you are doing this for academic purposes, sometimes you have to do it on your own, but some other expert can do it using Google Maps. So that is number one, the distance. Then I have certain behavioral data.
Columns B, C, D and E are all demographic data about the persons, and I will not use them for my segmentation; I will focus on the behavior of the people. The F column is shopping expenditure (not shopping experience, sorry), in thousands per month. The G column is Nov, which means number of visits, in units. Then Pgro is what percentage of your purchase is in grocery. Then: what percentage of your purchase is in food and beverage (F&B)? What percentage is in FMCG? And what percentage is in apparel? There can be various other kinds of behaviors that we could find out, but I am focusing on these six. Based on these six we will do our cluster analysis, and then we will try to find out what kind of customers they are.

The first thing that I will do is set the working directory to the source file location. The file that we are working on is w3s3.r (it should actually be s4.r). Your global environment should be empty and your console should be empty. In that state we call this data. The data has 537 observations of 11 variables. The structure of the data shows that all of the variables are integer variables, so they are numeric. I do not have any issue there; I do not have to change anything.

Then I take a subset of the data, because I will work on the last six columns. This gives me a data set of 537 observations of 6 variables, which are the behavioral variables. The first thing that I do is create a hierarchical cluster. To do that, I create a distance matrix. The function is dist. What is the input? The input is the data. And what is the method?
The method is Euclidean, so it will create a distance matrix from every person to every person, 537 by 537, and that is what has been created here. So d is equal to dist of the data, and if I run this, I get d, which is a large distance matrix. You can print this matrix here, but 537 by 537 is huge, so there is no point in printing it. If you want to save it, you can convert d to a data frame and save it with write.csv. But I do not want to do that.

I create a hierarchical clustering. The formula is simple: the function is hclust, the input is d, and for the method I am asking it to use Ward's method. You can use some other method as well. If you search the help for hclust, it says that the method can be complete, or single, or average; you can choose any one of these.

So let us say that instead of Ward's I want method equal to centroid, or let us say complete only: method is equal to complete. Complete means that it looks at every pairwise distance between two groups and uses the largest one. If I run that, I get this. And if I plot it, I get a plot that I want to show you. The plot below is a dendrogram; study the dendrogram carefully. I have 537 observations and they have all come together, which is why it looks so cluttered. Each of the lines at the bottom is one single person.

Then two persons are joined into one group, and the height at which they join is the distance between them. You can start looking either from the top or from the bottom. At the bottom, where everybody is in a segment of their own, you will see that these segments are so close to each other that you cannot even think of them as very different from each other.
So this does not make any sense to you. Where does it make sense? At the top. If you look carefully, these are the people who are in one segment, and all these people are in another segment. There are two branches coming up at the top: one is this set of people and the other is that set of people. That means this small group is very different from the other group. Remember, this data has been created to have this kind of structure. Then, within this huge group, the next split comes here: these people are very different from that set of people.

So I can see three segments: one is this, another is probably from here to here, and then a small segment at the back. These three segments are clearly visible. When there are three segments, we can go ahead with the rest of the steps. So what can we do?

I will say: cut the tree into three segments. So I write groups, cutree into three, run it, and then draw borders, and you will get the borders properly. Here there is one more segment coming up quite distinctly, and I will talk about that segment.

So if I clear this up and break it into four segments instead of three, I will probably have a better view. I plot this once more and break it into four segments. Now see carefully that each segment has some meaning: this is one segment; this is segment number two, which is a big one; and segments three and four are very small, but they have some meaning. That is why they are coming up as separate segments, and we will discuss that.

The next step is k-means. For k-means, I told you that you have to decide how many segments you want. One option, as I said, is to take four segments from the dendrogram. The other option I will actually run now. So what am I doing? I am writing the following.
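The hierarchical clustering steps described so far can be sketched in R roughly as follows. This is a minimal sketch, not the course script: customer.csv is not reproduced here, so the code builds a small synthetic stand-in, and the column names (sl, shopexp, nov, pgro, pfnb, pfmcg, papparel) are assumptions made only for illustration. The functions dist, hclust, cutree and rect.hclust are the standard ones from base R.

```r
# Synthetic stand-in for customer.csv. Column names and values are
# hypothetical; only the layout (5 leading columns, then 6 behavioural
# columns) follows the description in the lecture.
set.seed(1)
n <- 60
mydata <- data.frame(
  sl       = 1:n,
  age      = sample(25:55, n, replace = TRUE),
  male     = rbinom(n, 1, 0.5),
  income   = round(runif(n, 4, 9), 1),
  distance = round(runif(n, 2, 6), 2),
  shopexp  = c(rnorm(45, 7, 1), rnorm(15, 49, 3)),  # a small high-spend group
  nov      = c(rpois(45, 5), rpois(15, 2)),
  pgro     = runif(n, 10, 40),
  pfnb     = runif(n, 10, 30),
  pfmcg    = runif(n, 10, 40),
  papparel = runif(n, 10, 40)
)

behaviour <- mydata[, 6:11]                   # keep the last six columns only

d   <- dist(behaviour, method = "euclidean")  # person-to-person distance matrix
fit <- hclust(d, method = "complete")         # alternatives: "ward.D2", "single",
                                              # "average", "centroid"

plot(fit, labels = FALSE)                     # dendrogram
groups <- cutree(fit, k = 4)                  # cut the tree into 4 segments
rect.hclust(fit, k = 4, border = "red")       # draw a border around each segment
table(groups)                                 # segment sizes
```

Changing k in cutree and rect.hclust from 4 to 3 reproduces the three-segment view; the dendrogram itself does not change, only where it is cut.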
When there is no model, when there are no clusters at all, everybody is in the same segment. Then what is the total within sum of squares?

If you remember the formula: take the ith person's jth characteristic, subtract the mean of the jth characteristic, square it, and add these up over the characteristics. That gives the squared distance of the ith person from the mean. Then sum that over all persons.

Once more, see carefully what I am writing. Let us say the first person's observations are $x_{11}, x_{21}, x_{31}, \dots, x_{k1}$. The second person's are $x_{12}, x_{22}, x_{32}, \dots, x_{k2}$. There are n persons, so the last one is $x_{1n}, x_{2n}, x_{3n}, \dots, x_{kn}$. And somewhere in between there is an ith person, who is $x_{1i}, x_{2i}, x_{3i}, \dots, x_{ki}$. So the total number of people is n, and when I denote any one person I denote it with i. Now, what is the mean observation? It is the mean of the first variable, then the mean of the second, then the mean of the third, and so on. So each variable has one mean: $\bar{x}_1, \bar{x}_2, \bar{x}_3, \dots, \bar{x}_k$.

So this is brand awareness, this is price sensitivity, this is brand loyalty, this is something else; every variable has a mean. When everybody is in one segment, how much is the error, that is, when the model cannot explain anything?

I will find out how far person 1 is from this mean, how far person 2 is, how far person 3 is, and so on. So how far is person 1 from this mean?
That is $(x_{11} - \bar{x}_1)^2 + (x_{21} - \bar{x}_2)^2 + (x_{31} - \bar{x}_3)^2 + \dots + (x_{k1} - \bar{x}_k)^2$. This is the squared distance of person number 1 from the mean.

So what is the distance of the ith person from the mean? The same thing with $x_{1i}, x_{2i}, x_{3i}$ and so on. I can write it as $\sum_{m=1}^{k} (x_{mi} - \bar{x}_m)^2$, where m varies from 1 to k. I will just rub this off so that we can see it properly. This is the part that I am focusing on: the long equation written below can be written in this compact form, and that is the distance.

That is why, in this code, I am writing apply(data, 2, var). What does this do? If I run just this much, it gives the variance of each column of the data set, one column at a time. Now, what is variance? For the mth column, the variance is $\frac{1}{n-1} \sum_{i=1}^{n} (x_{mi} - \bar{x}_m)^2$: each individual observation minus the mean of that column, squared, summed, and divided by n minus 1, the number of observations minus 1. The square root of this is the standard deviation. If that is the variance, then the sum of squared deviations for the mth column is nothing but n minus 1 times the variance of the mth column.

If I can write it that way, then you can see what I am doing in the code. apply(data, 2, var) gives the variances; then I multiply by nrow(data) minus 1, and nrow(data) is 537, so I multiply by n minus 1; and then I add them up with sum. This is the within sum of squares when there are no clusters.
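Written out in symbols, the argument above amounts to the following. The labels $d_i^2$ and $\mathrm{WSS}_1$ are introduced here only for convenience and do not come from the lecture; $x_{mi}$ is the value of variable m for person i, $\bar{x}_m$ is the mean of variable m, and there are n persons and k variables.

```latex
% Squared distance of person i from the overall mean
d_i^2 = \sum_{m=1}^{k} \left( x_{mi} - \bar{x}_m \right)^2

% Sample variance of variable m
\operatorname{Var}(x_m) = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_{mi} - \bar{x}_m \right)^2

% Total within sum of squares with a single segment
\mathrm{WSS}_1 = \sum_{i=1}^{n} d_i^2
              = \sum_{m=1}^{k} \sum_{i=1}^{n} \left( x_{mi} - \bar{x}_m \right)^2
              = (n-1) \sum_{m=1}^{k} \operatorname{Var}(x_m)
```

The last line is what the expression with apply, var, nrow and sum computes: swapping the order of the two sums turns the sum of squared distances into a sum of column variances, each scaled by n minus 1.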
That is, when there is only one segment and not more than one.

Now, when there is more than one segment, I find the within sum of squares using a function called kmeans, and i varies from 2 to 15, which means I am varying the number of clusters from 2 to 15. Let us say the number of clusters is 2.

What will kmeans give? With centers equal to 2, it will read the data and break it into two segments. I do not need the segments themselves right now, because I do not yet know how many segments I have. But I want to know, when there are two segments, what the within sum of squares (withinss) is. If I break the data set into two segments and find the sum of squares for each segment, how much do I get in total? That is the value when centers is equal to 2.

(Sorry, I had not typed withinss correctly.) So these are the two sums of squares of the 2 segments. If there are 3 segments, each segment has its own sum of squares, so there are 3 of them. Sum of squares here means the squared distance of each observation from the centroid, that is the mean, of its own segment. So these are the three values when there are three segments, and when there are four segments, these are the values. Then I add them up, and that is why there is a sum.

So I take the summation of these two and say that when there are two segments, this is the total within sum of squares. When there are 3 segments, if I add these 3 values, that is the total within sum of squares. If I add these four values, that is the total within sum of squares. I save that in wss. If I run this, I get wss, which is basically 15 observations.
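The loop described above can be sketched as follows. This is an illustrative version on synthetic data, assuming three well-separated groups so that the curve has a visible bend; the object name x and the nstart = 10 argument are additions for this sketch and are not mentioned in the lecture. kmeans, its centers argument and the withinss component are standard base R.

```r
set.seed(42)
# Synthetic data (an assumption for illustration): 3 groups of 50 points
# in 2 variables, centred at 0, 6 and 12.
x <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# One segment: (n - 1) times the sum of the column variances
wss <- (nrow(x) - 1) * sum(apply(x, 2, var))

# 2 to 15 segments: add up the per-segment sums of squares from kmeans()
for (i in 2:15) {
  wss[i] <- sum(kmeans(x, centers = i, nstart = 10)$withinss)
}

# Plot the 15 values against the number of segments and look for the bend
plot(1:15, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Within-groups sum of squares")
```

Because kmeans starts from random centers, nstart = 10 is used here to make the curve more stable from run to run; without it the values for larger numbers of clusters can wobble.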
This is the value when there was only one segment; this is the value when there are two segments, which is nothing but the sum of those two.

This is when there are three segments. So, you see, the within sum of squares is slowly going down. When there is 1 segment, you put bananas, oranges and apples all in one group. That is why the distance of the individual items from the group mean is very high. When you have two groups, the bananas come into one group and the apples and oranges into the second. Because the bananas are alone in one group, the distance of the individual bananas from the banana group mean is 0 or very low. That is why the distance goes down.

The oranges and apples are still in the same segment although they are different, but the overall distance has come down. When you further break the apples and oranges into two different groups, the within sum of squares comes down further. So it comes down slowly, as you can see.

Now I plot it, because I want to see how that curve comes down. Here there is a bend, somewhere up to 4 or 5. It is a judgment call; you can take the number of segments as 3 or 4. Let us say k1 is equal to 4. So using this method I decided to take k1 as 4, where k1 is the number of means, that is, the number of segments.

Then what will the observations be? I run this line, which breaks the data set into four clusters, and with that I have already done the clustering work. Now I am doing the third part, which, if you remember, is the identification of the segments.

The identification of the segments looks like this. I have done the aggregation. I will just print it once more; the aggregate looks like this, check it carefully.
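A rough sketch of this step, fitting k-means with four centers and then profiling each segment by its column means, could look like this. The data is again a synthetic stand-in with hypothetical column names, and nstart = 10 is an addition for stability; kmeans and aggregate are standard base R functions.

```r
set.seed(7)
n <- 80
# Synthetic stand-in for the customer data; names and values are hypothetical.
mydata <- data.frame(
  age      = round(runif(n, 25, 55)),
  male     = rbinom(n, 1, 0.6),
  income   = round(runif(n, 4, 9), 1),
  distance = round(runif(n, 2, 6), 2),
  shopexp  = c(rnorm(60, 7, 1.5), rnorm(20, 49, 3)),
  nov      = c(rpois(60, 5), rpois(20, 2)),
  pgro     = runif(n, 10, 40),
  pfnb     = runif(n, 10, 30),
  pfmcg    = runif(n, 10, 40),
  papparel = runif(n, 10, 40)
)
behaviour <- mydata[, 5:10]        # cluster on the behavioural columns only

k1  <- 4                           # number of segments chosen from the plot
fit <- kmeans(behaviour, centers = k1, nstart = 10)

# Mean of every variable (demographic and behavioural) within each segment
profile <- aggregate(mydata, by = list(cluster = fit$cluster), FUN = mean)
print(round(profile, 2))

# Attach the segment label to each customer; data.frame() names the new
# column fit.cluster
mydata1 <- data.frame(mydata, fit$cluster)
```

Note that the clustering uses only the behavioural columns, while the profile table also averages the demographic columns, which is what allows a segment to be described as, say, older or closer to the store.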
The serial number, first of all, does not carry any meaning.

For segment 1, if I focus only on behavior and nothing else (I will run it once more, sorry), you will see that segment 1 has a shopping expenditure of 49, that is, 49,000 per month. So their expenditure is 49,000 per month in one retail store. The other groups, on the other hand, are at about 6.6, 6.8 and 10, so I can say roughly six and a half, seven and eleven thousand. So group 1 is the high-expenditure group. That is number one.

But group 1 visits very few times, only twice a month, while the average for the others is almost 5. So group 1 is totally different from the others: they visit a limited number of times, twice a month, and purchase a lot. Their major purchases are grocery and apparel; FMCG and food and beverage are not things that they purchase a lot.

What about the distance? The distance is not very different across groups; it is about the same for all of them. This group's income is a little higher than the other groups'. They are predominantly male (0.875 is male) and middle aged. The other groups are younger, at 33, 32 and 36, while this group is 45. So these are middle-aged men. Who are these persons? Can you imagine?

A middle-aged man comes to your shop only twice a month, not more than that, but he purchases in bulk. A total shopping expenditure of 49,000 on average means about 25,000 per shopping trip. And what do they buy? Grocery and apparel items. You do not buy 25,000 worth of grocery or 25,000 worth of apparel every month.
A normal customer cannot buy that much, so these are probably B2B customers. That means they have small retail stores; they come and buy from this big retail store and sell in their small stores. They can be resellers, small kirana stores or something like that. So that is one segment which is coming up very prominently.

What are the other segments? For the other three segments, the shopping expenditure is 6, 6 and 10. So they are not B2B, they are B2C. All of them visit around 5 times a month, so they are weekly visitors, and they are not very different on that. But if you look carefully at segment 4, FMCG is 40 percent of their purchase, so they mainly come to buy FMCG.

Who are these people? They are roughly 50-50 male and female, with somewhat more males than females. Their age is a little higher than the other groups', and their income is also a little higher. So these people with slightly higher income focus more on FMCG, and they are ready to come from a long distance as well: they are coming from 4.25 kilometers away.

On the other hand, there is another group whose major purchase is apparel, at 35 percent, and they are the youngest, at 32. They also have more females than the other groups, and their distance is short, at 3.78. So the closest are the B2B customers, and then come these people who buy apparel; they also want a shorter distance. And group 2 also mainly buys FMCG, at 36 percent.

So now we have to check how group 2 and group 4 are different. Group 2 buys 36 percent FMCG and group 4 buys 40 percent FMCG. Group 2 makes around five visits and group 4 also makes around 5 visits. Group 2's shopping expenditure is 6, but group 4's shopping expenditure is 11.
So there is some difference in shopping expenditure. Where does that come from? Probably from income: group 2's income is 5.8 and group 4's is 7.

But one lakh of extra income should not change the shopping expenditure so much. Is there any other difference? Yes. Group 2 also spends on food and beverage, while group 4 has very low expenditure on food and beverage: only 12 percent versus 22 percent. So I can say that group 2 is mainly male, with an average age of 33 and an income of around six lakhs per annum. They live close by, within 3 or 4 kilometers of the retail store, and their average shopping expenditure is around six and a half thousand per month. They mainly buy both F&B, that is food and beverage, and FMCG: FMCG is primary, but they also buy a significant amount of food and beverage.

Group 4, on the other hand, is similar on all of these, but their shopping expenditure is around 10,000 and their income is around seven lakhs. They mainly focus on FMCG, and probably on groceries as well. So one group focuses on food and beverage, which is packaged, and the other on grocery, which is food that is not packaged; the focus is different. The age is also a little higher for group 4.

Their age and income are higher, and they are coming from a slightly longer distance. So I can imagine that this fourth group is a family person while the other group is not, or that the fourth group might have a larger family, or might have a car because they are coming from a longer distance, and so on. So this gives me some idea of what each segment is, from the demographics and from the behavior.

Now, at the last stage, what will I do? I have to create a targeting mechanism.
I have to find out, if a new customer comes and registers with me, how I will know whether they are in segment 1, segment 2, segment 3 or segment 4. So I create another data set, mydata1, where I have put the cluster numbers, that is, which person is in which cluster. If I take mydata1$fit.cluster, make a table of it, divide by 537 and multiply by 100, I get the shares in percentage terms, and I see that segment 1 has only 3 percent of the people.

So although they are very prominent, they are very small in number. That makes sense, because B2B buyers will be few. How many kirana stores will there be in the locality? 20, 50? But there will probably be 10,000 customers against 50 retail stores. So B2B purchasers will be fewer, and that is why it is 3 percent. The other segments are 30 percent, 25 percent and 40 percent, so they are fairly well sized.

Now, in mydata1, if you remember, I have the segment of each person, and I have age, male, income and distance; these are the four demographic variables that I have. Using these demographic variables, I will try to predict which segment a person is in. There are four categories: 1, 2, 3 and 4. This is not a linear regression, because these four categories are just different classes. So I change them to a factor variable using as.factor.

Then I run a multinomial logistic regression. For that I need a library called nnet, and then I call the model. Instead of lm I have written multinom; that is the only difference. fit.cluster is my y variable; age, male, income and distance are my x variables; and data is equal to mydata1. If I run this and look at the summary of the model, I get this output. What does the summary give me? It takes group 1 as the base category, so the score for group 1 is 0.
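The model-fitting step described above might be sketched as follows. The data frame is a synthetic stand-in for mydata1, with made-up segment labels that are loosely tied to age and income so that the model has something to find; none of the numbers correspond to the lecture's output. multinom comes from the nnet package, and the z-values are computed in the usual way as coefficient divided by standard error.

```r
library(nnet)   # provides multinom()

set.seed(3)
n <- 200
# Synthetic stand-in for mydata1; names and values are hypothetical.
mydata1 <- data.frame(
  age      = round(runif(n, 25, 55)),
  male     = rbinom(n, 1, 0.6),
  income   = round(runif(n, 4, 9), 1),
  distance = round(runif(n, 2, 6), 2)
)
# Fake segment labels 1..4, driven by age and income plus noise
score <- 0.08 * (mydata1$age - 40) + 0.5 * (mydata1$income - 6.5) + rnorm(n)
mydata1$fit.cluster <- cut(score, breaks = quantile(score, 0:4 / 4),
                           labels = 1:4, include.lowest = TRUE)

# Share of customers in each segment, in percent
round(table(mydata1$fit.cluster) / nrow(mydata1) * 100, 1)

# Treat the segment as a category, not a number
mydata1$fit.cluster <- as.factor(mydata1$fit.cluster)

# Multinomial logistic regression; segment 1 is the base category
model <- multinom(fit.cluster ~ age + male + income + distance,
                  data = mydata1, trace = FALSE)

s <- summary(model)
z <- s$coefficients / s$standard.errors   # z-value for every coefficient
p <- 2 * (1 - pnorm(abs(z)))              # two-sided p-values
round(z, 2)
round(p, 3)
```

Each row of the coefficient table belongs to one non-base segment (2, 3 and 4), and each column to the intercept or one predictor, so a coefficient is always read relative to segment 1.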
For group 2, the intercept is 15. For age, a unit increase in age lowers the chance that you will be in group 2, and lowers it most for group 3. So as age increases, your chance of being in group 1 increases. If you are male, the chances of being in groups 2, 3 and 4 also decrease, so the highest chance is group 1. These coefficients are all negative. If your income increases, your chances of being in groups 2, 3 and 4 are also lowered; the lowest is probably group 3, and in comparison group 4 is much better. And as distance increases, the chance of being in group 4 is the highest.

These are the corresponding standard errors. How do we find out whether the coefficients are significant or not? We divide the coefficient by the standard error, which gives us the t statistic, or z statistic.

So we find the z values. Most of them are significant; this one is probably marginal, and distance is not significant: its z values are 0.84, 0.79 and 1.08. Remember, the absolute value of z should be higher than 1.96. So distance is not significant, the others are significant, and this one is probably also not significant.

We can also find the exact p-values. Other than distance, whose p-values are all higher than 0.05, the other four terms, the intercept, age, male and income, are significant for all the groups. That means I can run the model without distance, and that gives the right scores.

How do we interpret it? Let us say somebody comes with this observation: his age is 30 years, he is male, and his income is six lakhs. Then what is the prediction? Before the probability that he will be in group 2, there is first a score; call it U, or whatever you like.
So let us say $a_2$, the score for group 2, is $16.37 - 0.24 \times 30 - 1.7 \times 1 - 0.45 \times 6$. The male term is $-1.7 \times 1$ because male is 1, and the income term is $-0.45 \times 6$.

This is not the probability; this is the score $a_2$. Similarly I find $a_3$ and $a_4$, and $a_1$ is equal to 0. The probability that this person will be in group 2 is then $e^{a_2} / \sum_{i=1}^{4} e^{a_i}$, where i varies from 1 to 4.

In other words, it is $e^{a_2} / (1 + e^{a_2} + e^{a_3} + e^{a_4})$. Why 1? Because $e^{a_1}$ is $e^{0}$, which is 1. That gives me the probability that this person will be in group 2, and similarly for group 3 and group 4. Whichever segment's probability is the highest, I put the person in that segment.

A similar thing can be done with LDA as well, and I will not do that here. We can also split the data and see how good the predictive model is. I have broken the data set into training and testing sets, created the model with the training data, predicted on the testing data, and computed the confusion matrix. It is similar to what we did for logistic regression; you can try it out.

If I run these four lines together, the confusion matrix gets created. You can see that there are a lot of off-diagonal elements. What is the accuracy? The correct ones are basically 53 and another 17, so around 70. Out of how many? Out of 150. 70 out of 150 is less than 50 percent, which is bad. So you have to find some other demographic variables which explain the segments better, and you have to try to improve your predictive model.

Once you can better predict which segment a customer will fall into, the targeting becomes much easier. So that is what we have done with logistic regression, multinomial logistic regression.
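The score-to-probability arithmetic described above can be checked with a few lines of R. The coefficients for group 2 are the ones quoted in the lecture; the scores for groups 3 and 4 are not given there, so the values used below are hypothetical placeholders chosen only to make the example runnable.

```r
# Score for group 2, using the lecture's coefficients:
# age 30, male = 1, income = 6 (lakhs)
a2 <- 16.37 - 0.24 * 30 - 1.7 * 1 - 0.45 * 6   # 4.77

# a1 is 0 because group 1 is the base category.
# a3 and a4 are NOT given in the lecture; these are hypothetical values.
a <- c(g1 = 0, g2 = a2, g3 = 3.0, g4 = 4.0)

# Probability of each group: e^{a_j} / (1 + e^{a_2} + e^{a_3} + e^{a_4})
prob <- exp(a) / sum(exp(a))
round(prob, 3)

# Assign the customer to the segment with the highest probability
names(which.max(prob))
```

With different values for the two placeholder scores the assigned segment can change, which is the point of computing all four probabilities rather than looking at one score in isolation.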
We can discuss more about LDA in the next class, with a small example on this data set itself. I will share two or three lines of code and we will discuss that, and then we will go ahead with the next module in the following videos. Thank you very much for being with me. We will meet once more in the next video. Thank you.