Hello everybody, welcome to the marketing analytics course. This is Professor Swagato Chatterjee from Vinod Gupta School of Management, IIT Kharagpur, who is taking this course for you. We are right now in module 3, where we have been discussing segmentation, targeting and positioning, and I will continue teaching on that. This is the last video under segmentation, targeting and positioning, and that is where we will start.

Till now, while doing segmentation, targeting and positioning, we collected data about people and we broke the data row-wise, that is, we grouped the human beings based on their preferences and their behavior. Those who are close to each other we brought together, and those who are away from each other we put into different buckets. That is how the clustering is done.

Now, coming back to the problem of where we can marry lots of different techniques. I told you, while discussing consumer preferences, that we can find out consumer preferences from their behavior, from the choices they have made. For example, if you have consistently chosen brand A over brand B, I know that there is something in brand A that you like very strongly; that is why you are choosing A over B.

So, we compare whatever is there in brand A with whatever is there in brand B. Wherever they are similar, I do not think that will impact your choice. But if there is something which is different between brand A and brand B, let us say the fragrance is different, or the color is different, or some performance quality is different, you might give a lot of weightage to that.
That is why you choose brand A over brand B consistently.

All I am trying to say here is that by focusing on the choices of consumers we can come to know their underlying preferences. Now, preference is not something that is visible. I have always told you in the previous videos that we should focus on behavior, because behavior is something that is visible. Here, in this context, the only behavior is the choice that the consumer has made. Oftentimes that is the only data that we have.

If I have only one kind of data, which is your choice data, it becomes difficult to understand what other kinds of behavior you show, and if I do not have lots of behavioral data about the consumers, then I cannot use behavioral segmentation as my segmentation method. So, in that context, what we try to do is find out the underlying preferences, and as I told you just now, the choices that a consumer made can help us find those underlying preferences.

How to do that we have already discussed in choice modeling, in the "what consumers want" module, under conjoint analysis: how I can take a consumer's behavioral data with only one behavior, in this case the choice, and find out the preferences.

Now, let us assume that there are multiple such consumers and I am able to find out the preferences of each of them. If so, then I should also be able to find out the segments; segments means those consumers who have similar kinds of preferences. For example, take the choice of a B-school: you are choosing which B-school you will go to for your management studies. The customers in this case are the students, the applicants, and some of them focus only on ROI.
So, whatever I am paying and whatever I am getting back, in monetary figures.

Some people focus on the experience as well. Some people probably have a little more preference for the atmosphere, the opportunity of having a start-up, the opportunity of doing something extracurricular. For some people all of these things also matter, and they might matter more for people who are going for hostel life for the first time. There can be one group of people who have already had a hostel life, a campus life, and they might not give much preference to a campus the second time, after their bachelor's, when they are doing their master's.

On the other hand, there is another group who probably were day scholars; they used to go to their respective colleges from home. They were always hometown-based, and this is the first time they are going for management studies where they will be away from home. So, they might want a campus.

So, the preference for a campus might be different for different people; the preference for ROI will be different; the preference for academic load will be different; the preference for industry connect will be different. There can be different kinds of people in the same overall population of, let us say, students who are interested in doing an MBA, and they might have several different preferences.

Now, we cannot survey all these people and ask each of them to rate on a 1 to 5 point scale how much they prefer this and how much they prefer that. Many a time it is not possible. So, what we do is use a conjoint analysis kind of technique: given these four choices you have chosen A, given those four choices you have chosen D, and that means your preferences towards the various attributes are this, this and this.
Now, if that is the case, then I can also find out that there will be some people who are close to each other and some people who are different from each other. That is where I will focus right now.

Let us assume that from a certain kind of conjoint analysis we found this dataset, where these are my customers: customer 1, customer 2, customer 3, customer 4 and so on up to customer n. And these are my aspects: aspect 1, aspect 2, aspect 3, up to aspect m. So, this is basically an n by m matrix, and each row holds one customer's preferences towards the aspects. The first customer's preference towards aspect 1 is written as, let us say, preference 11, then preference 12, then preference 13 and so on.

This preference is how much weightage you give to this aspect over some other aspect. What is important to understand is that I am trying to find out the distance between customers, and we are creating segments by checking how close the preferences of this customer are to those of that customer. Remember, we created a distance matrix, a Euclidean distance matrix, earlier; here also I can find a Euclidean distance between two customers.

So, what will the Euclidean distance be? For the ith customer and the jth customer, take the preference of the ith customer on the kth attribute minus the preference of the jth customer on the kth attribute, square that, and sum it as k varies from 1 to m. The square root of that sum is the Euclidean distance.

So, I can find the Euclidean distance for each pair of customers, join whoever is closest, and do a hierarchical clustering that way.
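Written out, the distance described above looks like this. The symbol p with two subscripts is an assumed notation (it is not from the lecture slides) for customer i's preference weight on attribute k:

```latex
d_{ij} = \sqrt{\sum_{k=1}^{m} \left( p_{ik} - p_{jk} \right)^{2}}
```

A small d means the two customers weigh the attributes similarly, so they are candidates for the same segment.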
We can also do a k-means kind of clustering. If there are n customers, I can randomly choose 2 or 3 of them as centroids and see who falls closer to which centroid; whoever is close to a centroid, I put in that particular cluster.

Then I move the centroids a little bit. If the customers' cluster assignments also change when I do that, the clustering is unstable, so I go on doing it. When I reach stability, each customer is assigned to a particular cluster or segment and is not shifting from that segment even if I move the cluster centre. That is called k-means clustering, and we can use that also. In the next 10 to 15 minutes or so, we will do that.

If you remember, we have data which looks like this. This data was part of our conjoint analysis data, but I changed column E a little bit; the rest remains the same. The dataset had fuel 1, 2, 3, because there were three types of fuel: fuel 1 was diesel, fuel 2 was petrol and fuel 3 was CNG. Similarly, I had capacity 1, 2, 3: capacity 1 was, if I am not wrong, 8 seater, capacity 2 was 6 seater and capacity 3 was 4 seater.

And then price 1, 2, 3. Price 1 was, if I am not wrong, 12 lakhs, then 8 lakhs, and then 4 lakhs or 6 lakhs, something like that; you can go back and check. This was a rating-based or ranking-based conjoint analysis, and here the ratings are given on a 1 to 10 point scale: 1 means I do not like this model at all, 10 means I like the model a lot.

In that problem, we ran a regression over the whole data because our underlying assumption was that our customer base is homogeneous. What is homogeneous? It means the customers are of a similar kind: their focus is the same, their preferences are the same.
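The assign-and-move loop of k-means described above can be sketched by hand in a few lines of R. This is only an illustration under assumptions: the preference numbers are made up, there are just two attributes, and the starting centroids are fixed to customers 1 and 4 instead of being drawn at random, so that the run is reproducible.

```r
# Toy preference matrix: 6 hypothetical customers x 2 attributes (made-up numbers).
prefs <- matrix(c(1.0, 1.2,
                  0.8, 1.0,
                  1.1, 0.9,
                  5.0, 5.2,
                  4.8, 5.1,
                  5.2, 4.9), ncol = 2, byrow = TRUE)

# Use two customers as the starting centroids (fixed here, random in practice).
centroids  <- prefs[c(1, 4), , drop = FALSE]
assignment <- rep(0, nrow(prefs))

repeat {
  # Squared Euclidean distance from every customer to every centroid.
  d <- sapply(seq_len(nrow(centroids)), function(k) {
    rowSums(sweep(prefs, 2, centroids[k, ])^2)
  })
  new_assignment <- max.col(-d, ties.method = "first")  # index of the nearest centroid
  if (all(new_assignment == assignment)) break          # stable: nobody switched cluster
  assignment <- new_assignment
  # Move each centroid to the mean of its current members.
  for (k in seq_len(nrow(centroids))) {
    centroids[k, ] <- colMeans(prefs[assignment == k, , drop = FALSE])
  }
}

assignment  # cluster label per customer
centroids   # final cluster centres
```

In practice the built-in kmeans() function does the same job; the loop is spelled out here only to make the stability check visible.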
So, there is not much difference between customer 1 and customer 2. That is why we ran the analysis on the whole data.

But here I am trying to do something else. I am picking up each customer's data separately. If you check, the serial number is actually the customer ID, and each customer has 9 data points. So, I pick up each customer's data and run a regression on rating with fuel, capacity and price as my X variables.

There are 60 customers in total, so I will run 60 small regressions. If I do that, what will I get?

For customer 1 up to customer 60, I will get their preferences for fuel, capacity and price. More precisely, I will get their preference for fuel 2 and fuel 3 in comparison to fuel 1. Remember that fuel is categorical data; fuel 1, 2, 3 are categories. So, keeping fuel 1 as my reference category, we find out how much the customer prefers fuel 2 and fuel 3.

In other words, keeping diesel as the reference point, how much they prefer petrol or CNG. Similarly, I will get capacity 2 and capacity 3, and price 2 and price 3. These are the 6 things that I will get for each customer, and each row is one customer. Then I will find out how distant customer 1 and customer 2 are from each other. If I use hierarchical clustering, I will put them in the same cluster if they are close and in different clusters if they are far apart.

If I use k-means clustering, I have already told you what kind of algorithm you will use. But this table of coefficients is something that I have to create first.
Once I create this, the rest of the job is just picking up the code from the previous sessions and running it; this part is what is new, and this is what I will do. In other words, I am combining conjoint analysis, or regression analysis, whatever you want to call it, with clustering. Conjoint is the marketing name, the methodological name; the underlying econometric methodology is regression.

So, I am marrying regression analysis, or conjoint analysis, with clustering in this particular problem. I told you in the introductory presentation of this course that marketing analytics is often a smart job: you have to keep in mind the various tools that you have in your arsenal and, for a particular problem, which of those tools you can bring in and work in parallel or sequentially so that you can solve the problem.

In this particular context, I am using regression and cluster analysis sequentially. In the previous problem, there was segmentation using cluster analysis, followed by targeting, for which we used multinomial logistic regression. You can also use linear discriminant analysis for that, as I said in the previous video.

There, if you remember carefully, there were only 6 behaviours, and that is why we went ahead with cluster analysis directly. But if you get lots of behavioural data, then to reduce them to a smaller number of behaviours you can also do factor analysis before doing cluster analysis. We will give you that as a do-it-yourself kind of project that you can try out on your own and see what comes of it.

So, there we married cluster analysis with multinomial logistic regression. Here, we are marrying simple linear regression with cluster analysis.
So, let us see what I have.

I have opened the week 3 session 5 dot R file, which is there with you, and I have already shared the corresponding data file, which looks like this. First things first: I set my working directory to the source file location, since I have kept these two files in the same place. My global environment and console are clean, and I am reading the data.

The data looks like this, as you have already seen. Now, what am I doing in line number 3? If I look at the structure of the data, it says that fuel, capacity and price are integer variables. Are they integer variables? No, they are not. Why? Because they are categories. 1, 2 and 3 are actually three different types of fuel. You cannot say that CNG is three times diesel or that petrol is two times diesel. So, 3, 2, 1 have no numerical meaning; they are just three categories.

So, what will you do if they are three categories but the structure of the data shows them as integer variables? You will change them to factor variables, very good. We do that by running line number 3. See what it says: for i in 2 to 4. Why 2 to 4? Because the second column is fuel, the third column is capacity and the fourth column is price.

So, i will vary from 2 to 4, second column to fourth column. What do I write? I write data, third bracket (the square bracket), comma, i. The moment I write a data frame name and then a third bracket, R knows that I am doing subsetting.

This was in the first module, remember. Whatever I write before the comma is what? The row numbers that I will use to subset. And whatever I put after the comma is what?
The column numbers, very good, the column numbers that I will use to subset. Now, if I write nothing after the comma, that means I will take all the columns.

And if I write something before the comma, then I will take those rows. So, the moment I write data[, i], there is nothing before the comma. That means I will take all the rows and the ith column. What is the ith column? When i is 2, it is the second column; when i is 3, it is the third column; when i is 4, it is the fourth column. So, data[, i] is equal to what? The factor representation of data[, i].

So, you pick up the ith column of data, change it to its factor form and save it back to the ith column. When i is equal to 2, it picks up the fuel column and changes it to its factor form, which means 1, 2 and 3 will now be used as categories, and saves it back. I do that for all three columns, so running this changes fuel, capacity and price to their factor forms.

See, these are now factor variables, so I can run the regression. Remember, each of these factor variables has 3 categories: fuel has 3, capacity has 3 and price has 3. When I run a regression, because of multicollinearity, at least one category has to be dropped, so two dummies will be created for each categorical variable.

For fuel, say, there are 3 categories: two dummy variables will be created and one category will be dropped. Why? Please go back and study the details of linear regression; a little bit of econometric background is needed. It is because there is a multicollinearity issue.
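The factor conversion loop described above can be sketched as follows. The data frame here is a hypothetical stand-in with one customer and nine made-up ratings; only the column order (ID, fuel, capacity, price, rating) follows the lecture.

```r
# Hypothetical stand-in for the lecture's data: one customer, nine car profiles.
data <- data.frame(
  id       = rep(1L, 9),
  fuel     = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  capacity = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  price    = c(1L, 2L, 3L, 2L, 3L, 1L, 3L, 1L, 2L),
  rating   = c(9, 7, 4, 8, 5, 6, 3, 6, 2)
)
str(data)  # fuel, capacity and price are reported as int

# Columns 2 to 4 are fuel, capacity and price: recode each one as a factor.
# data[, i] means "all rows, ith column".
for (i in 2:4) {
  data[, i] <- factor(data[, i])
}
str(data)  # the same three columns are now factors with levels "1", "2", "3"
```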
I cannot use all the categories of a categorical variable in my model, because if I do, the dummies are perfectly multicollinear: together they add up to the intercept column, so the correlation is 100 percent. The VIF score would come out as infinite or something like that. That is something we have to avoid. How? We drop any one of the categories. In this case R does it automatically: it drops whichever level comes first alphabetically. For fuel 1, 2, 3, level 1 comes first, so R drops that and gives you fuel 2 and fuel 3. This is similar to the result we got when we ran the conjoint analysis.

So, there are 3 categorical variables and each has 3 levels. If one level gets dropped from each, there will be 6 dummy variables created from these 3 categorical variables. What are these dummies? For fuel, fuel 2 and fuel 3; for capacity, capacity 2 and capacity 3; and for price, price 2 and price 3.

So, here I am creating a matrix which has 6 columns and 60 rows, and in this line I am writing 0 in all the entries as a starting point. So, 0, ncol is equal to 6, nrow is equal to 60: I am creating a matrix. The matrix looks like this: there are 60 rows, all of them 0. And I am naming the columns. The column names of matrix 1 I write as fuel 2, fuel 3, capacity 2, capacity 3, price 2 and price 3.

If I run that and look again, I see that the column names have changed. Now, what do I do? I run the regression 60 times, each time with the data of the corresponding i. For the ith customer in this dataset, what is the starting row? Each customer has 9 observations.
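As a small illustration of the dummy coding and the empty coefficient matrix described above, here is a sketch in R. The object name coef_mat is an assumption; the lecture only refers to it as matrix 1.

```r
# A factor with three levels, as fuel would be after conversion.
fuel <- factor(c(1, 2, 3, 1, 2, 3))

# model.matrix() shows the columns lm() would build: an intercept plus
# dummies for levels 2 and 3. Level 1 is the reference and gets no column.
X <- model.matrix(~ fuel)
colnames(X)  # "(Intercept)" "fuel2" "fuel3"

# Empty 60 x 6 matrix, filled with 0, to hold one row of dummy
# coefficients per customer (coef_mat is a hypothetical name).
coef_mat <- matrix(0, ncol = 6, nrow = 60)
colnames(coef_mat) <- c("fuel2", "fuel3", "capacity2", "capacity3",
                        "price2", "price3")
dim(coef_mat)
```

Three factors with three levels each give 3 x 2 = 6 dummy columns, which is why the matrix has 6 columns.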
So, if you remember, for the ith customer there will be i minus 1 into 9 observations before him.

Say i is 2, the second customer. That means 9 observations come before the second customer, which is 2 minus 1 into 9. If I am the fifth customer, then 4 customers' observations are already there. That means 36 observations are already there, which is 4 into 9, or in other words 5 minus 1 into 9. So, i minus 1 into 9 rows come before the ith customer's observations start.

That is why, if you look carefully at what I wrote, his starting point is i minus 1 into 9 plus 1; that is the first observation of the ith customer. And what is his last observation? i into 9. So, I run a simple linear regression, lm of rating on fuel, capacity and price, but data is equal to a subset of the data, not the whole data.

How do I write a subset of the data? I am just copying this part and pasting it here so that you can check carefully what I have written, and then I will remove it. It is data, third bracket, the row range, and a comma. That means a subset of data. Nothing written after the comma means all the columns. And what did I write before the comma? The row range: if i is equal to 2, it starts at 10; if i is equal to 3, it starts at 18 plus 1, that is 19, and it goes up to i into 9. So, for every i, the rows from the first row number to the last row number are picked up.

As i changes, these values will change. So, I am writing for i in 1 to 60: for each customer, first run a regression with that customer's data and save it in fit 1. That is the first job. What is the second job? Let us say I am running it for i equal to 1. For i equal to 1, this is how fit 1 looks.
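The row arithmetic and the per-customer regression described above can be sketched like this. Everything about the data is an assumption made for illustration: the ratings are simulated, there are only 3 customers instead of 60, and the nine profiles follow a simple Latin-square design. The names fit1 and coef_mat are likewise illustrative.

```r
set.seed(1)  # simulated ratings only; this is not the lecture's data

n_cust <- 3  # small toy count; the lecture has 60 customers
# Nine profiles per customer: a 3 x 3 layout on fuel and capacity,
# with price assigned in a Latin-square pattern.
profiles <- data.frame(
  fuel     = factor(rep(1:3, each = 3)),
  capacity = factor(rep(1:3, times = 3)),
  price    = factor(c(1, 2, 3, 2, 3, 1, 3, 1, 2))
)
data <- data.frame(
  id = rep(seq_len(n_cust), each = 9),
  profiles[rep(1:9, times = n_cust), ],
  rating = sample(1:10, 9 * n_cust, replace = TRUE)
)

coef_mat <- matrix(0, nrow = n_cust, ncol = 6)
for (i in seq_len(n_cust)) {
  rows <- ((i - 1) * 9 + 1):(i * 9)  # i = 2 gives rows 10 to 18
  fit1 <- lm(rating ~ fuel + capacity + price, data = data[rows, ])
  coef_mat[i, ] <- coef(fit1)[2:7]   # skip the intercept in position 1
}
coef_mat  # one row of six dummy coefficients per customer
```

With 9 observations and 7 estimated coefficients per customer, each regression has only 2 residual degrees of freedom, so individual estimates are noisy; that is one reason the next step filters on significance.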
Now, remember that there are some values which I consider to be 0 because they are not significant: this P value is not lower than 0.05, and this one is not lower than 0.05. So, I can consider this coefficient to be 0, this to be 0, this to be 0 and this to be 0. So, I am writing ifelse: if this part of the coefficient table from the summary of fit 1 is less than 0.05. What is this part? It is nothing but the P values.

Sorry, the row range should not be 2 to 4; it has to cover all the dummy coefficients, here and here also. So, I will run it once more. See, check the P values: they are 0.8713, 0.04169, 0.056 and so on. I will only consider those coefficients whose P values are lower than 0.05.

If the P value is lower than 0.05, take the coefficient as it is. If it is not lower than 0.05, take the coefficient to be 0. And the rows should be 2 to 7: rows 2, 3, 4, 5, 6, 7 are the six dummy coefficients, and row 1 is the intercept, which makes 7 coefficients in total. In the actual file that will be given to you, this will be edited to 2 to 7. So, I run this, and once I run it the matrix looks like this.

A row of all zeros means that for customer 2 nothing was significant, and for customer 3 nothing was significant, but there are some customers for whom some things were significant, and the corresponding weightages got populated here for each customer.

Now, the rest is simple. I will use this matrix to run a cluster analysis. First I convert it to a data frame called DF. DF looks like this, the same thing. Now, I will use this DF to run a k-means kind of clustering.

First I plot the scree plot. This is the code that I used earlier also; I just run it, and this is how the plot comes up. I can see that the kink is at 2; the kink is coming properly at 2.
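The significance filter described above, keeping a coefficient only when its P value is below 0.05, can be sketched as follows. The nine ratings are made up, and the names one_cust, ctab, est, pval and kept are illustrative, not taken from the course file.

```r
# One customer's nine ratings (made-up numbers) on a Latin-square design.
one_cust <- data.frame(
  fuel     = factor(rep(1:3, each = 3)),
  capacity = factor(rep(1:3, times = 3)),
  price    = factor(c(1, 2, 3, 2, 3, 1, 3, 1, 2)),
  rating   = c(9, 7, 4, 8, 5, 6, 3, 6, 2)
)
fit1 <- lm(rating ~ fuel + capacity + price, data = one_cust)

# Coefficient table: columns Estimate, Std. Error, t value, Pr(>|t|).
ctab <- summary(fit1)$coefficients
est  <- ctab[2:7, "Estimate"]   # rows 2 to 7 are the six dummies
pval <- ctab[2:7, "Pr(>|t|)"]   # row 1 (the intercept) is left out

# Keep a coefficient only when its P value is below 0.05, otherwise store 0.
kept <- ifelse(pval < 0.05, est, 0)
kept
```

Inside the loop over customers, a vector like kept would be written into row i of the coefficient matrix, which is how rows of all zeros arise for customers with no significant effect.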
You can probably also use 3 or something like that. First I will use 2 clusters and see what meaning I get with only 2 clusters. I run that, and there are 2 clusters.

These are the average preferences of the first cluster, and these are the average preferences of the second cluster. The first cluster does not think petrol is much more attractive than diesel. Diesel is the reference point; fuel 1 is diesel, which got dropped. So, they do not find petrol more attractive than diesel, but they definitely find CNG less attractive. The second cluster, in contrast, is highly sensitive towards fuel; they are very much diesel conscious.

Anything which is petrol they do not like, and anything which is CNG they absolutely do not like. So, the second cluster is very much focused on diesel. The first cluster is not much focused on diesel; for them diesel and petrol are no different. Their coefficient for petrol is 0 and their coefficient for fuel 3, which is CNG, is negative but not very big. The second cluster is also capacity sensitive: they want a big car, an 8 seater. A 6 seater they do not like, and a 4 seater they do not like at all.

The first cluster, on the other hand, is not much sensitive here either: 8 and 6 seaters are both okay, and they do not prefer a 4 seater, but the difference in preference is not much. Similarly, the second cluster is price sensitive also. So, in other words, I can say group 2 is heavily sensitive on all 3 aspects and group 1 is not so sensitive on any of the 3 aspects. What if there were 3 clusters?

Let us check. With 3 clusters, I find that one group is not fuel sensitive at all but heavily capacity sensitive: the moment capacity drops below a level they become very sensitive, and the moment the price goes up they become very sensitive.
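The scree plot and the cluster averages discussed above can be sketched with R's built-in kmeans(). The coefficient table below is invented (8 customers, 3 of the 6 dummy columns) purely to show the mechanics; it is not the lecture's output.

```r
set.seed(42)  # k-means starts from random centres, so fix the seed

# Made-up coefficient table: 4 low-sensitivity and 4 high-sensitivity customers.
DF <- data.frame(
  fuel3     = c(-0.5,  0.0, -0.4, -0.6, -3.0, -3.5, -2.8, -3.2),
  capacity3 = c(-0.3, -0.5,  0.0, -0.4, -4.0, -3.6, -4.2, -3.9),
  price3    = c( 0.0, -0.2, -0.4, -0.3, -2.5, -2.9, -3.1, -2.7)
)

# Total within-cluster sum of squares for k = 1..4; the bend in this
# curve is the "kink" read off the scree plot.
wss <- sapply(1:4, function(k) kmeans(DF, centers = k, nstart = 10)$tot.withinss)
plot(1:4, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")

fit <- kmeans(DF, centers = 2, nstart = 10)
fit$centers         # average preference of each segment, one row per cluster
table(fit$cluster)  # how many customers fall in each segment
```

Reading the rows of fit$centers side by side is the comparison made in the lecture: a row with coefficients near 0 is a low-sensitivity segment, and a row with large negative coefficients is a segment that penalises CNG, small cars or high prices heavily.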
Another group is like segment 2 in the previous solution: they are sensitive to everything. Group 3, on the other hand, is highly or moderately fuel sensitive and not sensitive to anything else.

So, I can find different groups who have different kinds of sensitivity towards different things. Then, if I have some demographic data, I can try to see which demographic variables predict whether you will be in group 1, group 2 or group 3. Is it some specific income group, a specific gender or a specific cultural background that leads to your price sensitivity, or fuel sensitivity, or sensitivity to everything? We can try to analyze that.

For that you again have to use the kind of regression, or LDA, that we named in the last video. So, that is all for segmentation, targeting and positioning. We have done quite a lot. Thank you for being with me. I will come back with a new module in the next video. Thank you.