Understanding Cases, Variables and Data Types

In this lecture, we will take some examples to understand these types of data in more detail. In statistics, the more examples and the greater the variety of situations we look at, the better our understanding of the concepts gets. First we learn the concepts, and then we try to apply them to some situations. We will have some kind of tutorial session on each topic, and the first part of this lecture acts as a tutorial where we try to solve some simple problems to understand what we learnt in the earlier lecture.

When we create data tables, we have rows and columns. The columns are called variables: name, gender, age, CAT score or score in a competitive exam, and work experience could all act as variables. Cases or observations are specific to individuals. If there is a student or a candidate named Akhil, then you can have a case where the name is Akhil, the gender is male, the age is 27, the score is 95.27 and the work experience is 5. So, we also saw that cases or observations are rows and variables are columns, and in general we will have more cases or observations in a given table than variables.

A simple example could be the cars owned by 10 people. Another example would be the income of 20 employees. A third situation could be the size of clothes: if you go to a garment shop and start looking at shirts, you could find small, medium, large, extra-large, and so on. You could think of the number of students who are present in a class or who are absent from a class. You could think of the education of people, such as studied up to high school, a graduate, a postgraduate or a Ph.D. For each of these we try to give a variable name and indicate the type of data to which it belongs.

So, if we take the cars owned by 10 friends, the variable name could be model or brand. For example, the car could be a Ford, a Hyundai or a Maruti, as the case may be.
That is the variable name, model or brand, and that data is categorical.

Now, if we look at the income of 20 employees, the variable name could be salary or income, and the data type is numerical. We can understand why it is numerical because we have already learned that numerical data is data we can add and subtract, and also multiply and divide. If the data is such that we can add and subtract, which means the difference is interpretable, then it is interval-type data. If even multiplication and division can be interpreted, then it is ratio-level data. When we compare incomes or salaries, it is possible to say that the salary of person X exceeds the salary of person Y by a certain amount. It is also fair to say that person X gets 20 percent more than person Y, or gets one and a half times the salary of the other person. Therefore, the income of 20 employees comes under the numerical type of data.

Take the size of clothes as an example. If you go to a garment shop, you could find at least 4 different sizes: small, medium, large and extra-large. So, the variable name can be cloth size, and it is ordinal in the sense that we can rank the sizes. We can generally conclude that extra-large is bigger than large, large is bigger than medium, and medium is bigger than small. It is very difficult to say that medium is bigger than small by a certain quantity; to do that we need more data, we need actual measurements. Seeing only the classification small, medium, large and extra-large, we can categorize it as ordinal data and conclude that small is smaller than medium, which in turn is smaller than large, which in turn is smaller than extra-large.
But the extent to which one is smaller or larger is not explained, and therefore it is ordinal data.

The number of students absent from a class could be named absentees, and it quickly falls into ratio data. If 10 people were absent yesterday and 5 people were absent today, it is not only possible to say that today’s absentees were 5 fewer than yesterday’s; it is also possible to say that yesterday twice the number of people were absent. Both addition and subtraction as well as multiplication and division are meaningful, and therefore we can call this ratio-type data.

Look at the education of people. There are distinct education levels that were given. So, the variable name could be education level, and it would come under categorical data. The examples given in the previous slide were studied up to high school, did graduation, did post-graduation, and pursued a Ph.D. We realize that a particular person falls into one of these categories. So, it is a categorical variable, and within that it could be a nominal variable.

Now, let us look at some more examples to understand. "Pin codes are examples of numerical data": is it true or false? The answer is false; pin codes are examples of categorical data. Because pin codes are numbers, one might immediately think that they represent numerical data. Actually they do not, because we can neither add or subtract nor multiply or divide them and draw meaningful conclusions from the result. Therefore, pin codes are examples of categorical data.

The second one would be "cases represent columns in a data table". This is false: cases are the rows, and variables are the columns.

"The frequency of a time series is the time spacing between the data." To answer this question we also need to understand what time-series data is. Time-series data is essentially data measured across time.
For example, if we are looking at, let us say, an MBA class, we could go back and say that in the year 2018 we had 70 students in the class, in the year 2017 we had 65 students, in the year 2016 we had 73, and so on. So, we measure something over a period of time: the number of students in a class, sales in the 12 months of a year, stock prices over the last few weeks, the price of petrol or fuel on the 30 days of a month, and so on. One can give several examples of time-series data, and we will also see some situations in this course where we look at time series. With this information, let us come back to the statement that the frequency of a time series is the time spacing between the data. This is true: the time spacing between the data is the frequency of a time series.

"The Likert scale represents numerical data." The Likert scale is a scale on which we say whether or not we like something; it moves from a very strong like to a dislike. The Likert scale does not represent numerical data; it represents ordinal data, and therefore categorical data. It categorizes and ranks at the same time. On this scale we start with strongly agree, move to agree, and go on to strongly disagree, and a person picks one of them in a given situation. While we can say that strongly agree is a stronger agreement than agree, it is difficult to measure the difference between the two. Therefore, it does not represent numerical data; it represents categorical data.

"Aggregation of data adds more cases." Aggregation actually reduces the number of cases, because aggregation means adding up, and as we add up we combine several cases or observations into one. It is therefore necessary to understand that aggregation does not add more cases; it reduces them. So, if we really want to present data in a more concise, shorter form, then we resort to aggregating the data.
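The point about aggregation can be illustrated with a small Python sketch. The daily absentee figures and the function name `aggregate_by_week` are assumptions made up for illustration; they are not data from the lecture. The sketch only shows that adding up daily cases leaves fewer cases than we started with.

```python
from collections import defaultdict

# Hypothetical daily observations: (week, day, absentees).
# These numbers are invented for illustration only.
daily_cases = [
    (1, "Mon", 10), (1, "Tue", 5), (1, "Wed", 7),
    (2, "Mon", 4), (2, "Tue", 6), (2, "Wed", 2),
]

def aggregate_by_week(cases):
    """Add up absentees per week, collapsing several daily cases into one."""
    totals = defaultdict(int)
    for week, _day, absentees in cases:
        totals[week] += absentees
    return dict(totals)

weekly_cases = aggregate_by_week(daily_cases)

# Six daily cases become two weekly cases.
print(len(daily_cases), "daily cases ->", len(weekly_cases), "weekly cases")
print(weekly_cases)
```

The weekly table is shorter and easier to present, but the day-by-day detail can no longer be recovered from it.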
So, these examples helped us understand, in different situations, whether the data falls under categorical or numerical, and within those, under subcategories such as ordinal, interval and so on.

We already saw what a time series is: it is data measured across different points in time. Cross-sectional data essentially means looking at the data at a certain instant in time. That is the difference between cross-sectional and time-series data. We will now look at 5 examples to understand whether they are cross-sectional or time series.

Situation 1: a company has data on the number of employees who are in the PF scheme and the amount that they have in their provident fund. This is cross-sectional data, because the data is taken at a certain point in time and is not taken at different points for comparison.

Situation 2: about a thousand people were asked if India could win the cricket world cup. Again, this is an example of cross-sectional data, because at a certain point in time we ask a certain number of people whether something would happen or not.

Situation 3: the number of people who shopped for more than 5,000 on 5 days of a week. This is an example of time-series data, because the data is measured at a certain frequency, which is a day, and on 5 consecutive days or 5 days of the week we measure the number of people who shopped for more than 5,000.

Situation 4: 100 customers have given feedback; 60 have said excellent, 30 have said average and 10 have said poor. Again, this is an example of cross-sectional data, because the statement does not explicitly say that the feedback was collected at different points in time at regular frequencies. We could take this as cross-sectional data.

Situation 5: the number of cars, big cars and small cars, parked in front of a supermarket on 7 days of a week.
It is similar to what we saw in situation 3: the data is collected at different points in time, and therefore it comes under time-series data.

At times it is necessary for us to understand this classification, because certain analyses specific to time series are studied later in statistics, maybe not in this course. Therefore, we introduce the idea that once we look at data we also need to understand whether it is cross-sectional, which means it is taken at a certain point in time, or time series, which means it has been collected over a period of time.

Now, we move to some more aspects of data, and we try to describe categorical data. In the last lecture, we introduced the term categorical data and classified it further into nominal and ordinal. Now, we see how we present categorical data to the user.

I have taken an example from cricket. The numbers are imaginary; they do not represent live data. Let us assume that a cricket website asked who would score the most runs in, let us say, a popular 20-20 tournament. Users could pick a player; let us assume that these are the names of the players who were voted for, and that these are the numbers of votes polled by each of them. For example, player number one polls 45,276 votes, that is, 45,276 people believe that player number one listed here would score the maximum runs.

So, this is a data table with 4 columns: the first column is the name of the player, the second is the number of votes polled, the third is the fraction of the total votes polled, and the 4th is that fraction represented as a percentage.
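A four-column table of this kind can be sketched in Python as follows. Only the figure 45,276 comes from the lecture; the player labels, the other vote counts and the function name `frequency_table` are assumptions chosen for illustration.

```python
# Vote counts: only 45276 is taken from the lecture. The player labels
# and the remaining counts are invented for illustration.
votes = {
    "Player 1": 45276,
    "Player 2": 30000,
    "Player 3": 15000,
    "Player 4": 9724,
}

def frequency_table(counts):
    """Return rows of (name, votes, fraction of total, percentage)."""
    total = sum(counts.values())
    rows = []
    for name, count in counts.items():
        fraction = count / total
        rows.append((name, count, fraction, 100 * fraction))
    return rows

table = frequency_table(votes)

# Print the table with a header row.
print(f"{'Player':<10}{'Votes':>8}{'Fraction':>10}{'Percent':>9}")
for name, count, fraction, percent in table:
    print(f"{name:<10}{count:>8}{fraction:>10.3f}{percent:>8.1f}%")
```

Each fraction is the player's votes divided by the total votes, and the percentage is that fraction multiplied by 100.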
Because the 4th column is expressed as a percentage, the percentages add up to 100 and the fractions add up to one.

So, this is a data table, or frequency table, which represents the distribution of a categorical variable as a table. The categorical variable is the player, and what is counted is the number of votes polled. This is one way of presenting categorical data, saying these are the cases and this is the data. The advantage of this frequency table is that we are able to present all the data that we wish to present. The disadvantage is that the table becomes larger as the number of cases becomes large. For example, although the names listed here present all the information that we wish to present, one gets a feeling that if the table runs to a second page, or if there are more cases and observations, it becomes difficult to handle this kind of data.

So, one way is to use a table to present this type of data, while the other is to use pictures. This is a picture that presents the same data in pictorial form. It shows the names, and it also shows a bar representing the number of votes that each person has polled. You can see that the person who polled the maximum is close to about 50,000 votes, and here is somebody who has about 15,000. A bar chart is a very convenient way of presenting a categorical variable. There are 2 types of bar charts; this one is called a horizontal bar chart, and the other, which we will see later, is called a vertical bar chart. The bars represent the number that we wish to present, and this number is the number corresponding to the categorical variable. If we take a particular player, then the bar represents the number of votes that this person has got. One gets a feeling that the bar chart presents the data in perhaps a slightly nicer form, where the bars represent what we actually want to represent.
Perhaps a slight disadvantage of this representation is that, by looking at a bar, it is slightly difficult to say what exact number of votes this person has got. One can only say that it is between 40,000 and 50,000 and much closer to 50,000; one might get a feeling that it is anywhere between 48,000 and 49,000. In spite of this, the bar chart is accepted as a very convenient and nice way of presenting a categorical variable.
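A horizontal bar chart can be approximated in plain text with a short Python sketch. The player labels, the vote counts (apart from 45,276) and the function name `horizontal_bars` are assumptions for illustration; a plotting library would normally be used for a real chart.

```python
# Hypothetical vote counts; only 45276 appears in the lecture.
votes = {
    "Player A": 48500,
    "Player B": 45276,
    "Player C": 30000,
    "Player D": 15000,
}

def horizontal_bars(counts, width=40):
    """Return one text line per category, with bar length proportional to count."""
    largest = max(counts.values())
    lines = []
    # Sort so that the category with the most votes appears first.
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(width * count / largest)
        lines.append(f"{name:<10}|{bar}")
    return lines

lines = horizontal_bars(votes)
for line in lines:
    print(line)
```

As in the lecture's picture, the bars make it easy to compare players at a glance, but the exact vote counts cannot be read off the bar lengths alone.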