In this lecture, we look at data in a little more detail. We try to find out what types of data exist, and then we try to understand when each type is used. To begin, let us look at a small picture in which many things are written down; later we will classify them and organize them as data. You can find some names here, such as Akhil, Alipt, Nakul, or Priya. You can also find the letters M and F, which suggest male and female. You find numbers like 98.26 and 95.27, which might be average marks, or perhaps a rank or a percentile. And then you find numbers like 2, 3, 2, and so on. All of these are data, and we will use them as an example to understand the various types and categories of data. From this we can see that data need not be only numbers: names can be data, and symbols can be data. For example, M and F are symbols or notations that stand for male and female, whereas Meghna or Nakul are names.
Now we bring all these data into a table, which lets us sort and categorize them. We can say that there are 5 names, and these are written down; a gender, male or female, can be associated with each name, and that is written down as well. Then we group the set of numbers like 27, 25, 26, then the set of numbers like 95.27, 96.15, and so on, and then the set 5, 2, 3, 0, 12, and so on. If you look very carefully, not all 25 pieces of data in the table may be there in the previous picture. But a table of this kind can always be drawn up from the source from which the data in the previous picture were taken.
Now that we have classified these data, we look at them a little more and give the columns generic headings. The first column, which has Meghna, Nakul, Priya, is the name. The second is obviously the gender, male or female, corresponding to each name. The third column could represent many things; perhaps it is the age of the person, assuming that these are students from a class, let us say an MBA class. The fourth could represent a score, for example a percentile score in the entrance exam on the basis of which they were admitted. And the fifth could represent work experience, even though something like 12 for a person aged 22 is inconsistent. These are simply pieces of data that are there. The data are now in a data table, with each column having a heading, and the data fit under the headings name, gender, age, score, and experience. We also need units for some of them. Age would be in years and experience could be in years, while the others may not have an explicit unit of measurement. The score could be measured as a percentile. The value of 12 months is an outlier. Data can have outliers, so we need to collect and compile data carefully, and we should be able to recognize outliers such as this one. Once we make a table like this, the columns are called variables, here name, gender, age, score, and experience, and the rows are called cases or observations. This is the general terminology. What type of data are these?
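As a rough illustration of the table layout just described, the sketch below stores a few cases as rows and reads the variables off as column headings. The individual values are of the kind mentioned in the lecture, but how they pair up row by row is an assumption made for illustration, not the lecture's actual table. The `looks_inconsistent` helper and its rule that nobody starts work before age 18 are likewise hypothetical.

```python
# Illustrative rows only: the pairing of names, genders, ages, scores and
# experience is assumed, not copied from the lecture's table.
cases = [
    {"name": "Meghna", "gender": "F", "age": 27, "score": 98.26, "experience": 5},
    {"name": "Nakul", "gender": "M", "age": 25, "score": 95.27, "experience": 2},
    {"name": "Priya", "gender": "F", "age": 22, "score": 96.15, "experience": 12},
]

# Columns are the variables; each row (dict) is one case or observation.
variables = list(cases[0].keys())


def looks_inconsistent(case, min_working_age=18):
    """Flag a case whose years of experience exceed age minus an
    assumed minimum working age (a hypothetical rule of thumb)."""
    return case["experience"] > case["age"] - min_working_age


# Names of the cases that deserve a second look before analysis.
flagged = [c["name"] for c in cases if looks_inconsistent(c)]

print("variables:", variables)
print("cases to re-check:", flagged)
```

A check like this only points at entries to re-examine; whether the 12 is a typing error or a value recorded in months has to be settled from the source of the data.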
In this table, columns 1 and 2 hold data that are not numbers, whereas columns 3, 4, and 5 hold data that are numbers. Here M and F represent the gender; at times we could use a different notation, such as 1 and 0. But in general columns 1 and 2 do not use numbers to represent the data, whereas columns 3, 4, and 5 do. How do we classify types of data? The first classification is into categorical data and numerical data. If we go back to the table, the first 2 columns are categorical and the next 3 are numerical. Categorical data are responses that belong to groups or categories. They could be of the yes-no type, or something like strongly agree to strongly disagree. Numerical data use a numerical value as the response; it could be a discrete number or a continuous number, for example the number of students in a class or the height of people in a locality.
Another classification is into qualitative data and quantitative data. With qualitative data, there is no measurable meaning to the difference between numbers. Take, for example, the number on the shirt of a sportsperson. You could find one cricketer wearing number 12 and another wearing number 82. The numbers do not measure anything; they only describe something. They help in distinguishing, so if I see number 12 I know it is one sportsperson, and if I see number 82 I know it is another, but there is no way to say that the person wearing 82 is a senior player compared to the person wearing 12. Qualitative data are further divided into 2 types, called nominal data and ordinal data. We also have quantitative data, where we can give some meaning to the difference.
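Returning to the first classification, categorical versus numerical, a minimal sketch of a first-pass check is given below. It assumes each column is held as a plain Python list; the function name `classify_column` is made up for this example, and the lists simply collect values mentioned in the lecture without pairing them into rows.

```python
from numbers import Real


def classify_column(values):
    """Return 'numerical' if every value is a real number, else 'categorical'.

    Booleans are excluded because Python treats them as integers.
    """
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return "numerical"
    return "categorical"


# Values of the kind mentioned in the lecture, one list per column.
columns = {
    "name": ["Akhil", "Alipt", "Nakul", "Priya", "Meghna"],
    "gender": ["M", "F"],
    "age": [27, 25, 26],
    "score": [98.26, 95.27, 96.15, 97.44],
    "experience": [5, 2, 3, 0, 12],
}

kinds = {heading: classify_column(vals) for heading, vals in columns.items()}
for heading, kind in kinds.items():
    print(f"{heading}: {kind}")
```

Checking the stored type is only a first pass. A gender column coded as 1 and 0, or a list of shirt numbers, would come out as numerical here even though the numbers only label categories, so the context still has to decide.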
For example, with quantitative data, suppose somebody has scored 80 marks and another has scored 60 marks. In some instances one can say that this person has scored more than the other, and in other instances one could say that a person has scored twice the marks of another. Within the quantitative we have interval and ratio; within the qualitative we have nominal and ordinal. So there are 4 broad types of data: nominal data, ordinal data, interval data, and ratio data. Categorical data are either nominal, with no implied order, or ordinal, with an order or rank. Numerical data are classified as interval, where we can add and subtract, or ratio, where we can also multiply and divide in addition to adding and subtracting.
Name is a nominal type of data, with no implied order. Gender is nominal; here you say either male or female, and it is qualitative data. Age is ratio data: one could say that this person is twice as old as the other. The percentile in the qualifying examination is an ordinal type of data. There is an order or rank: one can say that somebody who got 98.26 had a higher rank than somebody who got 97.44. At the same time, we cannot say that this person scored, say, one mark more, because these are percentiles and they only represent the rank of the marks scored. One cannot say that the person who got 96.15 got one mark more than the person who got 95.27. What it means is that the person who got 96.15 did as well as or better than 96.15 percent of those who wrote the exam, whereas the one who got 95.27 did as well as or better than 95.27 percent of them. So it is ordinal data. Work experience can be treated as interval data: one can say that the person with 3 years of work experience has one more year than the person with 2, but it is not very fair to conclude that this person has one and a half times as much work experience. We now see that we have nominal, ordinal, interval, and ratio.
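The four types and the operations that make sense on each can be written down as a small lookup, sketched below. The assignment of the five variables to scales follows the lecture; the names `Scale`, `MEANINGFUL`, and `is_meaningful` are invented for this example, and treating each scale as also allowing everything the weaker scales allow (so interval and ratio data can be ordered too) is the usual convention rather than something the lecture states explicitly.

```python
from enum import Enum


class Scale(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Operations treated as meaningful on each scale (assumed cumulative).
MEANINGFUL = {
    Scale.NOMINAL: set(),
    Scale.ORDINAL: {"order"},
    Scale.INTERVAL: {"order", "add", "subtract"},
    Scale.RATIO: {"order", "add", "subtract", "multiply", "divide"},
}

# Classification of the table's variables as given in the lecture.
VARIABLE_SCALE = {
    "name": Scale.NOMINAL,
    "gender": Scale.NOMINAL,
    "age": Scale.RATIO,
    "score": Scale.ORDINAL,        # percentile: only the rank carries meaning
    "experience": Scale.INTERVAL,  # differences in years, but not "1.5 times"
}


def is_meaningful(variable, operation):
    """Return True if the operation makes sense for the variable's scale."""
    return operation in MEANINGFUL[VARIABLE_SCALE[variable]]


print(is_meaningful("age", "divide"))           # twice as old
print(is_meaningful("experience", "subtract"))  # one more year
print(is_meaningful("experience", "divide"))    # one and a half times
print(is_meaningful("score", "subtract"))       # one mark more
```

The point of the lookup is that the scale, not the fact that a column contains numbers, decides which comparisons can be made.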
We find examples of all 4 types of data here. Given a description of data, it is very important to understand which category it belongs to. Most of the time, we have observed that it is a bit difficult to distinguish between interval and ratio. Ordinal is reasonably all right, because you only look for a rank, and nominal is relatively easy to identify, whereas it is often difficult to distinguish between interval and ratio. So one needs to understand this point very carefully: for interval data, adding and subtracting make sense; for ratio data, all 4 operations make sense. In the example we called interval, while we can say that the person with 3 years of work experience has one more year than the person with 2, it is difficult to say that the person has one and a half times the experience or knowledge. Therefore, we categorize it as interval. Given some data, it is important to quickly understand which of these 4 types it fits into. That comes with constant practice and with understanding the context in which the data were picked up or are going to be used. For example, if we gave marks for work experience instead of using years, again one could only look at it as interval data.
Another example could be some class work for you. The following data were collected from 100 managers: the salary range, say 10,000 to 20,000 or 20,000 to 50,000; the car model they have; the year of graduation; years of experience; highest degree; the number of companies they have worked for; the brand of computer they have; the number of countries they have visited; whether they are married, or the number of children they have; and their favorite sport. So there are 10 different types of data, and you could try to classify these into the 4 types that we saw: nominal, ordinal, interval, and ratio. We can also give units for the numerical data.
Similarly, you can look at contexts such as an MBA admission, a dental clinic, a savings bank, an automobile dealer, a purchasing department, a factory, a school, a supermarket, a database of cricketers, the IIT Madras faculty profile or the faculty profile of any educational institution, or a museum. For each of these, first collect about 10 to 20 types of data, and then classify them into nominal, ordinal, interval, and ratio.