Introduction to Statistics

In this first lecture, we define what statistics is, and where and how it is used. So, we ask the first question: what is statistics? The answer is that statistics answers questions using data, or information about a situation. There is also the word statistic, where you can observe that the final s is missing. This word is also commonly used: a statistic is a property of data. For example, a simple average is a statistic, and the median is a statistic. A statistic is a property of data, or a number that represents the data in some form. Statistics, the field that we are talking about, is the art and science of extracting answers from data. Therefore, data is very important for learning and understanding statistics.

So, why do we study statistics? Statistics helps in decision-making in an uncertain environment; at times it also helps decision-making under certainty. Primarily, the purpose is to make good decisions, and to make them with data, because decisions made using data are important and can be more consistent than decisions made through opinions. Therefore, we need to make decisions using models involving data, and statistics as a field of study provides us with models and methods that help us make good decisions using data. Therefore, we collect and analyze data to make decisions.

At times we collect data and we will not be able to cover the entire population, as it is called. So, we collect data from samples, which are subsets of the population, and then try to understand something about the population by analyzing the data collected from the samples. So, what are population and sample? The population is the complete set of all the items that interest an investigator. The population size, generally denoted by capital or uppercase N, can be very large and at times even infinite.
For example, if we want to look at the average height of people in the world, then we realize that the population is very, very large. A sample, in contrast, is an observed subset of the population. It is also important to note the notation that we have used: the sample size is a small or lowercase n, whereas the population size is a capital or uppercase N. That leads us to simple questions such as how our samples were chosen: were they chosen randomly, were they chosen systematically or in some other way, and what is a random sample?

We also need to understand two more commonly used words: a parameter and a statistic. We have already seen the meaning of the word statistic. A parameter is a specific characteristic of the population, and a statistic is a specific characteristic of a sample. For example, if the average is what we are looking at, the population average is a parameter and the sample average is a statistic. For the population average we use the notation µ, whereas for the sample average we use the notation x̅. So, we will be dealing with statistics like sample averages and so on. And there are models with which we can try to estimate parameters of the population using data collected from samples, or using statistics from samples.

We start with a simple exercise to understand a couple of things. Let us say an airline claims that less than 5% of its flights from Delhi airport depart late. From a sample of 100 flights, 6 flights were found to depart late. Let us look at this sentence and try to understand: what is the population? What is the sample? What is the statistic; for example, is 6% a parameter or a statistic? So, let us read the sentence again: an airline claims that less than 5% of its flights from Delhi airport depart late.
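
The quantities in this exercise can be laid out as a small Python sketch. The numbers come from the exercise itself; the variable and function names are only illustrative assumptions. The sketch computes the sample proportion, which is a statistic, and keeps it separate from the airline's claim, which is a statement about the population.

```python
# Numbers taken from the airline exercise; all names here are illustrative.
claimed_upper_bound = 0.05  # airline's claim about ALL its Delhi departures (population)
sample_size = 100           # n: flights for which data were collected
late_in_sample = 6          # flights in the sample observed to depart late


def sample_proportion(count: int, n: int) -> float:
    """Return the fraction of a sample that has some property (a statistic)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return count / n


# The proportion is computed from the sample only, so it is a statistic.
p_hat = sample_proportion(late_in_sample, sample_size)
print(f"sample proportion of late flights: {p_hat:.0%}")
```

Note that the sample proportion being above 5% does not by itself settle whether the claim is false; judging that would need inferential methods, which this course does not cover.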
So, we can assume that the population, in this case, is all of the airline's flights that depart from Delhi airport. Out of these, data on 100 flights were collected, which means the sample size, in this case, is 100. Small n is 100; capital N is very large. And then it was observed that 6 flights were found to depart late. The 6 flights, or 6%, is a statistic that we found from the sample. And so, the answers are: the population, in this case, is all of the airline's flights that depart from Delhi airport. The sample, with small n equal to 100, is the flights for which data have been taken. The 6 percent that is observed, that is, 6 flights out of 100, is a statistic.

Now, going back to the field of statistics, we have 2 broad types of models in statistics: descriptive statistics and inferential statistics. Descriptive statistics use graphical and numerical procedures to summarize data and to transform data into information. Inferential statistics provide a basis for forecasts, predictions, and estimates, which are used to transform information into knowledge. So, we will begin with learning descriptive statistics. In this particular course, whatever statistics we are going to look at are descriptive, and we will not be looking at inferential statistics.

So, an example of descriptive statistics: the number of customers who visited a jewelry shop in the last 10 days is given as 83, 80, 79, 85, and so on. We can describe this data in many forms, and we can try to understand something from it. For example, one could say that, looking at this data, in the first 5 days we did not find too much variation, whereas the next 3 days look very different. Among those 3 days there was little variation, but compared to the first 5 there is an increase, and then there is a reduction. So, some things that we could observe are that the first 5 days could be Monday through Friday.
The next 2 could be a Saturday and a Sunday. The third could perhaps be a holiday, which is why the number of people increased, and then it went back to working days, and so on. So, we can try to describe something from the data that we actually have. How do we do it? We will see that a little more formally as we move along in this course.

So, some simple things about making inferences. One is estimating a parameter, such as the average age of customers. Another is testing a hypothesis: for example, is it true that weekend sales are higher than weekday sales, or that the number of people who visit the shop during weekends and holidays is much higher than the number who visit on normal days? Another inference could be how to make a forecast of sales for the next month using past data. So, these are some examples of making inferences. And as I pointed out, we will not be looking at models to do this in this course, whereas we will be looking at models that describe the data, for example, finding the average and so on.

Now, what more can we do with data? The first thing we do with data is to compare. For example, we can compare a 6 feet 3 inch boy to a 5 feet 1 inch boy, and say that the first boy is considerably taller. We could compare a student with a CGPA of 9.4 with another student with a CGPA of 7.8, and perhaps come to the conclusion that the student with the CGPA of 9.4 has performed academically better than the student with the CGPA of 7.8. We can compare 2 people, one having an income of 24 lakhs per year and another having an income of 8 lakhs per year, and conclude that the first person is earning more than the second. We could compare different types of cars, and then form a judgment that one person has a costlier car than the other.
We could compare a 70-year-old woman to a 20-year-old woman and conclude that one is older than the other. We could compare 2 people, one a minister and the other a professor, and say that they have different professions. Each of them enjoys a certain privilege in society and takes part in certain types of decisions that would benefit society. So, data helps us to compare, and we have given you examples of how you can use different data to compare.

Data also helps us to infer or interpret. Going back to the same examples, the 6 feet 3 inch boy is taller than the other. The student with a CGPA of 9.4 could be taken as more intelligent than the other, though strictly one can only say that this student has performed better. The person earning 24 lakhs can be said to be richer than the person whose income is 8 lakhs. For the person who drives the better car, one can say that the affordability is higher. One could compare the health of a 70-year-old person to that of a 20-year-old person, and one could compare the power that a minister has with what a professor would have. Therefore, data helps us to infer or interpret.

At times, data also helps us to answer questions. Some of these questions could be: How do I price this car, or how do I price an air ticket? How much is the customer willing to pay for something? Where should my admission cut-off be if I am doing admissions for a course? How tough should my question paper be, if I am a course instructor? When should I offer a discount, if I own a shop and sell things? What should be the capacity of the manufacturing plant? And how much should I advertise, and when, if I have an event like a world cup? All these questions are also answered using data, and we have given you a sample of these kinds of questions.

There are a couple more things that we need to look at. One should understand that there is a lot of variation in the data.
And in all the examples that we saw, where we looked at data and then made some simple inferences, the comparison essentially tries to capture the variation in the data: variation in height, weight, education level, affordability, health, wealth, intelligence, and so on. The other aspect that we have to look at is dependencies, which lead to model building: do these parameters have linear or non-linear behavior, and do different models require different types of data? We need to understand all these aspects as we move along.

So, at this point we could think about what kind of data would be required for planning events. As a simple example, if an educational institution wishes to hold interviews to select MBA students in Mumbai, what kind of data do we require? This would be a good exercise to understand a number of the things that we have seen till now. To give just some examples, it could begin with the timing, the number of days, the number of students who are going to be called for interview, and the possible places: if, for example, an IIT is doing it, would we do it in another IIT, or in some other institution where space is available? So it could depend on the number of candidates called for the interview, the timings, and the location. All these would result in different kinds of data being required to carry out the exercise.

Another example could be a student aspiring to study for an MBA, who might want to ask which institutes to apply to. So, what kinds of data are required there? The list of colleges that offer an MBA, and the qualifying examinations for each one of them: are there multiple exams, or do all of them go through the same entrance exam?
Other data would be the fees that these institutions charge, the number of seats available in each one, and the importance given to various aspects such as work experience. So, a good exercise at this point is to sit and write about 10 pieces of data that are required for any one situation. I have just described 2 situations. We could think of several business examples for which we could do this exercise. For example, if you are looking at conducting a big event such as the IPL, one could go back and write about 20 or 30 different types of data that could be required to make any decision on this. What could be the data required? One has to understand the dependencies in the data, and one could also look at even the player auctions separately.