Hi. In this module we are going to see some aspects of sample size: how many subjects you need to do a study, and some of the concepts that go behind the calculation of the sample size.

The usual question most investigators have in mind when they want to start a research study is: how many subjects should I recruit into my study? How many patients should I see? How many households should I cover? The simple answer to this question is that there is no simple answer. It requires a little bit of logical thinking, and it usually depends on information that we should already know before we start our study. So in this module we will try to understand the relationship between sample size and a concept I am going to introduce to you called power. We will also try to determine the sample size that is necessary to achieve a given level of power for estimating a simple proportion or any other measure of effect.

As I was mentioning, there is no simple answer for estimating a sample size. We need to go in a systematic manner, so let me take you step by step through the process.

First of all, identify the major study variable. In any investigation you will be looking at a number of things, and you have to identify which among them is the most important variable you want to study. Say, for example, you are studying scrub typhus in a community and you want to estimate the prevalence of scrub typhus. In that case the variable "whether a person has scrub typhus or not" is the major variable. Suppose your interest is not in the prevalence but in some of the associated factors, such as whether a person has been exposed to a forest. In that case that becomes the major study variable.

The second step is to determine the type of estimate you are going to report from the study: a mean, a ratio, or a percentage or proportion.
Accordingly, we need to choose the appropriate formula for computing the sample size.

Then one of the important things is that you need to indicate the expected frequency of the factor of interest. It is common sense: if you are going to study something very rare, you need a large sample, because unless you see a very large number of people you may not get a sufficient number of people with the factor of interest. On the other hand, if you are going to study something very common, you do not need a large sample; even a small sample may give a sufficiently precise estimate of your factor of interest.

The next factor is the desired precision of the estimate. How precise do you want your estimate to be? Do you want it to be within 5% on either side, or within 10% on either side? The more precise you want the estimate to be, the larger the sample you need. If you are willing to accept plus or minus 20%, your sample size will be smaller than for plus or minus 10%.

The next point is: I want plus or minus 10%, but how sure do I want to be that my estimate is within plus or minus 10%? What amount of risk am I willing to accept, a 5% risk or a 10% risk?

These are the elements that are essential to compute the sample size, and invariably they have to be given by the investigator to whoever is computing the sample size.

The other three items I have given are adjustments. First, adjust for the population size: will you pick your sample from a very large population, or from a small population? The sample size formulae usually assume that you are sampling from a very large population, so if you are sampling from a small population you need to apply an adjustment factor. The next is to adjust for the
estimated design effect. In my earlier lecture on sampling I talked about cluster design, where you select not individual subjects but clusters of subjects as your sample. There could be a correlation between the subjects in the same cluster, so to get over it you multiply your sample size by a factor called the design effect, so that you have a larger sample which takes care of this correlation within a cluster.

The last bullet point is to adjust for the expected response. Say you decide that you want to study 300 subjects, but when you go into the study maybe 10% of them do not turn up and you have only 270. To adjust for that, you add maybe 10% extra to your sample size, so that even with non-response you still have a sufficient sample to answer your question.

I am now going to introduce some concepts which are essential to understand the computation of sample size. The first is alpha, or the type 1 error, the significance level of a test. It is the probability of rejecting the null hypothesis when it is actually true. In statistical parlance this is called a type 1 error. The confidence level is the complement of that, 1 minus alpha, and it is the probability that an estimate of a population parameter is within certain specified limits of the true value.

The next are beta and power. Beta is the probability of failing to reject the null hypothesis when it is actually false. If something is false you have to reject it, but you accept it; making this mistake is called a type 2 error, and beta is its probability. The complement of this, 1 minus beta, is commonly denoted as power. It is a correct decision: the probability of correctly rejecting the null hypothesis when it is false.

Another important concept that goes into the computation of sample size is the
precision. By precision we mean a measure of how close an estimate is to the true value of a population parameter. It may be expressed in absolute terms or relative to the estimate, say plus or minus ten percentage points, or plus or minus ten percent of the estimate.

Now let us look at a scenario where you need to compute a sample size and your interest is to estimate the mean of a population. You have a sample, you compute the sample mean, and you want to estimate the population mean from it. What sample size do you need? The general idea is that the reliability coefficient multiplied by the standard error is called d, and through d we can estimate the sample size. The formula given here is d = z × σ / √n.

How do we get that? Mainly by using the concept of a sampling distribution. Suppose I take several samples of the same size, and each sample gives an estimate of the population mean. If I take the distribution of all those sample means, it is proved theoretically that the distribution is a normal distribution, and that the standard error, which is the standard deviation of that sampling distribution, is given by σ / √n. Using the principles of the normal distribution, 95 percent of the values lie within about two-sigma limits of this sampling distribution. So the z in the formula is nothing but the standard normal deviate for a particular level of significance. The idea of computing the sample size is that you fix d, and then you have only one unknown, namely the n in the denominator. Solving for n gives n = z² × σ² / d².

I think these concepts will get clearer if you see an example. A health department nutritionist wishes to do a
survey among a population of teenage girls to determine the average daily protein intake. That is the research problem. What information is needed to estimate the sample size? The nutritionist must provide three items of information: first, the desired width of the confidence interval; next, the level of confidence desired; and third, a rough magnitude of the population variance.

Assume that he gives them all. Say the nutritionist feels that an interval 10 units wide is what he is expecting, which means 5 units on either side. A confidence coefficient of 95% is decided upon, and from past experience and the literature review the nutritionist feels that the population standard deviation is probably about 20 grams. So now we have the information: z is 1.96, because a 95% confidence interval corresponds to a z value of 1.96 in the normal distribution; σ is given as 20; and the desired d is 5 units on either side. If you plug all these values into the formula, it becomes n = 1.96² × 20² / 5², which comes to 61.47. This means you need at least 62 teenage girls to estimate the mean protein intake, and 95 percent of the time the estimate you give will be within 5 units on either side of the true population mean of protein intake.

Now, in this formula we have σ, the population standard deviation, and in most scenarios you may not know its value, because this is the very study you are going to do. How do you get σ? One way is to do a pilot or preliminary survey and use the estimate available from it; of course you can even include the observations from the pilot in your final sample. Or you can use an estimate available from previous studies.
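The protein-intake calculation above can be sketched in Python as follows. This is a minimal illustration and not part of the lecture: the function name `sample_size_for_mean` is hypothetical, and z is passed in as the rounded value 1.96 used in the example.

```python
import math


def sample_size_for_mean(z: float, sigma: float, d: float) -> float:
    """Return the unrounded n = z^2 * sigma^2 / d^2 for estimating a mean.

    z     : standard normal deviate for the chosen confidence level
    sigma : assumed population standard deviation
    d     : desired precision (half-width of the confidence interval)
    """
    if d <= 0:
        raise ValueError("d must be positive")
    return (z ** 2) * (sigma ** 2) / (d ** 2)


# Values from the nutritionist example: 95% confidence, sigma = 20, d = 5.
n_raw = sample_size_for_mean(z=1.96, sigma=20, d=5)
n_required = math.ceil(n_raw)  # round up to a whole subject
print(round(n_raw, 2), n_required)
```

Rounding up rather than to the nearest integer keeps the achieved precision at least as good as the one requested.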
Suppose you have a large dataset available and you know the range of the data. Assuming that it is normally distributed, you can get an approximate value of σ as the range divided by six. These are all ways of getting the value of σ for your formula.

Now suppose we are going to estimate a proportion, not a mean. The formula is more or less similar, but you must have some knowledge of p, the proportion of the characteristic or factor of interest in the population. This also may not be available, because you are doing the study precisely to estimate the proportion. Invariably, when I ask the investigator this question, he says, "Sir, I am doing the study to find it out, so how can I have an idea of p?" As I mentioned earlier, you can do a pilot study to get an idea of p, or you can find out from the literature what the value of p could be. If that is impossible, the best thing is to take p as 0.5, which gives the maximum value of n. The formula is very simple: the z² in the numerator and the d² in the denominator are the same as before, and instead of σ² you have pq, where q is nothing but 1 minus p, so n = z² × p × q / d².

This also will be clearer with an example. Suppose we want to estimate the true immunization coverage in a community of school children. Previous studies tell us that the immunization coverage should be somewhere around 80%. For the absolute precision, we would like the result to be within 4 percentage points of the true value. The confidence level is conventionally taken as 95%, so alpha is 5 percent and z-alpha is 1.96. We then have all the values needed for our calculation: d, the absolute precision, is 0.04; p, the expected proportion in the population, is 0.8, so q must be 0.2; and z-alpha is 1.96. When we plug them all into the formula, what we get is about 384. So you need 384 subjects to get an estimate of the
immunization coverage within 4 percentage points on either side, and you have 95% confidence that the true value lies in this particular interval.

I talked about the design effect earlier. The design effect arises because of a bias in the variance introduced in the sampling design by selecting subjects whose results are not independent from each other. In a cluster, if a child in the first household is immunized, there is a large chance that the child in the next house is also immunized; you cannot say that they are absolutely independent. To account for that, you might need to multiply your sample size by a factor called the design effect.

Now suppose we are going to do a case-control study or a cohort study. How should you go about estimating the sample size? What you need are the desired values of the probabilities alpha and beta, and the proportion in the baseline, that is, the control or non-exposed population: in the case of case-control studies, the proportion exposed, and in the case of cohort studies, the proportion diseased. You need to have some idea of these, and they are often based on previous studies or reports. You should also have some idea of the magnitude of the expected effect, that is, the magnitude of the relative risk or the odds ratio. This again is based on previous studies or reports, and on the minimum effect that the investigator considers worth detecting. Once this information is provided, there are formulae readily available, with different formulae depending on the study design, the research question, and the type of data.

Now let us take a few examples, which should give you some idea of how the sample size is computed in different scenarios. Take, for example, a cohort study of oral contraceptive (OC) use in relation to the risk of myocardial
infarction among women of childbearing age. Previous studies have indicated that the proportion of non-OC users who are at risk of disease is 0.15; that is, 15 percent of non-OC-using women of childbearing age are at risk of myocardial infarction. The proportion of OC users who are at risk of disease is 0.25. Say the conventional alpha is 0.05, and suppose beta is taken as 0.20, that is, you want 80% power to detect the difference if it truly exists. Assume that you are going to have equal sample sizes for users and non-users. The formula then uses the following parameters: p0, the proportion of non-OC users who are diseased, is 0.15; p1, the proportion of OC users who are diseased, is 0.25; q0, the complement of p0, is 0.85; q1, the complement of p1, is 0.75; z-alpha is 1.96, as we saw in the last example; and z-beta is 0.84. Plugging all these values into the formula gives n = 246.96, or 247. So we need 247 OC users and 247 non-OC users, followed over a period, to get the desired result.

Now let us take an example of a case-control design: a case-control study of oral contraceptive use in relation to the risk of myocardial infarction among women of childbearing age. Previous studies say that 10% of women use OCs, and that the odds ratio (OR) of MI associated with current OC use is 1.8. The conventional alpha is 0.05, the conventional beta is 0.20, and we assume equal numbers of cases and controls. So you have all these parameters: p0, the proportion of controls who are current OC users, is 0.1; p1, the proportion of cases who are current OC users, is 0.18; q0 is 0.9; q1 is 0.82; z-alpha is 1.96; and z-beta is 0.84. If you plug them all into the formula you get 291.06, which indicates that you need to
have 292 cases and 292 controls in order to get an estimate of your odds ratio.

Now this slide gives you the required sample size for various ORs. For example, if you want to detect an OR of 1.2, you need a sample size of 3834, whereas to detect an OR of 3 it is enough to have 59 in each group. What this means is that if you want to detect a very small difference you need a large sample, and if you want to detect a large difference a small sample is enough; you will be able to detect an odds ratio of three or more.

In any of these analytical studies, when you are looking for an association of one variable with another, there could be a third factor affecting the value of this association, which in epidemiological parlance we call a confounder. There could be one or two variables which are confounders in a particular association. The general rule is that if you have confounders in your study, you hike your sample size by 10% for every confounding variable.

Having seen different scenarios where the sample size is computed, and the concepts behind it, I am going to introduce you to two pieces of software, both free and open source, which can be used very easily to compute sample sizes for different study designs. One is called OpenEpi. This software is supported by the CDC, Atlanta, and its website is www.openepi.com. Depending on the study design you have, you plug in the values that the software asks for and you get the desired sample size. The other is PS, that is, Power and Sample Size Calculation, from the Department of Biostatistics, Vanderbilt University. This is also open-source and fairly user-friendly software with which you can compute sample sizes. Whenever you do an investigation and compute the sample size with software, you cite the software in your
references, saying that you used this particular software and that these are the assumptions, or the values, that you plugged into it to arrive at your sample size. This should be reflected in your methods section.

So, to recap the sample size module: there is no magic number available as a sample size. Sample sizes have to be computed using various parameters supplied by the investigator. The investigator may have some idea of these readily available; if not, you may have to do a pilot study to get an idea. It then depends on how much risk you are willing to take and how much precision you want in your estimates. There is no fixed number; we can always negotiate depending on the resources available in terms of money and time. Suppose I say you need 300 and you do not have that much resource or time: you can always reduce the sample size, but you should know what price you are going to pay for reducing it. You may have to compromise on the precision, or on the risk that you will be taking with these estimates. Thank you so much.
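The worked examples in this module can be reproduced with a short Python sketch. The lecture does not spell out the two-group formula; the expression used below, n = (p0·q0 + p1·q1) × (z_alpha + z_beta)² / (p1 − p0)², is an assumption, chosen because it reproduces the quoted results (246.96 and 291.06). All function names are hypothetical, the z values are the rounded ones used in the lecture, and the adjustment helper reflects one simple reading of the rules of thumb mentioned (multiply by the design effect, add 10% per confounder, add an allowance for non-response).

```python
import math


def n_for_proportion(p: float, d: float, z: float = 1.96) -> float:
    """Unrounded n = z^2 * p * (1 - p) / d^2 for estimating one proportion."""
    return z ** 2 * p * (1 - p) / d ** 2


def n_per_group(p0: float, p1: float,
                z_alpha: float = 1.96, z_beta: float = 0.84) -> float:
    """Unrounded n per group for comparing two proportions.

    Assumed formula (not stated in the lecture):
    (p0*q0 + p1*q1) * (z_alpha + z_beta)^2 / (p1 - p0)^2
    """
    variance_term = p0 * (1 - p0) + p1 * (1 - p1)
    return variance_term * (z_alpha + z_beta) ** 2 / (p1 - p0) ** 2


def adjust(n: float, design_effect: float = 1.0,
           n_confounders: int = 0, non_response: float = 0.0) -> int:
    """Inflate n for clustering, confounders and non-response, then round up.

    Assumption: the 10% per confounder is applied additively, and
    non-response is handled by adding that fraction to the sample.
    """
    n = n * design_effect
    n = n * (1 + 0.10 * n_confounders)
    n = n * (1 + non_response)
    return math.ceil(n)


coverage = n_for_proportion(p=0.8, d=0.04)     # immunization example
cohort = n_per_group(p0=0.15, p1=0.25)         # cohort example
case_control = n_per_group(p0=0.10, p1=0.18)   # case-control example

print(round(coverage, 2), math.ceil(cohort), math.ceil(case_control))

# Hypothetical adjustment: design effect of 2 and 10% non-response.
print(adjust(coverage, design_effect=2.0, non_response=0.10))
```

With p0 = 0.10 and p1 taken as OR × p0, the same `n_per_group` function gives 3834 for an OR of 1.2 and 59 for an OR of 3 after rounding up, which matches the figures quoted from the slide.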