Content
Basics
Workbook
Excel links
Calculator
Analysis
Statistics
P values
Confidence Intervals
Degrees of freedom
Epidemiology
Causality
Bias
Confounding
Prospective vs. retrospective studies
StatsDirect combines numerical tools with help to assist you in the design, analysis, inference and presentation of quantitative research. It is not a substitute for a statistician. The best statistical practice is achieved by co-operation of investigator and statistician, starting at the planning stage of a study. StatsDirect can help an investigator to understand the basics of statistical practice and to carry out the most commonly used analyses. In this way, StatsDirect improves communication between investigator and statistician. We appreciate that many investigators do not have the resources to consult with a statistician; therefore, we have used this help text to address statistical misconceptions that investigators commonly present to us. StatsDirect data input and result screens use as little statistical jargon as possible in order to improve understanding by the non-statistician.
The StatsDirect help system is not meant to replace statistical textbooks or emulate their style of presentation. This help system is designed for on-screen use within StatsDirect. For further reading we recommend that you seek out key references listed in the reference list.
Sections of StatsDirect help relating to common situations are:
Statistics
Epidemiology
Understanding P values and degrees of freedom
Understanding confidence intervals
Causality, bias and confounding
Retrospective vs. prospective studies
Reference list
Statistical method selection
Contacts
The StatsDirect workbook operates like spreadsheet software such as Microsoft Excel. If you know how to use Excel then you will find your way around the StatsDirect worksheet easily.
This page gives the most basic information that you need to get started using the StatsDirect workbook. Please also read the following sections in order to get the most out of managing your data in StatsDirect. You may wish to print out and digest the first three of the following sections.
WORKBOOK AND WORKSHEET BASICS
WORKING WITH DATA IN WORKSHEETS
FORMATTING WORKSHEETS
WORKSHEET FUNCTIONS
INTRODUCTION
A workbook consists of worksheets that are divided into rows and columns forming a matrix of cells. You may enter data or formulae into cells. The active cell is highlighted with a rectangular border; you can move the active cell by clicking on another cell with the mouse or by using the cursor keys (arrows).
ENTERING DATA
Numbers entered can range from -1E-307 to 1E+307. Text is any sequence of letters and numbers which the workbook does not recognise as another form of data. If you want to enter numbers as labels then you must put quotes around them, e.g. "2" is treated as a string and 2 is treated as a number. See dates and times for more information on these data forms. Logical data are true or false; true is represented by 1 and false by 0 in any analysis you perform. If you want to enter coded data such as M, F, Male, Female etc. then please use the search and replace function to convert them to numerical codes before analysis.
COPYING AND MOVING DATA
Most spreadsheet software enables you to select cells and copy/paste/delete/move them. Selected cells are displayed with a black background instead of white. To select an entire sheet you can click on the top left hand cell where row headings intersect with column headings. To select a single range of cells hold down the left mouse button and drag the mouse over them, or hold down the shift key and use the arrow keys to highlight the range. To select more than one range at a time hold down the Ctrl key and select ranges as described. Once you have the range(s) you want selected you can copy, delete, paste or move the data. To copy the selected range(s) to memory, hold down the Ctrl key and press the C key. You can then hold down the control (Ctrl) key and press the V key to retrieve the range and paste it into the same workbook at a different location, another StatsDirect workbook or an external application such as Microsoft Excel. Data can be copied from external spreadsheets into StatsDirect in this way also. To move a selected range, position the mouse cursor over a border of the range and drag the range using the mouse. You can also use cut and paste operations to move ranges. The delete key clears a selected range. If you accidentally delete or move data then press Ctrl+Z or select "Undo" from the edit menu to undo the change.
LABELLING COLUMNS
StatsDirect analyses most data in columns. The label for a column can be put in the first row or in any other area of that column which might be selected as a sub-column to analyse, i.e. one column of the workbook can contain more than one column for analysis. If a selected range of cells does not contain a string at the top then StatsDirect uses the workbook column label as its heading; otherwise the first string in a selected range is used as the label for that selection. The test data file supplied with StatsDirect uses text strings in row 1 as labels for each column. You can set the header label of a column by double clicking on it with the left mouse button. Row labels can be set in the same way.
ENTERING FORMULAE
If you are familiar with using formulae in spreadsheets such as Microsoft Excel then you will be able to use formulae in StatsDirect workbooks without further instruction. An entry is treated as a formula if you start with the equals sign "=", e.g. =A1-B1 gives the subtraction of column 2 row 1 from column 1 row 1. A range of cells is represented by a colon, e.g. A1:B10 is the range from column 1 row 1 to column 2 row 10. If you want to repeat a formula but make it relevant to the cells below or to the right then enter the first formula and move the mouse cursor to the bottom right hand corner of the cell. Now hold down the left mouse button and drag down; you will see that the formulae below are not just copies of the parent formula but that all cell references have been translated to those relevant to the particular row/column you have dragged to. Here the cell references change because they are relative; if you want to make them absolute then put a dollar sign "$" before either part of the reference, e.g. $A1 is absolute column 1, relative row 1 and $A$1 is absolute column 1, absolute row 1. See Worksheet Functions for information on the functions you can use in formulae.
HINTS
1. Confidence intervals (CI) are very useful in statistical inference. StatsDirect places strong emphasis on confidence interval analysis. Wherever possible, the most exact method for the CI has been used. Before calculation of a CI a dialogue box asks you to select a coefficient of confidence. The default 95% confidence level is selected routinely when you press the enter key. You can turn off this dialogue box using the options selection of the analysis menu; in this situation a 95% CI is selected automatically.
2. Some of the StatsDirect functions are time consuming. When a process is taking an appreciable amount of time the mouse pointer changes from an arrow to an hourglass and a progress meter is often displayed.
StatsDirect can share data with Microsoft Excel in two ways:
A. Reads Excel compatible files directly
B. Links via the StatsDirect-Excel link add-in (for Excel 5, 95, 97 or 2000).
Reading Excel files into StatsDirect
StatsDirect can read Microsoft Excel compatible files directly: use the Open item in the File menu.
Using the StatsDirect-Excel link add-in with Microsoft Excel 5, 95, 97 or 2000
The first time you run StatsDirect it will look to see if you have a compatible version of Microsoft Excel; if you do then StatsDirect will install an add-in into Excel that gives you a new menu in Excel called "StatsDirect". This add-in provides a data link between Excel and StatsDirect. If you are working in Excel and wish to analyse data in StatsDirect then all you need do is select "StatsDirect" from the "StatsDirect" menu in Excel. When a workbook is transferred from Excel to StatsDirect it will be labelled "~Excel" in StatsDirect.
If the automatic installation of the Excel-StatsDirect data link add-in fails then you can add it manually: start Excel, go to the "Tools" section of the menu, then to "Add-Ins", then to "Browse" and search for StatsDirectExcelLink.xla in the directory where you installed StatsDirect (usually C:\Program Files\StatsDirect).
The StatsDirectExcelLink.xla file is copied to the current user’s startup directory for Excel, which means that it loads when Excel starts up. You can switch this on or off manually via the Tools_Setup Tools menu.
Menu location: Tools_StatsDirect Calculator.
The StatsDirect calculator can be used both within StatsDirect and independently as a replacement for the Windows calculator. The calculator evaluates expressions in the form of simple arithmetic or more complex algebra.
All calculations are performed in IEEE double precision.
The "save" button copies the expression currently evaluated and its result to a list from which you can select saved expressions to paste into new ones.
When you close the calculator it will paste saved expressions and results into a report in StatsDirect if the "save results to report on exit" box at the bottom left of the calculator is checked.
Constants | |
PI | 3.14159265358979323846 (π) |
EE | 2.71828182845904523536 (e) |
Arithmetic Functions | |
ABS | absolute value |
CLOG | common (base 10) logarithm |
CEXP | anti log (base 10) |
EXP | anti log (base e) |
LOG | natural (base e, Naperian) logarithm |
LOGIT | logit: log(p/(1-p)), p=proportion |
ALOGIT | antilogit: exp(l)/(1+exp(l)), l=logit |
SQR or SQRT | square root |
! | factorial (maximum 170.569) |
LOG! | log factorial |
IZ | normal deviate for a p value |
UZ | upper tail p for a normal deviate |
LZ | lower tail p for a normal deviate |
TRUNC or FIX | integer part of a real number |
CINT | real number rounded to nearest integer |
INT | real number truncated to integer closest to zero |
Please note that the largest factorial allowed is 170.569398315538748, but you can work with log factorials via the LOG! function, e.g. LOG!(272).
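The LOGIT, ALOGIT and LOG! entries above can be sketched outside the calculator. The following Python lines are an illustration only, not StatsDirect code: the function names are made up for this example, and it is an assumption that LOG! returns the natural logarithm of the factorial (computed here through the log-gamma function, since ln(n!) = lgamma(n + 1)).

```python
import math


def logit(p: float) -> float:
    """Logit of a proportion p (0 < p < 1): log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def alogit(l: float) -> float:
    """Antilogit of l: exp(l) / (1 + exp(l))."""
    return math.exp(l) / (1.0 + math.exp(l))


def log_factorial(n: float) -> float:
    """Natural log of n! via the log-gamma function (assumed meaning of LOG!)."""
    return math.lgamma(n + 1.0)


p = 0.25
print(logit(p))            # a negative value, because p < 0.5
print(alogit(logit(p)))    # recovers the original proportion
print(log_factorial(272))  # finite, although 272! itself overflows a double
```

Working on the log scale keeps the result finite for arguments such as 272, whose factorial is far too large to hold in double precision.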
Arithmetic Operators | |
^ | exponentiation (to the power of) |
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
\ | integer division |
Calculations give an order of priority to arithmetic operators; this must be considered when entering expressions. For example, the result of the expression "6 - 3/2" is 4.5 and not 1.5 because division takes priority over subtraction.
Priority of arithmetic operators in descending order
1. Exponentiation (^)
2. Negation (-X) (Exception = x^-y; e.g. 4^-2 is 0.0625 and not -16)
3. Multiplication and Division (*, /)
4. Integer Division (\)
5. Addition and Subtraction (+, -)
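The priority rules above can be compared with those of a general-purpose language. The Python sketch below is an illustration, not calculator syntax: Python writes exponentiation as ** and integer division as //. The reading of 7 \ 2 * 3 as 7 \ (2 * 3) is an assumption inferred from the priority list, and Python needs explicit parentheses to reproduce it.

```python
# Division binds more tightly than subtraction, as in the "6 - 3/2" example.
a = 6 - 3 / 2        # 4.5, not 1.5

# Exponentiation binds more tightly than a leading negation ...
b = -2 ** 2          # -4, i.e. -(2 ** 2)

# ... but a negated exponent is applied to the exponent first.
c = 4 ** -2          # 0.0625, i.e. 4 ** (-2)

# Assumption: in the list above integer division ranks BELOW * and /,
# so 7 \ 2 * 3 would be read as 7 \ (2 * 3). Python's // ranks WITH * and /,
# so parentheses are needed to get the same reading.
d = 7 // (2 * 3)     # 1
e = 7 // 2 * 3       # 9 in Python (evaluated left to right)

print(a, b, c, d, e)
```

When in doubt about priority in any tool, adding parentheses makes the intended order explicit.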
Trigonometric Functions | |
ARCCOS | arc cosine |
ARCCOSH | arc hyperbolic cosine |
ARCCOT | arc cotangent |
ARCCOTH | arc hyperbolic cotangent |
ARCCSC | arc cosecant |
ARCCSCH | arc hyperbolic cosecant |
ARCTANH | arc hyperbolic tangent |
ARCSEC | arc secant |
ARCSECH | arc hyperbolic secant |
ARCSIN | arc sine |
ARCSINH | arc hyperbolic sine |
ATN | arc tangent |
COS | cosine |
COT | cotangent |
COTH | hyperbolic cotangent |
CSC | cosecant |
CSCH | hyperbolic cosecant |
SIN | sine |
SINH | hyperbolic sine |
SECH | hyperbolic secant |
SEC | secant |
TAN | tangent |
TANH | hyperbolic tangent |
To convert degrees to radians, multiply degrees by pi/180. To convert radians to degrees, multiply radians by 180/pi.
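The two conversions can be written as a pair of small helpers. This Python sketch is illustrative only; the function names are hypothetical, and it assumes (as the note above implies) that trigonometric functions expect angles in radians.

```python
import math


def deg_to_rad(degrees: float) -> float:
    """Multiply degrees by pi/180."""
    return degrees * math.pi / 180.0


def rad_to_deg(radians: float) -> float:
    """Multiply radians by 180/pi."""
    return radians * 180.0 / math.pi


# The sine of 30 degrees, after converting the angle to radians first.
print(math.sin(deg_to_rad(30.0)))   # close to 0.5
print(rad_to_deg(math.pi / 2.0))    # close to 90
```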
Logical Functions | |
AND | logical AND |
NOT | logical NOT |
OR | logical OR |
< | less than |
= | equal to |
> | greater than |
Available under the Analysis menu section at all times:
EXACT TESTS ON COUNTS
CHI-SQUARE TESTS
PROPORTIONS
RATES
SAMPLE SIZE
DISTRIBUTIONS
RANDOMIZATION
MISCELLANEOUS
Also available under the Analysis menu section when a workbook is active:
DESCRIPTIVE STATISTICS
PARAMETRIC METHODS
NON-PARAMETRIC METHODS
ANALYSIS OF VARIANCE
REGRESSION AND CORRELATION
AGREEMENT ANALYSIS
SURVIVAL ANALYSIS
META-ANALYSIS
CROSSTABS
FREQUENCIES
GRAPHICS
Statistics with an upper case letter S refers to the science and discipline of Statistics, which can be defined as the measurement of uncertainty.
Statistics with a lower case letter s refers to numbers that summarise other numbers in some way. For example, the arithmetic mean or average value of a sample of numbers is a statistic commonly used to describe the central location of the distribution of numbers in the population from which the sample was drawn.
The terms sample and population are very important in the language used by Statisticians. Many statistical methods are based upon drawing a sample at random from a population because it would be impractical to study the whole population. Samples drawn at random have mathematical properties that have enabled Statisticians to create numerical methods that measure how uncertain an investigator should be that their sample represents the population they are studying.
You should be familiar with the basic concepts of Statistics before you use this software. Please digest some introductory learning materials, such as Bland (2000) or selected web sites.
The following are basic elements of Statistics that you should understand:
Understanding P values and degrees of freedom
Understanding confidence intervals
Basics
The P value or calculated probability is the estimated probability of rejecting the null hypothesis (H0) of a study question when that hypothesis is true.
The null hypothesis is usually a hypothesis of "no difference", e.g. no difference between blood pressures in group A and group B. Define a null hypothesis for each study question clearly before the start of your study.
The only situation in which you should use a one-sided P value is when a large change in an unexpected direction would have absolutely no relevance to your study. This situation is unusual; if you are in any doubt then use a two-sided P value.
The term significance level (alpha) is used to refer to a pre-chosen probability and the term "P value" is used to indicate a probability that you calculate after a given study.
The alternative hypothesis (H1) is the opposite of the null hypothesis; in plain language terms this is usually the hypothesis you set out to investigate. For example, the question is "is there a significant (not due to chance) difference in blood pressures between groups A and B if we give group A the test drug and group B a sugar pill?" and the alternative hypothesis is "there is a difference in blood pressures between groups A and B if we give group A the test drug and group B a sugar pill".
If your P value is less than the chosen significance level then you reject the null hypothesis, i.e. accept that your sample gives reasonable evidence to support the alternative hypothesis. It does NOT imply a "meaningful" or "important" difference; that is for you to decide when considering the real-world relevance of your result.
The choice of significance level at which you reject H0 is arbitrary. Conventionally the 5% (less than 1 in 20 chance of being wrong), 1% and 0.1% (P < 0.05, 0.01 and 0.001) levels have been used. These numbers can give a false sense of security.
In the ideal world, we would be able to define a "perfectly" random sample, the most appropriate test and one definitive conclusion. We simply cannot. What we can do is try to optimise all stages of our research to minimise sources of uncertainty. When presenting P values some groups find it helpful to use the asterisk rating system as well as quoting the P value:
P < 0.05 *
P < 0.01 **
P < 0.001 ***
Most authors refer to statistically significant as P < 0.05 and statistically highly significant as P < 0.001 (less than one in a thousand chance of being wrong).
The asterisk system avoids the woolly term "significant". Please note, however, that many statisticians do not like the asterisk rating system when it is used without showing P values. As a rule of thumb, if you can quote an exact P value then do. You might also want to refer to a quoted exact P value as an asterisk in text narrative or tables of contrasts elsewhere in a report.
At this point, a word about error. Type I error is the false rejection of the null hypothesis and type II error is the false acceptance of the null hypothesis. As an aide-mémoire: think that our cynical society rejects before it accepts.
The significance level (alpha) is the probability of type I error. The power of a test is one minus the probability of type II error (beta). Power should be maximised when selecting statistical methods. If you want to estimate sample sizes then you must understand all of the terms mentioned here.
The following table shows the relationship between power and error in hypothesis testing:
TRUTH | DECISION: Accept H0 | DECISION: Reject H0 |
H0 is true | correct decision, P = 1-alpha | type I error, P = alpha (significance) |
H0 is false | type II error, P = beta | correct decision, P = 1-beta (power) |
H0 = null hypothesis; P = probability
If you are interested in further details of probability and sampling theory at this point then please refer to one of the general texts listed in the reference section.
You must understand confidence intervals if you intend to quote P values in reports and papers. Statistical referees of scientific journals expect authors to quote confidence intervals with greater prominence than P values.
Notes about Type I error:
is the incorrect rejection of the null hypothesis
maximum probability is set in advance as alpha
is not affected by sample size as it is set in advance
increases with the number of tests or end points (i.e. do 20 tests and 1 is likely to be wrongly significant)
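The last note can be made concrete with a short calculation. The Python sketch below is illustrative and rests on an assumption not stated in the notes: the tests are independent and every null hypothesis is true. The function names are made up for this example.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false rejection across n independent tests
    when every null hypothesis is true (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Average number of wrongly significant results under the same assumption."""
    return alpha * n_tests


for k in (1, 5, 20):
    print(k, round(familywise_error(0.05, k), 3), expected_false_positives(0.05, k))
```

With 20 tests at the 5% level the expected number of wrongly significant results is one, and the chance of at least one is roughly 64% under these assumptions.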
Notes about Type II error:
is the incorrect acceptance of the null hypothesis
probability is beta
beta depends upon sample size and alpha
can't be estimated except as a function of the true population effect
beta gets smaller as the sample size gets larger
beta gets smaller as the number of tests or end points increases
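The dependence of beta on sample size and alpha can be sketched for one simple case. The Python example below assumes a two-sided one-sample z test with a known standard deviation; this choice of test, the function name and the effect size of 0.5 are assumptions made for illustration and are not taken from the help text.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect: float, sd: float, n: int, alpha: float = 0.05) -> float:
    """Power (1 - beta) of a two-sided one-sample z test with known sd,
    when the true mean differs from the null value by `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = effect * sqrt(n) / sd
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


for n in (10, 20, 50, 100):
    power = z_test_power(effect=0.5, sd=1.0, n=n)
    print(n, round(power, 3), round(1.0 - power, 3))  # n, power, beta
```

In this sketch beta shrinks as n grows, and it can only be computed once a true effect size has been assumed, which matches the notes above.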
Statisticians stress the importance of using confidence intervals (CIs). There is, however, debate over which type of CIs to use and how best to define and interpret them. In spite of this confusion, you should use CIs to express the results of statistical tests because they convey more information than P values alone.
StatsDirect documentation uses the common (see below) interpretation of CIs. The CI included with each StatsDirect function is discussed in the help text for that function. In order to understand how CIs relate to specific statistical methods, read the interpretation of CI in the worked examples of StatsDirect help text.
The confidence level sets the boundaries of a confidence interval; this is conventionally set at 95% to coincide with the 5% convention of statistical significance in hypothesis testing. In some studies narrower (e.g. 90%) or wider (e.g. 99%) confidence intervals will be required. This rather depends upon the nature of your study. You should consult a statistician before using CIs other than 95%.
You will hear the terms confidence interval and confidence limit used. The confidence interval is the range Q-X to Q+Y where Q is the value that is central to the study question, Q-X is the lower confidence limit and Q+Y is the upper confidence limit.
Familiarise yourself with alternative CI interpretations:
Common
A 95% CI is the interval that you are 95% certain contains the true population value as it might be estimated from a much larger study.
The value in question can be a mean, difference between two means, a proportion etc. The CI is usually, but not necessarily, symmetrical about this value.
Pure Bayesian
The Bayesian concept of a credible interval is sometimes put forward as a more practical concept than the confidence interval. For a 95% credible interval, the value of interest (e.g. size of treatment effect) lies with a 95% probability in the interval. This interval is then open to subjective moulding of interpretation. Furthermore, the credible interval can only correspond exactly to the confidence interval if the prior probability is so-called "uninformative".
Pure frequentist
Most pure frequentists say that it is not possible to make probability statements, such as CI interpretation, about the study values of interest in hypothesis tests.
Neymanian
A 95% CI is the interval which will contain the true value on 95% of occasions if a study were repeated many times using samples from the same population.
Neyman originated the concept of CI as follows: If we test a large number of different null hypotheses at one critical level, say 5%, then we can collect all of the null hypotheses that are not rejected into one set. This set usually forms a continuous interval that can be derived mathematically and Neyman described the limits of this set as confidence limits that bound a confidence interval. If the critical level (probability of incorrectly rejecting the null hypothesis) is 5% then the interval is 95%. Any values of the treatment effect that lie outside the confidence interval are regarded as "unreasonable" in terms of hypothesis testing at the critical level.
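The Neymanian reading above (the interval contains the true value on 95% of occasions under repeated sampling) can be checked by simulation. The Python sketch below is an illustration under simplifying assumptions that are not from the help text: normally distributed data, a known population standard deviation (so a z-based interval is used), and made-up values for the mean, standard deviation and sample size.

```python
import random
from statistics import NormalDist, fmean


def coverage(true_mean: float = 10.0, sigma: float = 2.0, n: int = 30,
             repeats: int = 2000, level: float = 0.95, seed: int = 1) -> float:
    """Fraction of repeated samples whose CI for the mean contains true_mean."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sigma / n ** 0.5  # known-sigma (z) interval half width
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = fmean(sample)
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / repeats


print(coverage())             # expected to be close to 0.95
print(coverage(level=0.99))   # wider intervals, so coverage close to 0.99
```

Each simulated study either captures the true mean or misses it; the 95% figure describes the long-run proportion of captures, not any single interval.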
See also P values.
Copyright © 1990-2006 StatsDirect Limited, all rights reserved
The concept of degrees of freedom is central to the principle of estimating statistics of populations from samples of them. "Degrees of freedom" is commonly abbreviated to df.
In short, think of df as a mathematical restriction that we need to put in place when we calculate an estimate of one statistic from an estimate of another.
Let us take an example of data that have been drawn at random from a normal distribution. Normal distributions need only two parameters (mean and standard deviation) for their definition; e.g. the standard normal distribution has a mean of 0 and standard deviation (sd) of 1. The population values of mean and sd are referred to as mu and sigma respectively, and the sample estimates are x-bar and s.
In order to estimate sigma, we must first have estimated mu. Thus, mu is replaced by x-bar in the formula for sigma. In other words, we work with the deviations from mu estimated by the deviations from x-bar. At this point, we need to apply the restriction that the deviations must sum to zero. Thus, degrees of freedom are n-1 in the equation for s below:
Standard deviation in a population is:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$$
[x is a value from the population, \mu is the mean of all x, n is the number of x in the population, \sum is the summation]
The estimate of population standard deviation calculated from a random sample is:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
[x is an observation from the sample, \bar{x} (x-bar) is the sample mean, n is the sample size, \sum is the summation]
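The difference between the divisor n and the divisor n-1 can be seen numerically. The Python sketch below uses made-up data values for illustration and compares a hand calculation with the pstdev (population) and stdev (sample) functions of the standard statistics module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values for illustration
n = len(data)
mean = sum(data) / n
deviations = [x - mean for x in data]
sum_sq = sum(d * d for d in deviations)

# The restriction behind n - 1: deviations from the sample mean sum to zero.
print(sum(deviations))

population_sd = math.sqrt(sum_sq / n)        # divisor n
sample_sd = math.sqrt(sum_sq / (n - 1))      # divisor n - 1 (degrees of freedom)

print(population_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```

The sample estimate is always the larger of the two for the same data, and the gap narrows as n grows.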
When this principle of restriction is applied to regression and analysis of variance, the general result is that you lose one degree of freedom for each parameter estimated prior to estimating the (residual) standard deviation.
Another way of thinking about the restriction principle behind degrees of freedom is to imagine contingencies. For example, imagine you have four numbers (a, b, c and d) that must add up to a total of m; you are free to choose the first three numbers at random, but the fourth must be chosen so that it makes the total equal to m; thus you have three degrees of freedom.
See also:
normal distribution
standard deviation
Epidemiology is the study of the distribution and determinants of health-related states and events in specified populations.
Last's Dictionary of Epidemiology (2000)
Epidemiologists use a rich language to describe how they apply statistical methods to the study of populations in order to work out, for example, the causes of diseases.
If a population is exposed to some factor, called the exposure, then Epidemiologists usually study the relationship between the exposure and relevant health outcomes, for example cigarette smoking and lung cancer.
A very important question that Epidemiologists must ask themselves when thinking about a numerical association between some exposure(s) and outcome(s) is "how might I be wrong?". The answer is by:
Chance
Bias
Confounding
You should understand the basic concepts of causality, chance, bias and confounding in order to start to work with epidemiological problems. You should also understand the basic principles of study design, for example prospective vs. retrospective studies.
There are several introductory textbooks, either under a cover title of Epidemiology or Statistics (e.g. Bland 2000) and web sites.
Basics
Lots of things can be associated with outcomes that we wish to study but few of them are meaningful causes.
In Epidemiology, the following criteria due to Bradford-Hill are used as evidence to support a causal association:
1. Plausibility (known path)
2. Consistency (same results if repeated in a different time, place or person)
3. Temporal relationship
4. Strength (with or without a dose-response relationship)
5. Specificity (causal factor relates only to the outcome in question - not often)
6. Change in risk factor (i.e. incidence drops if risk factor removed)
Elwood's criteria are a modern extension of this concept:
1. Descriptive evidence
exposure or intervention
design
population
main result
2. Non-causal explanation
chance
bias
confounding
3. Positive features
time
strength
dose-response
consistency
specificity
4. Generalisability
to eligible population
to source population
to other populations
5. Comparison with other evidence
consistency
specificity
plausibility and coherence
Bias is a systematic error that leads to an incorrect estimate of effect or association. Many factors can bias the results of a study such that they cancel out, reduce or amplify a real effect you are trying to describe.
Epidemiology categorises types of bias; examples are:
Selection bias - e.g. study of car ownership in central London is not representative of the UK
Observation bias (recall and information) - e.g. on questioning, healthy people are more likely to under-report their alcohol intake than people with a disease.
Observation bias (interviewer) - e.g. different interviewer styles might provoke different responses to the same question.
Observation bias (misclassification) - tends to dilute an effect
Losses to follow up - e.g. ill people may not feel able to continue with a study whereas healthy people tend to complete it.
Some strategies to combat bias:
multiple control groups
standardised observations (e.g. blinding (not knowing whether placebo or active intervention) of subject, observer, both subject and observer (double blind) or subject, observer and analyst (triple blind))
corroboration of multiple information sources
use of dummy variables with known associations
See also: confounding.
In Epidemiology a confounder:
is not part of the real association between exposure and disease
predicts disease
is unequally distributed between exposure groups
A researcher can only control a study or analysis for confounders that are:
known
measurable
Example: Grey hair predicts heart disease if it is put into a multiple regression model because it is unequally distributed between people who do have heart disease (the elderly) and those who don't (the young). Grey hair confounds thinking about heart disease because it is not a cause of heart disease.
Strategies to reduce confounding are:
randomisation (aim is random distribution of confounders between study groups)
restriction (restrict entry to study of individuals with confounding factors - risks bias in itself)
matching (of individuals or groups, aim for equal distribution of confounders)
stratification (confounders are distributed evenly within each stratum)
adjustment (usually distorted by choice of standard)
multivariate analysis (only works if you can identify and measure the confounders)
Prospective
A prospective study watches for outcomes, such as the development of a disease, during the study period and relates this to other factors such as suspected risk or protection factor(s). The study usually involves taking a cohort of subjects and watching them over a long period. The outcome of interest should be common; otherwise, the number of outcomes observed will be too small to be statistically meaningful (indistinguishable from those that may have arisen by chance). All efforts should be made to avoid sources of bias such as the loss of individuals to follow up during the study. Prospective studies usually have fewer potential sources of bias and confounding than retrospective studies.
Retrospective
A retrospective study looks backwards and examines exposures to suspected risk or protection factors in relation to an outcome that is established at the start of the study. Many valuable case-control studies, such as Lane-Claypon's 1926 investigation of risk factors for breast cancer, were retrospective investigations. Most sources of error due to confounding and bias are more common in retrospective studies than in prospective studies. For this reason, retrospective investigations are often criticised. If the outcome of interest is uncommon, however, the size of prospective investigation required to estimate relative risk is often too large to be feasible. In retrospective studies the odds ratio provides an estimate of relative risk. You should take special care to avoid sources of bias and confounding in retrospective studies.
Prospective investigation is required to make precise estimates of either the incidence of an outcome or the relative risk of an outcome based on exposure.
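The statement that the odds ratio estimates relative risk can be illustrated with a 2 by 2 table. The Python sketch below uses invented counts (they are not data from any study) and hypothetical function names; it compares the two measures when the outcome is rare and when it is common.

```python
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """a, b = exposed with/without the outcome; c, d = unexposed with/without."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Cross-product ratio of the same 2 by 2 table."""
    return (a * d) / (b * c)


# Invented counts for illustration only.
rare = (30, 970, 10, 990)       # outcome in 3% of exposed, 1% of unexposed
common = (300, 700, 100, 900)   # outcome in 30% of exposed, 10% of unexposed

for label, table in (("rare", rare), ("common", common)):
    print(label, round(relative_risk(*table), 3), round(odds_ratio(*table), 3))
```

With the rare outcome the odds ratio stays close to the relative risk; with the common outcome it drifts further away, which is why the approximation is usually quoted for uncommon outcomes.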
Case-Control studies
Case-Control studies are usually but not exclusively retrospective; the opposite is true for cohort studies. The following notes relate case-control to cohort studies:
outcome is measured before exposure
controls are selected on the basis of not having the outcome
good for rare outcomes
relatively inexpensive
smaller numbers required
quicker to complete
prone to selection bias
prone to recall/retrospective bias
related methods are risk (retrospective), chi-square 2 by 2 test, Fisher's exact test, exact confidence interval for odds ratio, odds ratio meta-analysis and conditional logistic regression.
Cohort studies
Cohort studies are usually but not exclusively prospective; the opposite is true for case-control studies. The following notes relate cohort to case-control studies:
outcome is measured after exposure
yields true incidence rates and relative risks
may uncover unanticipated associations with outcome
best for common outcomes
expensive
requires large numbers
takes a long time to complete
prone to attrition bias (compensate by using person-time methods)
prone to the bias of change in methods over time
related methods are risk (prospective), relative risk meta-analysis, risk difference meta-analysis and proportions