
Medical Statistics e-textbook: Regression and Correlation


Contents

Regression and Correlation

Regression and correlation
Simple linear regression
Multiple (general) linear regression
Partial correlation
Grouped linear regression
Grouped linear regression with covariance analysis
Linearity of regression
Polynomial regression
Linearized regression estimates
Probit analysis
Non-linear models
Principal components analysis
Logistic regression
Conditional logistic regression
Poisson regression
Kendall's rank correlation
Spearman's rank correlation
Non-parametric linear regression
Cox regression



Regression and correlation

· Simple linear & correlation

· Multiple (general) linear

· Regression in groups

· Principal components

· Linearized

· Polynomial

· Logistic regression

· Conditional logistic regression

· Poisson regression

· Probit analysis

· Spearman's rank correlation

· Kendall's rank correlation

· Non-parametric linear regression

· Cox regression

Menu location: Analysis_Regression & Correlation.

Regression

Regression is a way of describing how one variable, the outcome, is numerically related to predictor variables. The outcome variable is also referred to as Y, the dependent or response variable, and is plotted on the vertical axis (ordinate) of a graph. The predictor variable(s) are also referred to as X, independent, prognostic or explanatory variables, and are plotted on the horizontal axis (abscissa) of a graph.

Looking at a plot of the data is an essential first step. For the birth weight data used in the worked example below, such a plot suggests that lower birth weight babies grow faster from 70 to 100 days than higher birth weight babies. Linear regression can be used to fit a straight line to these data:

Equation: Y = a + bx

· b is the gradient, slope or regression coefficient

· a is the intercept of the line at the Y axis, or regression constant

· Y is a value for the outcome

· x is a value for the predictor

The fitted equation describes the best linear relationship between the population values of X and Y that can be found using this method.

The method used to fit the regression equation is called least squares. This minimises the sum of the squares of the errors associated with each Y point by differentiation. This error is the difference between the observed Y point and the Y point predicted by the regression equation. In linear regression this error is also the error term of the Y distribution, the residual error.

The simple linear regression equation can be generalised to take account of k predictors:

Y = b0 + b1x1 + b2x2 + ... + bkxk

Assumptions of general linear regression:

· Y is linearly related to all x, or to linear transformations of them

· all error terms are independent

· deviations from the regression line (residuals) follow a normal distribution

· deviations from the regression line (residuals) have uniform variance

A residual for a Y point is the difference between the observed and fitted value for that point, i.e. it is the distance of the point from the fitted regression line. If the pattern of residuals changes along the regression line then consider using rank methods or linear regression after an appropriate transformation of your data.

Correlation

Correlation refers to the interdependence or co-relationship of variables.

In the context of regression examples, correlation reflects the closeness of the linear relationship between x and Y. Pearson's product moment correlation coefficient rho is a measure of this linear relationship. Rho is referred to as R when it is estimated from a sample of data.

· R lies between -1 and 1, with

· R = 0: no linear correlation

· R = 1: perfect positive (sloping up from bottom left to top right) linear correlation

· R = -1: perfect negative (sloping down from top left to bottom right) linear correlation

Assumption of Pearson's correlation:

· at least one variable must follow a normal distribution

N.B. If R is close to ± 1 then this does NOT mean that there is a good causal relationship between x and Y. It shows only that the sample data lie close to a straight line. R is a much abused statistic.

r² is the proportion of the total variance (s²) of Y that can be explained by the linear regression of Y on x. 1-r² is the proportion that is not explained by the regression. Thus 1-r² = s²Y·x / s²Y, where s²Y·x is the residual variance of Y about the regression line.

Copyright © 1990-2006 StatsDirectLimited, all rights reserved

Download a free 10 day StatsDirect trial

Simple linear regression

Menu location: Analysis_Regression & Correlation_Simple Linear & Correlation.

This function provides simple linear regression and Pearson's correlation.

Regression parameters for a straight line model (Y = a + bx) are calculated by the least squares method (minimisation of the sum of squares of deviations from a straight line). Differentiation gives the following formulae for the slope (b) and the Y intercept (a) of the line:

b = Σ(x - x̄)(Y - Ȳ) / Σ(x - x̄)²

a = Ȳ - b x̄
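As a minimal illustration (not the StatsDirect routine itself), these formulae can be applied directly; the sketch below assumes numpy is available and uses made-up x and Y values.

```python
import numpy as np

# Hypothetical data: replace with your own x (predictor) and Y (response) values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares estimates: b = Sxy / Sxx, a = mean(Y) - b * mean(x)
Sxy = np.sum((x - x.mean()) * (Y - Y.mean()))   # sum of products of deviations
Sxx = np.sum((x - x.mean()) ** 2)               # sum of squared deviations of x
b = Sxy / Sxx                                   # slope (regression coefficient)
a = Y.mean() - b * x.mean()                     # intercept (regression constant)

# Residuals: observed Y minus fitted Y
residuals = Y - (a + b * x)
print(f"slope = {b:.4f}, intercept = {a:.4f}")
```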

Regression assumptions:

· Y is linearly related to x or a transformation of x

· deviations from the regression line (residuals) follow a normal distribution

· deviations from the regression line (residuals) have uniform variance

A residual for a Y point is the difference between the observed and fitted value for that point, i.e. it is the distance of the point from the fitted regression line. If the pattern of residuals changes along the regression line then consider using rank methods or linear regression after an appropriate transformation of your data.

Pearson's product moment correlation coefficient (r) is given as a measure of linear association between the two variables:

r = Σ(x - x̄)(Y - Ȳ) / √[Σ(x - x̄)² Σ(Y - Ȳ)²]

r² is the proportion of the total variance (s²) of Y that can be explained by the linear regression of Y on x. 1-r² is the proportion that is not explained by the regression. Thus 1-r² = s²Y·x / s²Y, where s²Y·x is the residual variance of Y about the regression line.

Confidence limits are constructed for r using Fisher's z transformation. The null hypothesis that r = 0 (i.e. no association) is evaluated using a modified t test (Armitage and Berry, 1994; Altman, 1991).

Pearson's correlation assumption:

· at least one variable must follow a normal distribution

The estimated regression line may be plotted, and belts representing the standard error and confidence interval for the population value of the slope can be displayed. These belts represent the reliability of the regression estimate: the tighter the belt, the more reliable the estimate (Gardner and Altman, 1989).

N.B. If you require a weighted linear regression then please use the multiple linear regression function in StatsDirect; it will allow you to use just one predictor variable, i.e. the simple linear regression situation. The multiple regression option will also enable you to estimate a regression without an intercept, i.e. forced through the origin.

Example

From Armitage and Berry (1994, p. 161).

Test workbook (Regression worksheet: Birth Weight, % Increase).

The following data represent birth weights (oz) of babies and their percentage increase between 70 and 100 days after birth.

Birth Weight    % Increase
72              68
112             63
111             66
107             72
119             52
92              75
126             76
80              118
81              120
84              114
115             29
118             42
128             48
128             50
123             69
116             59
125             27
126             60
122             71
126             88
127             63
86              88
142             53
132             50
87              111
123             59
133             76
106             72
103             90
118             68
114             93
94              91

To analyse these data in StatsDirect you must first enter them into two columns in the workbook, appropriately labelled. Alternatively, open the test workbook using the file open function of the file menu. Then select Simple Linear & Correlation from the Regression and Correlation section of the analysis menu. Select the column marked "% Increase" when prompted for the response (Y) variable and then select "Birth Weight" when prompted for the predictor (x) variable.

For this example:

Simple linear regression

Equation: % Increase = -0.86433 Birth Weight + 167.870079

Standard Error of slope = 0.175684

95% CI for population value of slope = -1.223125 to -0.505535

Correlation coefficient (r) = -0.668236 (r² = 0.446539)

95% CI for r (Fisher's z transformed) = -0.824754 to -0.416618

t with 30 DF = -4.919791

Two sided P < .0001

Power (for 5% significance) = 99.01%

Correlation coefficient is significantly different from zero

From this analysis we have gained the equation for a straight line fitted to our data, i.e. % increase in weight = 167.87 - 0.864 * birth weight. The r² value tells us that about 45% of the total variation about the Y mean is explained by the regression line. The analysis of variance test for the regression, summarised by the ratio F, shows that the regression itself was statistically highly significant; this is equivalent to a t test with the null hypothesis that the slope is equal to zero. The confidence interval for the slope shows that, with 95% confidence, the population value for the slope lies somewhere between -0.5 and -1.2. The correlation coefficient r was highly significantly different from zero. Its negative value indicates an inverse relationship between X and Y, i.e. lower birth weight babies show greater % increases in weight at 70 to 100 days after birth. With 95% confidence the population value for r lies somewhere between -0.4 and -0.8.
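The same analysis can be sketched outside StatsDirect, for example with scipy.stats.linregress; the arrays below are transcribed from the table above, so if the transcription is faithful the slope, intercept and r should be close to the values reported.

```python
import numpy as np
from scipy import stats

# Birth weight (oz) and % increase between 70 and 100 days, from the table above.
birth_weight = np.array([72, 112, 111, 107, 119, 92, 126, 80, 81, 84, 115, 118,
                         128, 128, 123, 116, 125, 126, 122, 126, 127, 86, 142,
                         132, 87, 123, 133, 106, 103, 118, 114, 94])
pct_increase = np.array([68, 63, 66, 72, 52, 75, 76, 118, 120, 114, 29, 42,
                         48, 50, 69, 59, 27, 60, 71, 88, 63, 88, 53,
                         50, 111, 59, 76, 72, 90, 68, 93, 91])

result = stats.linregress(birth_weight, pct_increase)
# Expect roughly: slope -0.864, intercept 167.87, r -0.668 (r^2 about 0.447)
print(f"slope = {result.slope:.6f}")
print(f"intercept = {result.intercept:.6f}")
print(f"r = {result.rvalue:.6f}, r^2 = {result.rvalue**2:.6f}")
print(f"two sided P = {result.pvalue:.2g}")
```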

regression and correlation

P values

confidence intervals


Multiple (general) linear regression

Menu location: Analysis_Regression & Correlation_Multiple Linear.

This is a generalised regression function that fits a linear model of an outcome to one or more predictor variables.

The term multiple regression applies to linear prediction of one outcome from several predictors. The general form of a linear regression is:

Y' = b0 + b1x1 + b2x2 + ... + bkxk

- where Y' is the predicted outcome value for the linear model with regression coefficients b1 to bk and Y intercept b0 when the values for the predictor variables are x1 to xk. The regression coefficients are analogous to the slope of a simple linear regression.
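A general linear model of this form can be fitted by ordinary least squares; as a rough sketch (not the StatsDirect algorithm, which uses QR decomposition as described under Technical Validation below), numpy's lstsq solves the same problem. The data and variable names here are purely illustrative.

```python
import numpy as np

# Illustrative data: 30 observations of an outcome y and two predictors x1, x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=30)
x2 = rng.normal(size=30)
y = 1.5 + 2.0 * x1 - 0.7 * x2 + rng.normal(scale=0.5, size=30)

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares solution of y ~ X b, with b = [b0, b1, b2].
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b
residuals = y - fitted
print("coefficients (b0, b1, b2):", np.round(b, 4))
```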

Regression assumptions:

· Y is linearly related to the combination of x, or to transformations of x

· deviations from the regression line (residuals) follow a normal distribution

· deviations from the regression line (residuals) have uniform variance

Classifier predictors

If one of the predictors in a regression model classifies observations into more than two classes (e.g. blood group) then you should consider splitting it into separate dichotomous variables as described under dummy variables.

Influential data and residuals

A residual for a Y point is the difference between the observed and fitted value for that point, i.e. it is the distance of the point from the fitted regression line. If the pattern of residuals changes along the regression line then consider using rank methods or linear regression after an appropriate transformation of your data.

The influential data option in StatsDirect gives an analysis of residuals and allows you to save the residuals and their associated statistics to a workbook. It is good practice to examine a scatter plot of the residuals against fitted Y values. You might also wish to inspect a normal plot of the residuals and perform a Shapiro-Wilk test to look for evidence of non-normality.

Standard error for the predicted Y, leverage hi (the ith diagonal element of the hat matrix, X(X'X)⁻¹X'), Studentized residuals, jackknife residuals, Cook's distance and DFIT are also given with the residuals. For further information on analysis of residuals please see Belsley et al. (1980), Kleinbaum et al. (1998) or Draper and Smith (1998).

- where p is the number of parameters in the model, n is the number of observations, ei is a residual, ri is a Studentized residual, r-i is a jackknife residual, s² is the residual mean square, s²-i is an estimate of s² after deletion of the ith residual, hi is the leverage (ith diagonal element of the hat matrix), di is Cook's distance and DFITi is DFFITS.

· Studentized residuals have a t distribution with n-p degrees of freedom.

· Jackknife residuals have a t distribution with n-p-1 degrees of freedom. Note that SAS refers to these as "rstudent".

· If leverage (hi) is larger than the minimum of 3p/n and 0.99 then the ith observation has unusual predictor values.

· Unusual predicted (as opposed to predictor) values are indicated by large residuals.

· Cook's distance and DFIT each combine predicted and predictor factors to measure overall influence.

· Cook's distance is unusually large if it exceeds F(α, p, n-p) from the F distribution.

· DFIT is unusually large if it exceeds 2√(p/n).
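These influence statistics can be computed from the hat matrix and the residuals. The sketch below follows the usual textbook definitions (Belsley et al., 1980), which are believed to match the quantities StatsDirect reports, and assumes a design matrix X (including the intercept column) and response y as in the earlier sketch.

```python
import numpy as np

def influence_diagnostics(X, y):
    """Leverage, Studentized/jackknife residuals, Cook's distance and DFFITS
    for a least squares fit of y on design matrix X (intercept column included)."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
    h = np.diag(H)                                 # leverages h_i
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b                                  # residuals
    s2 = e @ e / (n - p)                           # residual mean square
    r = e / np.sqrt(s2 * (1 - h))                  # Studentized residuals
    # s^2 with the ith observation deleted, then jackknife (deleted) residuals
    s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
    r_jack = e / np.sqrt(s2_i * (1 - h))
    cook = r**2 * h / (p * (1 - h))                # Cook's distance
    dffits = r_jack * np.sqrt(h / (1 - h))         # DFFITS
    return h, r, r_jack, cook, dffits
```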

Collinearity

The degree to which the x variables are correlated, and thus predict one another, is collinearity. If collinearity is so high that some of the x variables almost totally predict other x variables then this is known as multicollinearity. In such cases the analysis of variance for the overall model may show a highly significant fit when, paradoxically, the tests for individual predictors are non-significant.

Multicollinearity causes problems in using regression models to draw conclusions about the relationships between predictors and outcome. An individual predictor's P value may test non-significant even though it is important. Confidence intervals for regression coefficients in a multicollinear model may be so wide that tiny changes in individual observations have a large effect on the coefficients, sometimes reversing their signs.

StatsDirect gives the variance inflation factor (and its reciprocal, tolerance) as a measure of collinearity.

VIFi = 1 / (1 - Ri²)

- where VIFi is the variance inflation factor for the ith predictor and Ri² is the multiple correlation coefficient when the ith predictor is taken as the outcome predicted by the remaining x variables.

If you detect multicollinearity you should aim to clarify the cause and remove it. For example, an unnecessary collinear predictor might be dropped from your model, or predictors might meaningfully be combined, e.g. combining height and weight as body mass index. Increasing your sample size reduces the impact of multicollinearity. More complex statistical techniques exist for dealing with multicollinearity, but these should be used under the expert guidance of a Statistician.

Chatterjee et al. (2000) suggest that multicollinearity is present if the mean VIF is considerably larger than 1 and the largest VIF is greater than 10 (others choose larger values; StatsDirect uses 20 as the threshold for marking the result with a red asterisk).
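Variance inflation factors can also be obtained from the diagonal of the inverse of the predictors' correlation matrix; this is a generic sketch (not the StatsDirect routine), and the 10/20 thresholds above are only rules of thumb.

```python
import numpy as np

def variance_inflation_factors(X_predictors):
    """VIF for each predictor column (intercept excluded).
    VIF_i = 1 / (1 - R_i^2), where R_i^2 is from regressing predictor i on the others;
    equivalently the diagonal of the inverse of the predictors' correlation matrix."""
    corr = np.corrcoef(X_predictors, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Example with two strongly related predictors (hypothetical data).
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)
print(variance_inflation_factors(np.column_stack([x1, x2])))
```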

Prediction and adjusted means

The prediction option allows you to calculate values of the outcome (Y) using your fitted linear model coefficients with a specified set of values for the predictors (X1…p). A confidence interval and a prediction interval (Altman, 1991) are given for each prediction.

The default X values shown are those required to calculate the least squares mean for the model, which is the mean of Y adjusted for all X. For continuous predictors the mean of X is used. For categorical predictors you should use X as 1/k, where k is the number of categories. StatsDirect attempts to identify categorical variables, but you should check the values against these rules if you are using categorical predictors in this way.

For example, if a model of Y = systolic blood pressure, X1 = sex, X2 = age was fitted, and you wanted to know the age and sex adjusted mean systolic blood pressure for the population that you sampled, you could use the prediction function to give the least squares mean as the answer, i.e. with X1 as 0.5 and X2 as mean age. If you wanted to know the mean systolic blood pressure for males, adjusted for age, then you would set X1 to 1 (if male sex is coded as 1 in your data).

Partial correlation

The partial correlation coefficient for a predictor Xk describes the relationship of Y and Xk when all other X are held fixed, i.e. the correlation of Y and Xk after taking away the effects of all other predictors in the model. The r statistic displayed with the main regression results is the partial correlation. It is calculated from the t statistic for the predictor as r = t / √(t² + ν), where ν is the residual degrees of freedom.

Multiple correlation

The multiple correlation coefficient (R) is Pearson's product moment correlation between the predicted values and the observed values (Y' and Y). Just as r² is the proportion of the total variance (s²) of Y that can be explained by the linear regression of Y on x, R² is the proportion of the variance explained by the multiple regression. The significance of R is tested by the F statistic of the analysis of variance for the regression.

An adjusted value of R² is given as Ra²:

Ra² = 1 - (1 - R²)(n - 1) / (n - k - 1)

- where n is the number of observations and k is the number of predictors. The adjustment allows comparison of Ra² between different regression models by compensating for the fact that R² is bound to increase with the number of predictors in the model.

The Durbin-Watson test statistic can be used to test for certain types of serial correlation (autocorrelation). For example, if a critical instrument gradually drifted off scale during collection of the outcome variable then there would be correlations due to the drift; in time ordered data these may be detected by autocorrelation tests such as Durbin-Watson. See Draper and Smith (1998) for more information, including critical values of the test statistic.
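For reference, the Durbin-Watson statistic is a simple function of successive residual differences; a sketch, assuming the residuals are supplied in time order:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson d = sum((e_t - e_{t-1})^2) / sum(e_t^2).
    Values near 2 suggest no first-order autocorrelation; values well below 2
    suggest positive serial correlation (e.g. instrument drift over time)."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```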

Automatic selection of predictors

There are a number of methods for selecting a subset of predictors that produce the "best" regression. Many statisticians discourage general use of these methods because they can detract from the real-world importance of predictors in a model. Examples of predictor selection methods are step-up selection, step-down selection, stepwise regression and best subset selection. The fact that there is no predominant method indicates that none of them are broadly satisfactory; a good discussion is given by Draper and Smith (1998). StatsDirect provides best subset selection by examination of all possible regressions. You have the option of either minimum Mallow's Cp or maximum overall F as the base statistic for subset selection. You may also force the inclusion of variables in this selection procedure if you consider their exclusion to be illogical in "real world" terms. Subset selection is best performed under expert statistical guidance.

Weights for outcome observations

StatsDirect can perform a general linear regression for which some outcome (Y) observations are given more weight in the model than others. An example of weights is 1/variance for each Yi, where Yi is a mean of multiple observations. This sort of analysis makes strong assumptions and is thus best carried out only under expert statistical guidance. An unweighted analysis is performed if you press cancel when asked for a weight variable.

Technical Validation

StatsDirect uses QR decomposition by Givens rotations to solve the linear equations to a high level of accuracy (Gentleman, 1974; Golub and Van Loan, 1983). Predictors that are highly correlated with other predictors are dropped from the model (you are warned of this in the results). If the QR method fails (rare) then StatsDirect will solve the system by singular value decomposition (Chan, 1982).

Example

From Armitage and Berry (1994, p. 316).

Test workbook (Regression worksheet: X1, X2, YY).

The following data are from a trial of a hypotensive drug used to lower blood pressure during surgery. The outcome/dependent variable (YY) is minutes taken to recover an acceptable (100 mmHg) systolic blood pressure, and the two predictor or explanatory variables are log dose of drug (X1) and mean systolic blood pressure during the induced hypotensive episode (X2).

YY    X1      X2
7     2.26    66
10    1.81    52
18    1.78    72
4     1.54    67
10    2.06    69
13    1.74    71
21    2.56    88
12    2.29    68
9     1.8     59
65    2.32    73
20    2.04    68
31    1.88    58
23    1.18    61
22    2.08    68
13    1.7     69
9     1.74    55
50    1.9     67
12    1.79    67
11    2.11    68
8     1.72    59
26    1.74    68
16    1.6     63
23    2.15    65
7     2.26    72
11    1.65    58
8     1.63    69
14    2.4     70
39    2.7     73
28    1.9     56
12    2.78    83
60    2.27    67
10    1.74    84
60    2.62    68
22    1.8     64
21    1.81    60
14    1.58    62
4     2.41    76
27    1.65    60
26    2.24    60
28    1.7     59
15    2.45    84
8     1.72    66
46    2.37    68
24    2.23    65
12    1.92    69
25    1.99    72
45    1.99    63
72    2.35    56
25    1.8     70
28    2.36    69
10    1.59    60
25    2.1     51
44    1.8     61

To analyse these data in StatsDirect you must first enter them into three columns in the workbook, appropriately labelled. Alternatively, open the test workbook using the file open function of the file menu. Then select Multiple Linear Regression from the Regression and Correlation section of the analysis menu. When you are prompted for regression options, tick the "calculate intercept" box (it is unusual to have reason not to calculate an intercept) and leave the "use weights" box unticked (regression with unweighted responses). Select the column "YY" when prompted for response and "X1" and "X2" when prompted for predictor data.

For this example:

Multiple linear regression

Intercept

b0 = 23.010668

t = 1.258453

P = 0.2141

X1

b1 = 23.638558

r = 0.438695

t = 3.45194

P = 0.0011

X2

b2 = -0.714675

r = -0.317915

t = -2.371006

P = 0.0216

yy = 23.010668 + 23.638558 x1 - 0.714675 x2

Analysis of variance from regression

Source of variation    Sum Squares      DF    Mean Square
Regression             2783.220444     2     1391.610222
Residual               11007.949367    50    220.158987
Total (corrected)      13791.169811    52

Root MSE = 14.837755

F = 6.320933, P = .0036

Multiple correlation coefficient

(R) = 0.449235

R² = 20.181177%

Ra² = 16.988424%

Durbin-Watson test statistic = 1.888528

The variance ratio, F, for the overall regression is highly significant, so we have very little reason to doubt that either X1 or X2, or both, are associated with YY. The r² value shows that only 20% of the variance of YY is accounted for by the regression, therefore the predictive value of this model is low. The partial correlation coefficients are shown to be significant but the intercept is not.

P values

confidence intervals

Technical validation results

The American National Institute of Standards and Technology provides Statistical Reference Datasets for testing statistical software (McCullough and Wilson, 1999; http://www.nist.gov/itl/div898/strd). The results below for the Longley data set (Longley, 1967) are given to 12 decimal places:

Multiple linear regression

Intercept

b0 = -3482258.63459587

t = -3.910802918155

P = .0036

x1

b1 = 15.061872271426

t = 0.177376028231

P = .8631

x2

b2 = -0.035819179293

t = -1.069516317221

P = .3127

x3

b3 = -2.020229803817

t = -4.136427355941

P = .0025

x4

b4 = -1.033226867174

t = -4.821985310446

P = .0009

x5

b5 = -0.051104105654

t = -0.226051144664

P = .8262

x6

b6 = 1829.15146461358

t = 4.01588981271

P = .003

y = -3482258.63459587 + 15.061872271426 x1 - 0.035819179293 x2 - 2.020229803817 x3 - 1.033226867174 x4 - 0.051104105654 x5 + 1829.15146461358 x6


Partial Correlation

Partial correlation is a method used to describe the relationship between two variables whilst taking away the effects of another variable, or several other variables, on this relationship.

Partial correlation is best thought of in terms of multiple regression; StatsDirect shows the partial correlation coefficient r with its main results from multiple linear regression.

A different way to calculate partial correlation coefficients, which does not require a full multiple regression, is shown below for the sake of further explanation of the principles:

Consider a correlation matrix for variables A, B and C (note that the multiple linear regression function in StatsDirect will output correlation matrices for you as one of its options):

        A        B        C
A       *
B       r(AB)    *
C       r(AC)    r(BC)    *

The partial correlation of A and B adjusted for C is:

r(AB|C) = [r(AB) - r(AC) r(BC)] / √{[1 - r(AC)²][1 - r(BC)²]}

The same can be done using Spearman's rank correlation coefficient.

The hypothesis test for the partial correlation coefficient is performed in the same way as for the usual correlation coefficient, but it is based upon n-3 degrees of freedom.

Please note that this sort of relationship between three or more variables is more usefully investigated using the multiple regression itself (Altman, 1991).

The general form of partial correlation from a multiple regression is as follows:

r = tk / √(tk² + ν)

- where tk is the Student t statistic for the kth term in the linear model and ν is the residual degrees of freedom.
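Both routes can be sketched in a few lines; the matrix route uses the formula above for A and B adjusted for C, and the regression route uses the t statistic with the residual degrees of freedom (illustrated with the X1 figures from the multiple regression example, t = 3.45194 with 50 df).

```python
import math

def partial_corr_from_matrix(r_ab, r_ac, r_bc):
    """Partial correlation of A and B adjusted for C, from pairwise correlations."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))

def partial_corr_from_t(t, residual_df):
    """Partial correlation of a predictor with the outcome, from its t statistic."""
    return t / math.sqrt(t**2 + residual_df)

# X1 in the worked multiple regression example: t = 3.45194 with 50 residual df.
print(round(partial_corr_from_t(3.45194, 50), 6))   # approximately 0.438695
```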


Regression in groups

Menu location: Analysis_Regression & Correlation_Grouped Linear.

Linearity with replicates of Y

Grouped linear regression with covariance analysis

This section provides grouped linear regression and analysis of covariance. There is also a test for linearity when repeated observations of the response/outcome (Y) variable are available for each observation in a single predictor (X) variable.


Grouped linear regression with covariance analysis

Menu location: Analysis_Regression & Correlation_Grouped Linear_Covariance.

This function compares the slopes and separations of two or more simple linear regression lines.

The method involves examination of regression parameters for a group of xY pairs in relation to a common fitted function. This provides an analysis of variance that shows whether or not there is a significant difference between the slopes of the individual regression lines as a whole. StatsDirect then compares all of the slopes individually. The vertical distance between each regression line is then examined using analysis of covariance and the corrected means are given (Armitage and Berry, 1994).

Assumptions:

· Y replicates are a random sample from a normal distribution

· deviations from the regression line (residuals) follow a normal distribution

· deviations from the regression line (residuals) have uniform variance

This is just one facet of analysis of covariance; there are additional and alternative methods. For further information, see Kleinbaum et al. (1998) and Armitage & Berry (1994). Analysis of covariance is best carried out as part of a broader regression modelling exercise by a Statistician.

Technical Validation

Slopes of several regression lines are compared by analysis of variance as follows (Armitage, 1994):

- where SScommon is the sum of squares due to the common slope of k regression lines, SSbetween is the sum of squares due to differences between the slopes, SStotal is the total sum of squares, and the residual sum of squares is the difference between SStotal and SScommon. Sxxj is the sum of squares about the mean x observation in the jth group, SxYj is the sum of products of the deviations of xY pairs from their means in the jth group and SYYj is the sum of squares about the mean Y observation in the jth group.

Vertical separation of slopes of several regression lines is tested by analysis of covariance as follows (Armitage, 1994):

- where SS are corrected sums of squares within the groups, total, and between the groups (subtract within from total). The constituent sums of products or squares are partitioned between groups, within groups and total as above.

Data preparation

If there are equal numbers of replicate Y observations, or single Y observations, for each x then it is best to prepare and select your data using a group identifier variable. For example, with three replicates you would prepare five columns of data: group identifier, x, y1, y2, and y3. Remember to choose the "Groups by identifier" option in this case.

If there are unequal numbers of replicate Y observations for each x then you must prepare the x data in separate columns by group, and prepare the Y data in separate columns by group and observation (i.e. Y for group 1 observation 1… r rows long, where r is the number of repeat observations). Remember to choose the "Groups by column" option in this case. This is done in the example below.

Example

From Armitage and Berry (1994).

Test workbook (Regression worksheet: Log Dose_Std, BD 1_Std, BD 2_Std, BD 3_Std, Log Dose_I, BD 1_I, BD 2_I, BD 3_I, Log Dose_F, BD 1_F, BD 2_F, BD 3_F).

Three different preparations of Vitamin D are tested for their effect on bones by feeding them to rats that have an induced lack of mineral in their bones. X-ray methods are used to test the re-mineralisation of bones in response to the Vitamin D.

For the standard preparation:

Log dose of Vit D:    0.544    0.845    1.146

Bone density score (replicates, reading across the three dose columns):

0       1.5     2
0       2.5     2.5
1       5       5
2.75    6       4
2.75    4.25    5
1.75    2.75    4
2.75    1.5     2.5
2.25    3       3.5
2.25    3       2.5
2       3       4
4

For alternative preparation I:

Log dose of Vit D:    0.398    0.699    1.000    1.301    1.602

Bone density score (replicates, reading across the five dose columns):

0       1       1.5     3       3.5
1       1.5     1       3       3.5
0       1.5     2       5.5     4.5
0       1       3.5     2.5     3.5
0       1       2       1       3.5
0.5     0.5     0       2       3

For alternative preparation F:

Log dose of Vit D:    0.398    0.699    1.000

Bone density score (replicates, reading across the three dose columns):

2.75    2.5     3.75
2       2.75    5.25
1.25    2.25    6
2       2.25    5.5
0       3.75    2.25
0.5     3.5

To analyse these data in StatsDirect you must first enter them into 14 columns in the workbook, appropriately labelled. The first column is just three rows long and contains the three log doses of vitamin D for the standard preparation. The next three columns represent the repeated measures of bone density for each of the three levels of log dose of vitamin D, which are represented by the rows of the first column. This is then repeated for the other two preparations. Alternatively, open the test workbook using the file open function of the file menu. Then select covariance from the groups section of the regression and correlation section of the analysis menu. Select the columns marked "Log Dose_Std", "Log Dose_I" and "Log Dose_F" when you are prompted for the predictor (x) variables; these contain the log dose levels (logarithms are taken because, from previous research, the relationship between bone re-mineralisation and Vitamin D is known to be log-linear). Make sure that the "use Y replicates" option is checked when you are prompted for it. Then select the outcome (Y) variables that represent the replicates. You will have to select three, five and three columns in just three selection actions, because these are the numbers of corresponding dose levels in the x variables, in the order in which you selected them.

Alternatively, these data could have been entered in just three pairs of workbook columns representing the three preparations, with a log dose column and a column of the mean bone density score for each dose level. By accepting the more long-winded input of replicates, StatsDirect is encouraging you to run a test of linearity on your data.

For this example:

Grouped linear regression

Source of variation    SSq           DF    MSq          VR
Common slope           78.340457     1     78.340457    67.676534    P < 0.0001
Between slopes         4.507547      2     2.253774     1.946984     P = 0.1501
Separate residuals     83.34518      72    1.157572
Within groups          166.193185    75

Common slope is significant

Difference between slopes is NOT significant

Slope comparisons:

slope 1 (Log Dose_Std) v slope 2 (Log Dose_I) = 2.616751 v 2.796235

Difference (95% CI) = 0.179484 (-1.576065 to 1.935032)

t = -0.203808, P = 0.8391

slope 1 (Log Dose_Std) v slope 3 (Log Dose_F) = 2.616751 v 4.914175

Difference (95% CI) = 2.297424 (-0.245568 to 4.840416)

t = -1.800962, P = 0.0759

slope 2 (Log Dose_I) v slope 3 (Log Dose_F) = 2.796235 v 4.914175

Difference (95% CI) = 2.11794 (-0.135343 to 4.371224)

t = -1.873726, P = 0.065

Covariance analysis

Uncorrected:

Source of variation    YY            xY           xx          DF
Between groups         17.599283     -3.322801    0.988515    2
Within                 166.193185    25.927266    8.580791    8
Total                  183.792468    22.604465    9.569306    10

Corrected:

Source of variation    SSq           DF    MSq          VR
Between groups         42.543829     2     21.271915    1.694921
Within                 87.852727     7     12.55039
Total                  130.396557    9

P = 0.251

Corrected Y means ± SE, for mean x 0.85771:

Y' = 2.821356 ± 2.045448

Y' = 1.453396 ± 1.593641

Y' = 3.317784 ± 2.054338

Line separations (common slope = 3.021547):

line 1 (Log Dose_Std) vs line 2 (Log Dose_I): vertical separation = 1.367959

95% CI = -4.760348 to 7.496267

t = 0.527831, (7 df), P = 0.6139

line 1 (Log Dose_Std) vs line 3 (Log Dose_F): vertical separation = -0.496428

95% CI = -7.354566 to 6.36171

t = -0.171164, (7 df), P = 0.8689

line 2 (Log Dose_I) vs line 3 (Log Dose_F): vertical separation = -1.864388

95% CI = -8.042375 to 4.3136

t = -0.713594, (7 df), P = 0.4986

The common slope is highly significant and the test for difference between the slopes overall was non-significant. If our assumption of linearity holds true, we can conclude that these lines are reasonably parallel. Looking more closely at the individual slopes, preparation F comes close to being significantly different from the other two, but this difference was not large enough to make the overall slope comparison significantly heterogeneous.

The analysis of covariance did not show any significant vertical separation of the three regression lines.

P values


Linearity with replicates of Y

Menu location: Analysis_Regression & Correlation_Grouped Linear_Linearity.

This function gives a test for linearity in a simple linear regression model when the response/outcome variable (Y) has been measured repeatedly.

The standard analysis of variance for simple (one predictor) linear regression tests for the possibility that the observed data fit a straight line, but it does not test whether or not a straight line model is appropriate in the first place. This function can be used to test that assumption of linearity.

For studies that employ linear regression, it is worth collecting repeat Y observations. This enables you to run a test of linearity and thus justify or refute the use of linear regression in subsequent analysis (Armitage and Berry, 1994). Replicate Y observations should be entered in separate workbook columns (variables), one column for each observation in the predictor (x) variable. The number of Y replicate variables which you are prompted to select is the number of rows in the x variable.

Please note that the repeats of Y observations should be arranged in a worksheet so that each Y column contains repeats of a single Y observation that matches an x observation (i.e. row 4 of x should relate to column 4 of Y repeats). If your data are arranged so that each repeat of Y is in a separate column then use the Rotate Data Block function of the Data menu on your Y data before selecting them for this function.

Assumptions:

· Y replicates are a random sample from a normal distribution

· deviations from the regression line (residuals) follow a normal distribution

· deviations from the regression line (residuals) have uniform variance

Generalisations of this method to models with more than one predictor are available (Draper and Smith, 1998). Generalised replicate analysis is best done as a part of exploratory modelling by a Statistician.

Technical Validation

Linearity is tested by analysis of variance for the linear regression of k outcome observations for each level of the predictor variable (Armitage, 1994):

- where SSregression is the sum of squares due to the regression of Y on x, SSrepeats is the part of the usual residual sum of squares that is due to variation within repeats of the outcome observations, SStotal is the total sum of squares, and the remainder represents the sum of squares due to deviation of the means of repeated outcome observations from the regression. Y is the outcome variable, x is the predictor variable, N is the total number of Y observations and nj is the number of Y repeats for the jth x observation.

Example

From Armitage and Berry (1994, p. 288).

Test workbook (Regression worksheet: Log Dose_Std, BD 1_Std, BD 2_Std, BD 3_Std).

A preparation of Vitamin D is tested for its effect on bones by feeding it to rats that have an induced lack of mineral in their bones. X-ray methods are used to test the re-mineralisation of bones in response to the Vitamin D.

Log dose of Vit D:    0.544    0.845    1.146

Bone density score (replicates, reading across the three dose columns):

0       1.5     2
0       2.5     2.5
1       5       5
2.75    6       4
2.75    4.25    5
1.75    2.75    4
2.75    1.5     2.5
2.25    3       3.5
2.25    3       2.5
2       3       4
4

To analyse these data in StatsDirect you must first enter them into four columns in the workbook and label them appropriately. The first column is just three rows long and contains the three log doses of Vitamin D above (logarithms are taken because, from previous research, the relationship between bone re-mineralisation and Vitamin D is known to be log-linear). The next three columns represent the repeated measures of bone density for each of the three levels of log dose of Vitamin D, which are represented by the rows of the first column. Alternatively, open the test workbook using the file open function of the file menu. Then select linearity from the groups section of the regression and correlation section of the analysis menu. Select the column marked "Log Dose_Std" when you are prompted for the x variable; this contains the three log dose levels. Then select, in one action, the three Y columns "BD 1_Std", "BD 2_Std" and "BD 3_Std", which correspond to each row (level) of the x variable, i.e. 0.544 --> 0.845 --> 1.146.

For this example:

Linearity with replicates of Y

Source of variation     SSq          DF    MSq          VR
Due to regression       14.088629    1     14.088629    9.450512    P = .0047
Deviation of x means    2.903415     1     2.903415     1.947581    P = .1738
Within x residual       41.741827    28    1.49078
Total                   58.733871    30

Regression slope is significant

Assumption of linearity supported

Thus the regression itself (meaning the slope) was statistically highly significant. If the deviations from x means had been significant then we should have rejected our assumption of linearity; as it stands, they were not.

P values

non-linear models


Polynomial regression

Menu location: Analysis_Regression & Correlation_Polynomial.

This function fits a polynomial regression model to powers of a single predictor by the method of linear least squares. Interpolation and calculation of areas under the curve are also given.

If a polynomial model is appropriate for your study then you may use this function to fit a k order/degree polynomial to your data:

Y' = b0 + b1x + b2x² + ... + bkx^k

- where Y' is the predicted outcome value for the polynomial model with regression coefficients b1 to bk for each degree and Y intercept b0. The model is simply a general linear regression model with k predictors raised to the power of i, where i = 1 to k. A second order (k=2) polynomial forms a quadratic expression (parabolic curve), a third order (k=3) polynomial forms a cubic expression and a fourth order (k=4) polynomial forms a quartic expression. See Kleinbaum et al. (1998) and Armitage and Berry (1994) for more information.

Some general principles:

· the fitted model is more reliable when it is built on large numbers of observations.

· do not extrapolate beyond the limits of observed values.

· choose values for the predictor (x) that are not too large, as they will cause overflow with higher degree polynomials; scale x down if necessary.

· do not draw false confidence from low P values; use these to support your model only if the plot looks reasonable.

More complex expressions involving polynomials of more than one predictor can be achieved by using the general linear regression function. For more detail from the regression, such as analysis of residuals, use the general linear regression function. To achieve a polynomial fit using general linear regression you must first create new workbook columns that contain the predictor (x) variable raised to powers up to the order of polynomial that you want. For example, a second order fit requires input data of Y, x and x².
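In practice this just means building extra columns of powers of x and fitting them as an ordinary (general) linear regression; a minimal sketch for a second order fit, with illustrative data:

```python
import numpy as np

# Hypothetical predictor and outcome; for a real analysis use your own columns.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 3.8, 8.9, 16.2, 24.8, 36.1])

# Second order polynomial as a general linear model: columns 1, x, x^2.
X = np.column_stack([np.ones_like(x), x, x ** 2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"y = {b0:.3f} + {b1:.3f} x + {b2:.3f} x^2")
```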

Model fit and intervals

Subjective goodness of fit may be assessed by plotting the data and the fitted curve. An analysis of variance is given via the analysis option; this reflects the overall fit of the model. Try to use as few degrees as possible for a model that achieves significance at each degree.

The plot function supplies a basic plot of the fitted curve and a plot with confidence bands and prediction bands. You can save the fitted Y values with their standard errors, confidence intervals and prediction intervals to a workbook.

Area under curve

The option to calculate the area under the fitted curve employs two different methods. The first method integrates the fitted polynomial function from the lowest to the highest observed predictor (x) value using Romberg's integration. The second method uses the trapezoidal rule directly on the data to provide a crude estimate.

Technical Validation

StatsDirect uses QR decomposition by Givens rotations to solve the linear equations to a high level of accuracy (Gentleman, 1974; Golub and Van Loan, 1983). If the QR method fails (rare) then StatsDirect will solve the system by singular value decomposition (Chan, 1982).

Example

From McClave and Deitrich (1991, p. 753).

Test workbook (Regression worksheet: Home Size, KW Hrs/Mnth).

Here we use an example from the physical sciences to emphasise the point that polynomial regression is mostly applicable to studies where environments are highly controlled and observations are made to a specified level of tolerance. The data below are the electricity consumptions in kilowatt-hours per month from ten houses and the areas in square feet of those houses:

Home Size    KW Hrs/Mnth
1290         1182
1350         1172
1470         1264
1600         1493
1710         1571
1840         1711
1980         1804
2230         1840
2400         1956
2930         1954

To analyse these data in StatsDirect you must first prepare them in two workbook columns, appropriately labelled. Alternatively, open the test workbook using the file open function of the file menu. Then select Polynomial from the Regression and Correlation section of the analysis menu. Select the column marked "KW Hrs/Mnth" when asked for the outcome (Y) variable and select the column marked "Home Size" when asked for the predictor (x) variable. Enter the order of this polynomial as 2.

For this example:

Polynomial regression

Intercept

b0 = -1216.143887

t = -5.008698

P = .0016

Home Size

b1 = 2.39893

t = 9.75827

P < .0001

Home Size^2

b2 = -0.00045

t = -7.617907

P = .0001

KW Hrs/Mnth = -1216.143887 + 2.39893 Home Size - 0.00045 Home Size^2

Analysis of variance from regression

Source of variation    Sum Squares      DF    Mean Square
Regression             831069.546371    2     415534.773185
Residual               15332.553629     7     2190.364804
Total (corrected)      846402.1         9

Root MSE = 46.801333

F = 189.710304, P < .0001

Multiple correlation coefficient (R) = 0.990901

R² = 98.188502%

Ra² = 97.670932%

Durbin-Watson test statistic = 1.63341

Polynomial regression - area under curve

AUC (polynomial function) = 2855413.374801

AUC (by trapezoidal rule) = 2838195

Thus, the overall regression and both degree coefficients are highly significant.

N.B. Look at a plot of this data curve. The right hand end shows a very sharp decline. If you were to extrapolate beyond the data you have observed, then you might conclude that very large houses have very low electricity consumption. This is obviously wrong. Polynomials are frequently illogical for some parts of a fitted curve. You must blend common sense, art and mathematics when fitting these models! Remember the general principles listed above.
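For comparison, the worked example can be sketched with numpy: np.polyfit gives the quadratic coefficients and np.trapz applies the trapezoidal rule directly to the data (which should return 2838195, matching the output above), assuming the table has been transcribed correctly.

```python
import numpy as np

home_size = np.array([1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400, 2930])
kw_hrs = np.array([1182, 1172, 1264, 1493, 1571, 1711, 1804, 1840, 1956, 1954])

# Second order polynomial fit; polyfit returns coefficients highest degree first.
b2, b1, b0 = np.polyfit(home_size, kw_hrs, 2)
print(f"KW Hrs/Mnth = {b0:.6f} + {b1:.5f} Home Size + {b2:.7f} Home Size^2")
# Expect coefficients roughly -1216.14, 2.39893 and -0.00045.

# Crude area under the observed data by the trapezoidal rule.
print("AUC (trapezoidal rule) =", np.trapz(kw_hrs, home_size))
```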

P values

non-linear models


Linearized estimates

Menu location: Analysis_Regression & Correlation_Linearized Estimates.

This section provides simple (one predictor) regression estimates for three linearized functions (exponential, geometric and hyperbolic) by an unweighted least squares method.

These functions should be used only to indicate that a more robust fit of the selected model is worth investigating with your data.

Exponential

Data are linearized by logarithmic transformation of the predictor (x) variable. Simple linear regression of Y vs. ln(x) gives a = ln(intercept) and b = slope for the function:

Geometric

Data are linearized by logarithmic transformation of both variables. Simple linear regression of ln(Y) vs. ln(x) gives a = ln(intercept) and b = slope for the function:

Hyperbolic

Data are linearized by reciprocal transformation of both variables. Simple linear regression of 1/Y vs. 1/x gives a = slope and b = intercept for the function:

The standard error of the estimate is given for each of these regressions, but please note that the errors of your outcome/response variable might not be from a normal distribution.
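As an illustration of the idea, a geometric (power function) relationship can be fitted by simple linear regression of ln(Y) on ln(x); this is only a starting point, because, as noted above, the transformed errors may not be normal. Data and names here are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical data believed to follow Y = A * x^b (a "geometric" relationship).
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
Y = np.array([2.1, 3.0, 4.1, 5.9, 8.2, 11.8])

# Linearize by taking logs of both variables, then fit a straight line.
fit = stats.linregress(np.log(x), np.log(Y))
b = fit.slope                 # exponent of the power function
A = np.exp(fit.intercept)     # multiplier, back-transformed from the intercept
print(f"Y is approximately {A:.3f} * x^{b:.3f}")
```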

This section of StatsDirect is intended only for those who are familiar with regression modelling and who use these linearized estimates as a springboard for further modelling. GLIM and Genstat provide comprehensive tools for generalised linear modelling. MLP and Genstat provide comprehensive tools for non-linear modelling. GLIM, MLP and Genstat are available from Numerical Algorithms Group Ltd.

P values

confidence intervals

non-linear models


Probit analysis

Menu location: Analysis_Regression & Correlation_Probit Analysis.

This function provides probit analysis for fitting probit and logit sigmoid dose/stimulus response curves and for calculating confidence intervals for dose-response quantiles such as ED50.

When biological responses are plotted against their causal stimuli (or logarithms of them) they often form a sigmoid curve. Sigmoid relationships can be linearized by transformations such as logit, probit and angular. For most systems the probit (normal sigmoid) and logit (logistic sigmoid) give the most closely fitting result. Logistic methods are useful in Epidemiology because odds ratios can be determined easily from differences between fitted logits (see logistic regression). In biological assay work, however, probit analysis is preferred (Finney, 1971, 1978). Curves produced by these methods are very similar, with maximum variation occurring within 10% of the upper and lower asymptotes.

Probit

Y' = F⁻¹(p)

- where Y' is the probit transformed value (5 used to be added to avoid negative values in hand calculation), p is the proportion (p = responders/total number) and F⁻¹(p) is the 100*p% quantile from the standard normal distribution.

Logit

Odds = p/(1-p)

[p = proportional response, i.e. r out of n responded, so p = r/n]

Logit = log odds = log(p/(1-p))
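The two transformations can be written directly; a sketch using scipy for the normal quantile (the classical probit adds 5, as noted above):

```python
import numpy as np
from scipy.stats import norm

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return np.log(p / (1 - p))

def probit(p, add_five=False):
    """Normal equivalent deviate: the 100*p% quantile of the standard normal.
    Classical probit tables add 5 to avoid negative values in hand calculation."""
    return norm.ppf(p) + (5 if add_five else 0)

p = np.array([0.1, 0.5, 0.9])
print(logit(p), probit(p))
```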

Data preparation

Your data are entered as dose levels, the number of subjects tested at each dose level and the number responding at each dose level. At the time of running the analysis you may enter a control result for the number of subjects responding in the absence of dose/stimulus; this provides a global adjustment for natural mortality/responsiveness. You may also specify automatic log transformation of the dose levels at run time if appropriate (this should be supported by good evidence of a log-probit relationship for your type of study).

Model analysis and critical quantiles

The fitted model is assessed by statistics for heterogeneity, which follow a chi-square distribution. If the heterogeneity statistics are significant then your observed values deviate from the fitted curve too much for reliable inference to be made (Finney, 1971, 1978).

StatsDirect gives you the effective/lethal levels of dose/stimulus with confidence intervals at the quantiles you specify, e.g. ED50/LD50.

Technical validation

The curve is fitted by maximum likelihood estimation, using Newton-Raphson iteration. A dummy variable is used to factor in the background/natural response rate if you specify a response in controls.

Further analysis and cautions

For more complex probit analysis, such as the calculation of relative potencies from several related dose response curves, consider non-linear optimisation software or specialist dose-response analysis software such as Bliss. The latter is a FORTRAN routine written by David Finney and Ian Craigie; it is available from Edinburgh University Computing Centre. MLP or Genstat can be used for more general non-linear model fitting with the ability to constrain curves to "parallelism". Expert statistical guidance should be sought before attempting this sort of work.

CAUTION 1: Please do not think of probit analysis as a "cure all" for dose response curves. Many log dose - response relationships are clearly not Gaussian sigmoids. Other well described sigmoid relationships include angular, Wilson-Worcester and Cauchy-Urban. There may be no "off the shelf" regression model suited to your study. Exploratory non-linear modelling should only be carried out by an expert Statistician.

CAUTION 2: Standard probit analysis is designed to handle only quantal responses with binomial error distributions. Quantal data, such as the number of subjects responding vs. total number of subjects tested, usually have binomial error distributions. You should not use continuous data, such as percent maximal response, with probit analysis, as these data are likely to require regression methods that assume a different error distribution. Most researchers should seek expert statistical help before pursuing this type of analysis (see non-linear models).

Example

From Finney (1971, p. 98).

Test workbook (Regression worksheet: Age, Girls, + Menses).

The following data represent a study of the age at menarche (first menstruation) of 3918 Warsaw girls. For each age group you are given mean age, total number of girls and the number of girls who had reached menarche.

Age      Girls    + Menses
9.21     376      0
10.21    200      0
10.58    93       0
10.83    120      2
11.08    90       2
11.33    88       5
11.58    105      10
11.83    111      17
12.08    100      16
12.33    93       29
12.58    100      39
12.83    108      51
13.08    99       47
13.33    106      67
13.58    105      81
13.83    117      88
14.08    98       79
14.33    97       90
14.58    120      113
14.83    102      95
15.08    122      117
15.33    111      107
15.58    94       92
15.83    114      112
17.58    1049     1049

To analyse these data in StatsDirect you must first prepare them in three workbook columns, appropriately labelled. Alternatively, open the test workbook using the file open function of the file menu. Then select Logit from the Probit analysis section of the Regression and Correlation section of the analysis menu. Select the column marked "Age" when you are prompted for dose levels, select "Girls" when you are prompted for subjects at each level and select "+ Menses" when prompted for responders at each level. Make sure that the "Calc log10" option is not checked when prompted; this disables base 10 logarithmic transformation of the "dose" variable (mean ages in this example). Enter the number of controls as 0 when prompted, and also enter 0 when you are asked about an additional quantile.

For this example:

Probit analysis - logit sigmoid curve

constant = -10.613197

slope = 0.815984

Median * Dose = 13.006622

Confidence interval (No Heterogeneity) = 12.930535 to 13.082483

* Dose for centile 90 = 14.352986

Confidence interval (No Heterogeneity) = 14.238636 to 14.480677

Chi² (heterogeneity of deviations from model) = 21.869852, (23 df), P = .5281

t for slope = 27.682452, (23 df), P < .0001

Having looked at a plot of this model and accepted that the model is reasonable, we conclude with 95% confidence that the true population value for median age at menarche in Warsaw lay between 12.93 and 13.08 years when this study was carried out.
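A similar dose-response fit can be approximated with a binomial GLM (logit link), e.g. in statsmodels; the intercept and slope are reported on the natural log-odds scale, so they may differ by a scale factor from StatsDirect's output (which follows Finney's probit/logit conventions), but the median age ED50 = -intercept/slope should still come out near 13.0 years, assuming the table is transcribed correctly.

```python
import numpy as np
import statsmodels.api as sm

age = np.array([9.21, 10.21, 10.58, 10.83, 11.08, 11.33, 11.58, 11.83, 12.08,
                12.33, 12.58, 12.83, 13.08, 13.33, 13.58, 13.83, 14.08, 14.33,
                14.58, 14.83, 15.08, 15.33, 15.58, 15.83, 17.58])
girls = np.array([376, 200, 93, 120, 90, 88, 105, 111, 100, 93, 100, 108, 99,
                  106, 105, 117, 98, 97, 120, 102, 122, 111, 94, 114, 1049])
menses = np.array([0, 0, 0, 2, 2, 5, 10, 17, 16, 29, 39, 51, 47, 67, 81, 88,
                   79, 90, 113, 95, 117, 107, 92, 112, 1049])

# Binomial GLM with a logit link: the response is (responders, non-responders).
X = sm.add_constant(age)
fit = sm.GLM(np.column_stack([menses, girls - menses]), X,
             family=sm.families.Binomial()).fit()
b0, b1 = fit.params
print("median age (ED50) =", round(-b0 / b1, 4))   # expected to be close to 13.0066
```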

P values

confidence intervals

non-linear models


Non-linear models

Many relationships between factors observed in the natural world are non-linear. A popular statistical approach to the study of these relationships is to transform them so that they are approximately linear and therefore amenable to the well-established numerical methods of linear modelling. Linear transformation works well in many situations but is not possible in others.

A substantial problem with fitting transformed variables is that errors you assumed to be normal in the non-transformed variable become non-normal after transformation. In specific applications, such as probit analysis, the error calculations have been designed for use with data that have a specified error distribution other than normal. It is not advisable to feed transformed variables through linear regression without a sound statistical argument for doing so. If you are confident of a particular model then you may be justified in using a generalised linear model method to fit your data. Examples of this are probit analysis and logistic regression. StatsDirect offers general logistic regression for dichotomous responses. SAS provides logistic regression for dichotomous and ordinal responses. GLIM provides a broad range of generalised linear models.

Non-linear statistical modelling is a highly complex subject that blends the instincts of experience with art and science. Seek the help of a Statistician if you wish to use non-linear methods.

N.B. Beware of any computer software that claims to be a general solving engine for non-linear problems. There are very few validated non-linear estimation packages that are supported by experts in the field. One example is MLP (available from Numerical Algorithms Group Ltd.) by Gavin Ross at Rothamsted Experimental Station (Ross, 1990).


Principal components analysis and alpha reliability coefficient

Menu locations:

Analysis_Regression & Correlation_Principal Components

Analysis_Agreement_Reliability & Reducibility

This function provides principal components analysis (PCA), based upon correlation or covariance, and Cronbach's coefficient alpha for scale reliability.

See questionnaire design for more information on how to use these methods in designing questionnaires or other study methods with multiple elements.

Principal components analysis is most often used as a data reduction technique for selecting a subset of "highly predictive" variables from a larger group of variables. For example, in order to select a sample of questions from a thirty-question questionnaire you could use this method to find a subset that gave the "best overall summary" of the questionnaire (Johnson and Wichern, 1998; Armitage and Berry, 1994; Everitt and Dunn, 1991; Krzanowski, 1988).

There are problems with this approach, and principal components analysis is often wrongly applied and badly interpreted. Please consult a statistician before using this method.

PCA does not assume any particular distribution of your original data but it is very sensitive to variance differences between variables. These differences might lead you to the wrong conclusions. For example, you might be selecting variables on the basis of sampling differences and not their "real" contributions to the group. Armitage and Berry (1994) give an example of visual analogue scale results to which principal components analysis was applied after the data had been transformed to angles as a way of stabilising variances.

Another problem area with this method is the aim for an orthogonal or uncorrelated subset of variables. Consider the questionnaire problem again: it is fair to say that a pair of highly correlated questions are serving much the same purpose, thus one of them should be dropped. The component dropped is most often the one that has the lower correlation with the overall score. It is not reasonable, however, to seek optimal non-correlation in the selected subset of questions. There may be many "real world" reasons why particular questions should remain in your final questionnaire. It is almost impossible to design a questionnaire where all of the questions have the same importance to every subject studied. For these reasons you should cast a net of questions that cover what you are trying to measure as a whole. This sort of design requires strong knowledge of what you are studying combined with strong appreciation of the limitations of the statistical methods used.

Everitt and Dunn (1991) outline PCA and other multivariate methods. McDowell and Newell (1996) and Streiner & Norman (1995) offer practical guidance on the design and analysis of questionnaires.

Factor analysis vs. principalcomponents

Factor analysis (FA) is a childof PCA, and the results of PCA are often wrongly labelledas FA. A factor is simply another word for a component. In short, PCA beginswith observations and looks for components, i.e. working from data toward ahypothetical model, whereas FA works the other way around. Technically, FA isPCA with some rotation of axes. There are different types of rotations, e.g. varimax (axes are kept orthogonal/perpendicular duringrotations) and oblique Procrustean (axes are allowed to form oblique patternsduring rotations), and there is disagreement over which to use and how toimplement them. Unsurprisingly, FA is misused a lot. There is usually a betteranalytical route that avoids FA; you should seek the advice of a statisticianif you are considering it.

Data preparation

To prepare data for principal components analysis in StatsDirect you must first enter them in the workbook. Use a separate column for each variable (component) and make sure that each row corresponds to the observations from one subject. Missing data values in a row will cause that row / subject to be dropped from the analysis. You have the option of investigating either correlation or covariance matrices; most often you will need the correlation matrix. As discussed above, it might be appropriate to transform your data before applying this method.

For the example of 0 to 7 scores from a questionnaire you would enter your data in the workbook in the following format. You might want to transform these data first (Armitage and Berry, 1994).

           Question 1  Question 2  Question 3  Question 4  Question 5
subject 1  5           7           4           1           5
subject 2  3           3           2           2           6
subject 3  2           2           4           3           7
subject 4  0           0           5           4           2

Internal consistency and deletion of individual components

Cronbach's alpha is a useful statistic for investigating the internal consistency of a questionnaire. If each variable selected for PCA represents test scores from an element of a questionnaire, StatsDirect gives the overall alpha and the alpha that would be obtained if each element in turn were dropped. If you are using weights then you should use the weighted scores. You should not enter the overall test score as this is assumed to be the sum of the elements you have specified. For most purposes alpha should be above 0.8 to support reasonable internal consistency. If the deletion of an element causes a considerable increase in alpha then you should consider dropping that element from the test. StatsDirect highlights increases of more than 0.1, but this must be considered along with the "real world" relevance of that element to your test. A standardised version of alpha is calculated by standardising all items in the scale so that their mean is 0 and variance is 1 before the summation part of the calculation is done (Streiner and Norman, 1995; McDowell and Newell, 1996; Cronbach, 1951). You should use standardised alpha if there are substantial differences in the variances of the elements of your test/questionnaire.
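
As a concrete illustration of the calculation described above, the minimal Python/NumPy sketch below computes raw alpha, standardised alpha and alpha-if-item-dropped using the usual formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). The item scores are invented for illustration (they are not the Agreement worksheet) and this is not the StatsDirect routine itself.

import numpy as np

def cronbach_alpha(items):
    # items: observations in rows, questionnaire elements in columns
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1.0)) * (1.0 - item_variances.sum() / total_variance)

def standardised_alpha(items):
    # standardise each item to mean 0, variance 1 before the summation step
    items = np.asarray(items, dtype=float)
    z = (items - items.mean(axis=0)) / items.std(axis=0, ddof=1)
    return cronbach_alpha(z)

scores = np.array([[4, 5, 3, 4],
                   [2, 3, 2, 3],
                   [5, 5, 4, 5],
                   [1, 2, 1, 2],
                   [3, 4, 3, 3]])

print("raw alpha =", cronbach_alpha(scores))
print("standardised alpha =", standardised_alpha(scores))
for j in range(scores.shape[1]):
    dropped = np.delete(scores, j, axis=1)
    print("alpha with item", j + 1, "dropped =", cronbach_alpha(dropped))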

Technical Validation

Singular value decomposition (SVD) is used to calculate the variance contribution of each component of a correlation or covariance matrix (Krzanowski, 1988; Chan, 1982):

The SVD of an n by m matrix X is USV' = X. U and V are orthogonal matrices, i.e. V'V = VV' = I, where V' is the transpose of V. U is a matrix formed from column vectors (m elements each) and V is a matrix formed from row vectors (n elements each). S is a symmetrical matrix with positive diagonal entries in non-increasing order. If X is a mean-centred, n by m matrix where n > m and rank r = m (i.e. full rank) then the first r columns of V are the first r principal components of X. The positive eigenvalues of X'X and XX' are the squares of the diagonals in S. The coefficients or latent vectors are contained in V.
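
The following minimal Python/NumPy sketch illustrates the SVD route to principal components described above, using the small questionnaire-style table from the data preparation section. It is an illustration of the idea only, not the StatsDirect implementation.

import numpy as np

# rows = subjects, columns = questions (the small example table above)
data = np.array([[5, 7, 4, 1, 5],
                 [3, 3, 2, 2, 6],
                 [2, 2, 4, 3, 7],
                 [0, 0, 5, 4, 2]], dtype=float)

# standardise columns, which is equivalent to working from the correlation matrix
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# singular value decomposition: z = U S V'
U, s, Vt = np.linalg.svd(z, full_matrices=False)

n = z.shape[0]
eigenvalues = s**2 / (n - 1)          # eigenvalues of the correlation matrix
proportion = eigenvalues / eigenvalues.sum()

scores = z @ Vt.T                     # principal component scores (equal to U * S)

print("eigenvalues:", eigenvalues)
print("cumulative proportion:", proportion.cumsum())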

Principal component scores are derived from U and S; the residual error of a lower-rank representation Y of X can be expressed as trace{(X-Y)(X-Y)'}. For a correlation matrix, the principal component score is calculated for the standardized variable, i.e. the original datum minus the mean of the variable then divided by its standard deviation.

Scale reversal is detected by assessing the correlation between the input variables and the scores for the first principal component.

A lower confidence limit for Cronbach's alpha is calculated using the sampling theory of Kristoff (1963) and Feldt (1965):

- where F is the F quantile for a 100(1-p)% confidence limit, k is the number of variables and n is the number of observations per variable.

Example

Test workbook (Agreement worksheet: Question 1, Question 2, Question 3, and Question 4).

Principal components (correlation)

Sign was reversed for: Question 3; Question 4

Component  Eigenvalue (SVD)  Proportion  Cumulative
1          1.92556           48.14%      48.14%
2          1.305682          32.64%      80.78%
3          0.653959          16.35%      97.13%
4          0.114799          2.87%       100%

With raw variables:

Scale reliability alpha = 0.54955 (95% lower confidence limit = 0.370886)

Variable dropped  Alpha     Change
Question 1        0.525396  -0.024155
Question 2        0.608566  0.059015
Question 3        0.411591  -0.13796
Question 4        0.348084  -0.201466

With standardized variables:

Scale reliability alpha = 0.572704 (95% lower confidence limit = 0.403223)

Variable dropped  Alpha     Change
Question 1        0.569121  -0.003584
Question 2        0.645305  0.072601
Question 3        0.398328  -0.174376
Question 4        0.328003  -0.244701

You can see from the results above that questions 3 and 4 seemed to have scales going in the opposite direction to the other two questions, so they were reversed before the final analysis. Dropping question 2 improves the internal consistency of the overall set of questions, but this does not bring the standardised alpha coefficient to the conventionally acceptable level of 0.8 and above. It may be necessary to rethink this questionnaire.


Logistic regression

Menu location: Analysis_Regression & Correlation_Logistic.

This function fits and analyses logistic models for binary outcome/response data with one or more predictors.

Binomial distributions are used for handling the errors associated with regression models for binary/dichotomous responses (i.e. yes/no, dead/alive) in the same way that the standard normal distribution is used in general linear regression. Other, less commonly used binomial models include normit/probit and complementary log-log. The logistic model is widely used and has many desirable properties (Hosmer and Lemeshow, 1989; Armitage and Berry, 1994; Altman 1991; McCullagh and Nelder, 1989; Cox and Snell, 1989; Pregibon, 1981).

Odds = p/(1- p)

[p = proportional response, i.e. r out of n responded so p = r/n]

Logit = log odds = log(p /(1- p))

When a logistic regression model has been fitted, estimates of p are marked with a hat symbol above the Greek letter pi to denote that the proportion is estimated from the fitted regression model. Fitted proportional responses are often referred to as event probabilities (i.e. p hat * n events out of n trials).

The difference between the logits of two proportions is the logarithm of their odds ratio, logit(p1) - logit(p2) = log[(p1/(1-p1)) / (p2/(1-p2))]; this demonstrates one of the important uses of logistic regression models, because differences on the logit scale can be interpreted directly as log odds ratios.

Logistic models provide important information about the relationship between response/outcome and exposure. It makes no difference to logistic models whether outcomes have been sampled prospectively or retrospectively; this is not the case with other binomial models.

The general form of a logistic regression is:

logit(p hat) = b0 + b1x1 + b2x2 + ... + bkxk

- where p hat is the expected proportional response for the logistic model with regression coefficients b1 to bk and intercept b0 when the values for the predictor variables are x1 to xk.

Classifier predictors

If one of the predictors in a regression model classifies observations into more than two classes (e.g. blood group) then you should consider splitting it into separate dichotomous variables as described under dummy variables.
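
For instance, a minimal pandas sketch of this kind of splitting is shown below; the blood-group values are invented and the column names are illustrative rather than those produced by StatsDirect's dummy variable function.

import pandas as pd

df = pd.DataFrame({"blood_group": ["A", "B", "O", "AB", "O", "A"]})

# k-1 dummy (design) variables; the omitted first category acts as the reference
dummies = pd.get_dummies(df["blood_group"], prefix="bg", drop_first=True)
print(pd.concat([df, dummies], axis=1))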

Data preparation

For individual responses that are dichotomous (e.g. yes/no), enter the total number as 1 and the response as 1 or 0 for each observation (usually 1 for yes and 0 for no).

For responses that are proportional, either enter the total number then the number responding, or enter the total number as 1 and then a proportional response (r/n).

Rows with missing data are left out of the model. If missing data are encountered you are warned that missing data can cause bias.

Deviance and model analysis

Deviance is minus twice the log of the likelihood ratio for models fitted by maximum likelihood (Hosmer and Lemeshow, 1989; Cox and Snell, 1989; Pregibon, 1981). Log likelihood and deviance are given under the model analysis option of logistic regression in StatsDirect.

The value of adding a parameter to a logistic model can be tested by subtracting the deviance of the model with the new parameter from the deviance of the model without the new parameter; this difference is then tested against a chi-square distribution with degrees of freedom equal to the difference between the degrees of freedom of the old and new models. The model analysis option tests the model you specify against a model with only one parameter, the intercept; this tests the combined value of the specified predictors/covariates in the model.
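
A minimal sketch of that deviance-difference test, using simulated data and statsmodels rather than StatsDirect, is shown below; the variable names and data are placeholders.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                     # extra candidate predictor
p = 1 / (1 + np.exp(-(-0.5 + 0.8 * x1)))
y = rng.binomial(1, p)

small = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)
large = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)

# difference in deviance = 2 * (log likelihood of larger model - log likelihood of smaller model)
lr = 2 * (large.llf - small.llf)
print("likelihood ratio chi-square =", lr, "P =", stats.chi2.sf(lr, df=1))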

Some statistical packages offer stepwise logistic regression that performs systematic tests for different combinations of predictors/covariates. Automatic model building procedures such as these can be erroneous as they do not consider the real-world importance of each predictor; for this reason StatsDirect does not include stepwise selection.

Three goodness of fit tests are given for the overall fit of a model: Pearson, deviance and Hosmer-Lemeshow (Hosmer and Lemeshow, 1989). Note that the Hosmer-Lemeshow (decile of risk) test is only applicable when the number of observations tied at any one covariate pattern is small in comparison with the total number of observations, and when all predictors are continuous variables.

Influential data and odds ratios

The following are provided under the fits & residuals option for the purpose of identifying influential data:

·leverages (diagonal elements of the logistic "hat" matrix)

·deviance residuals

·Pearson residuals

·standardized variances

·deletion displacements

·covariances

Approximate confidence intervals are given for the odds ratios derived from the covariates.

Bootstrap estimates

A bootstrap procedure may be used to cross-validate confidence intervals calculated for odds ratios derived from fitted logistic models (Efron and Tibshirani, 1997; Gong, 1986). The bootstrap confidence intervals used here are the 'bias-corrected' type.

The mechanism that StatsDirect uses is to draw a specified number of random samples (with replacement, i.e. some observations are drawn once only, others more than once and some not at all) from your data. These 're-samples' are fed back into the logistic regression, and bootstrap estimates of confidence intervals for the model parameters are made by examining the model parameters calculated at each cycle of the process. The bias statistic shows how much each mean model parameter from the bootstrap distribution deviates from the observed model parameters.
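
The sketch below illustrates the re-sampling mechanism with statsmodels on simulated data; it uses a simple percentile interval for brevity (StatsDirect uses the bias-corrected form), so it shows the idea only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.4, size=300)
p = 1 / (1 + np.exp(-(-1.0 + 0.7 * x)))
y = rng.binomial(1, p)

def odds_ratio(yy, xx):
    fit = sm.Logit(yy, sm.add_constant(xx)).fit(disp=0)
    return np.exp(fit.params[1])

n = len(y)
boot = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)           # re-sample rows with replacement
    boot.append(odds_ratio(y[idx], x[idx]))

print("observed odds ratio:", odds_ratio(y, x))
print("bootstrap 2.5% and 97.5% percentiles:", np.percentile(boot, [2.5, 97.5]))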

Classification and ROC curve

The confidence interval given with the likelihood ratios in the classification option is constructed using the robust approximation given by Koopman (1984) for ratios of binomial proportions. The 'near' cut-off in the classification option is the rounding cut-off that gives the maximum sum of sensitivity and specificity. This value should be the shoulder at the top left of the ROC (receiver operating characteristic) curve.

Prediction and adjusted means

The prediction option allows you to calculate values of the outcome (as response proportion) using your fitted logistic model coefficients with a specified set of values for the predictors (X1…p). A confidence interval is given for each prediction.

The default X values shown are those required to calculate the overall regression mean for the model, which is the mean of Y adjusted for all X. For continuous predictors the mean of X is used. For categorical predictors you should use X as 1/k, where k is the number of categories. StatsDirect attempts to identify categorical variables but you should check the values against these rules if you are using categorical predictors in this way.

For example, if a model of Y = logit(proportion of population who are hypertensive), X1 = sex, X2 = age was fitted, and you wanted to know the age and sex adjusted prevalence of hypertension in the population that you sampled, you could use the prediction function to give the regression mean as the answer, i.e. with X1 as 0.5 and X2 as mean age. If you wanted to know the age-adjusted prevalence of hypertension for males in your population then you would set X1 to 1 (if male sex is coded as 1 in your data).
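
A minimal sketch of this prediction step is shown below; the coefficients and X values are invented placeholders for a model like the one just described (X1 = sex coded 0/1, X2 = age), purely to show the back-transformation from the logit scale to a proportion.

import numpy as np

def predicted_proportion(b, x):
    # b = [b0, b1, ..., bk]; x = [x1, ..., xk]
    eta = b[0] + np.dot(b[1:], x)
    return 1.0 / (1.0 + np.exp(-eta))

b = np.array([-2.0, 0.4, 0.03])   # illustrative intercept, sex and age coefficients

# regression mean (sex and age adjusted prevalence): X1 = 0.5, X2 = mean age (say 45)
print(predicted_proportion(b, [0.5, 45]))

# age-adjusted prevalence for males (sex coded 1)
print(predicted_proportion(b, [1, 45]))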

Further methods

GLIM provides many generalised linear models with link functions including binomial (see non-linear models). SAS provides an extension of logistic regression to ordinal responses; this is known as ordered logistic regression. Exploratory regression modelling should be attempted only under the expert guidance of a Statistician.

Technical validation

The logits of the response data are fitted using an iteratively re-weighted least squares method to find maximum likelihood estimates of the parameters in the logistic model (McCullagh and Nelder, 1989; Cox and Snell, 1989; Pregibon, 1981).

Residuals and case-wise diagnostic statistics are calculated as follows (Hosmer and Lemeshow, 1989):

Leverages are the diagonal elements of the logistic equivalent of the hat matrix in general linear regression (where leverages are proportional to the distances of the jth covariate pattern from the mean of the data). The jth diagonal element of the logistic equivalent of the hat matrix is calculated as:

hj = mj * p hat j * (1 - p hat j) * xj'(X'VX)^-1 xj

- where mj is the number of trials with the jth covariate pattern, p hat is the expected proportional response, xj is the jth covariate pattern, X is the design matrix containing all covariates (first column as 1 if intercept calculated) and V is a diagonal matrix with the general element mj * p hat j * (1 - p hat j).

Deviance residuals are used to detect ill-fitting covariate patterns, and they are calculated as:

dj = sign(yj - mj * p hat j) * sqrt(2 * [ yj * ln(yj / (mj * p hat j)) + (mj - yj) * ln((mj - yj) / (mj * (1 - p hat j))) ])

- where mj is the number of trials with the jth covariate pattern, p hat is the expected proportional response and yj is the number of successes with the jth covariate pattern.

Pearson residuals are used to detect ill-fitting covariate patterns, and they are calculated as:

rj = (yj - mj * p hat j) / sqrt(mj * p hat j * (1 - p hat j))

- where mj is the number of trials with the jth covariate pattern, p hat is the expected proportional response and yj is the number of successes with the jth covariate pattern.

Standardized Pearson residuals are used to detect ill-fitting covariate patterns, and they are calculated as:

rsj = rj / sqrt(1 - hj)

- where rj is the Pearson residual for the jth covariate pattern and hj is the leverage for the jth covariate pattern.

Deletion displacement (delta beta) measures the change caused by deleting all observations with the jth covariate pattern. The statistic is used to detect observations that have a strong influence upon the regression estimates. This change in regression coefficients is calculated as:

- where rj is the Pearson residual for the jth covariate pattern and hj is the leverage for the jth covariate pattern.

Standardized deletion displacement (std delta beta) measures the change caused by deleting all observations with the jth covariate pattern. The statistic is used to detect observations that have a strong influence upon the regression estimates. This change in regression coefficients is calculated as:

- where rsj is the standardized Pearson residual for the jth covariate pattern and hj is the leverage for the jth covariate pattern.

Deletion chi-square (delta chi-square) measures the change in the Pearson chi-square statistic (for the fit of the regression) caused by deleting all observations with the jth covariate pattern. The statistic is used to detect ill-fitting covariate patterns. This change in chi-square is calculated as:

delta chi-square j = rj² / (1 - hj)

- where rj is the Pearson residual for the jth covariate pattern and hj is the leverage for the jth covariate pattern.

Example

From Altman (1991).

Test workbook (Regression worksheet: Men, Hypertensive, Smoking, Obesity, Snoring).

Smoking, obesity and snoring were related to hypertension in 433 men aged 40 or over.

Men  Hypertensive  Smoking  Obesity  Snoring
60   5             0        0        0
17   2             1        0        0
8    1             0        1        0
2    0             1        1        0
187  35            0        0        1
85   13            1        0        1
51   15            0        1        1
23   8             1        1        1

To analyse these data using StatsDirect you must first enter them into five columns of a workbook. Alternatively, open the test workbook using the file open function of the file menu. Then select Logistic from the Regression and Correlation section of the analysis menu. Choose the option to enter grouped data when prompted. Select the column marked "Men" when asked for total number and select "Hypertensive" when asked for response. Click on the cancel button when asked about weights, i.e. you select an unweighted analysis. Then select "Smoking", "Obesity" and "Snoring" in one action when you are asked for predictors. Click on Yes when you are asked about an intercept.

For this example:

Logistic regression

Deviance (likelihood ratio) chi-square = 12.507498, df = 3, P = 0.0058

Intercept

b0 = -2.377661

z = -6.253967

P < .0001

Smoking

b1 = -0.067775

z = -0.243686

P = .8075

Obesity

b2 = 0.69531

z = 2.438954

P = .0147

Snoring

b3 = 0.871939

z = 2.193152

P = .0283

logit Hypertensive = -2.377661 -0.067775 Smoking +0.69531 Obesity +0.871939 Snoring

Logistic regression - model analysis

Accuracy = 1.00E-07

Log likelihood with all covariates = -199.4582

Deviance with all covariates = 1.618403, df = 4, rank = 4

Akaike = 9.618403

Schwartz = 12.01561

Deviance with no covariates = 14.1259

Deviance (likelihood ratio) chi-square = 12.507498, df = 3, P = 0.0058

Pearson chi-square goodness of fit = 1.364272, df = 4, P = 0.8504

Deviance goodness of fit = 1.618403, df = 4, P = 0.8055

Hosmer-Lemeshow type test = 0.551884, df = 3, P = 0.9074

Parameter  Coefficient  Standard Error
Constant   -2.377661    0.380185
Smoking    -0.067775    0.278124
Obesity    0.69531      0.285085
Snoring    0.871939     0.397574

We can infer that smoking has no association with hypertension from this evidence and drop it from our model. Remember that there may be important interactions between predictors. The fits & residuals option gives you the covariances. It would be prudent to seek statistical advice on the interpretation of covariance and influential data.

Parameter estimates can be used to obtain odds ratios for each covariate:

Logistic regression - odds ratios

Parameter  Estimate   Odds Ratio  95% CI
Constant   -2.377661
Smoking    -0.067775  0.934471    0.541784 to 1.611779
Obesity    0.69531    2.00433     1.146316 to 3.504564
Snoring    0.871939   2.391544    1.097143 to 5.213072

Thus with 95% confidence we can infer that the risk of hypertension in obese people is between 1.15 and 3.5 times greater than in non-obese people.
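
If you want to cross-check these results outside StatsDirect, a minimal statsmodels sketch of the same grouped-data logistic fit is shown below. The counts are the Altman (1991) data from the table above; the coefficients and odds ratios should agree closely with the output shown, but this is an illustration, not the StatsDirect algorithm.

import numpy as np
import statsmodels.api as sm

men = np.array([60, 17, 8, 2, 187, 85, 51, 23])
hypertensive = np.array([5, 2, 1, 0, 35, 13, 15, 8])
smoking = np.array([0, 1, 0, 1, 0, 1, 0, 1])
obesity = np.array([0, 0, 1, 1, 0, 0, 1, 1])
snoring = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X = sm.add_constant(np.column_stack([smoking, obesity, snoring]))
y = np.column_stack([hypertensive, men - hypertensive])   # successes, failures

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)                        # intercept and coefficients
print("odds ratios:", np.exp(fit.params[1:]))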

See also: P values, confidence intervals.


Conditional logistic regression

Menu location: Analysis_Regression & Correlation_Conditional Logistic.

This function fits and analyses conditional logistic models for binary outcome/response data with one or more predictors, where observations are not independent but are matched or grouped in some way.

Binomial distributions are used for handling the errors associated with regression models for binary/dichotomous responses (i.e. yes/no, dead/alive) in the same way that the standard normal distribution is used in general linear regression. Other, less commonly used binomial models include normit/probit and complementary log-log. The logistic model is widely used and has many desirable properties (Hosmer and Lemeshow, 1989; Armitage and Berry, 1994; Altman 1991; McCullagh and Nelder, 1989; Cox and Snell, 1989; Pregibon, 1981).

Odds = p/(1- p)

[p = proportional response, i.e. r out of n responded so p = r/n]

Logit = log odds = log(p /(1- p))

When a logistic regression model has been fitted, estimates of p are marked with a hat symbol above the Greek letter pi to denote that the proportion is estimated from the fitted regression model. Fitted proportional responses are often referred to as event probabilities (i.e. p hat * n events out of n trials).

The difference between the logits of two proportions is the logarithm of their odds ratio; this demonstrates one of the important uses of logistic regression models, because differences on the logit scale can be interpreted directly as log odds ratios.

Logistic models provide important information about the relationship between response/outcome and exposure. It makes no difference to logistic models whether outcomes have been sampled prospectively or retrospectively; this is not the case with other binomial models.

The conditional logistic model can cope with 1:1 or 1:m case-control matching. In the simplest case, this is an extension of McNemar's test for matched studies.

Data preparation

You must prepare your data case by case, i.e. ungrouped, one subject/observation per row; this is unlike the unconditional logistic function, which accepts grouped or ungrouped data.

The binary outcome variable must contain only 0 (control) or 1 (case).

There must be a stratum indicator variable to denote the strata. In case-control studies with 1:1 matching this would mean a code for each pair (i.e. two rows marked stratum x, one with a case + covariates and the other with a control + covariates). For 1:m matched studies there will be 1+m rows of data for each stratum/matching-group.
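
A minimal sketch of this data layout and a 1:1 matched fit is shown below, assuming a recent statsmodels (which provides ConditionalLogit); the four pairs and their covariate values are invented placeholders, not the Hosmer and Lemeshow data used in the example further down.

import pandas as pd
from statsmodels.discrete.conditional_models import ConditionalLogit

df = pd.DataFrame({
    "pair":  [1, 1, 2, 2, 3, 3, 4, 4],     # stratum indicator (one code per matched pair)
    "case":  [1, 0, 1, 0, 1, 0, 1, 0],     # 1 = case, 0 = matched control
    "smoke": [1, 0, 0, 1, 1, 1, 0, 0],
    "lwt":   [110, 135, 130, 120, 140, 150, 125, 105],
})

fit = ConditionalLogit(df["case"], df[["smoke", "lwt"]], groups=df["pair"]).fit()
print(fit.summary())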

Technical validation

The regression is fitted by maximisation of the natural logarithm of the conditional likelihood function using Newton-Raphson iteration as described by Krailo et al. (1984), Smith et al. (1981) and Howard (1972).

Example

From Hosmer and Lemeshow (1989).

Test workbook (Regression worksheet: PAIRID, LBWT, RACE (b), SMOKE, HT, UI, PTD, LWT).

These are artificially matched data from a study of the risk factors associated with low birth weight in Massachusetts in 1986. The predictors studied here are black race (RACE (b)), smoking status (SMOKE), hypertension (HT), uterine irritability (UI), previous preterm delivery (PTD) and weight of the mother at her last menstrual period (LWT).

To analyse these data using StatsDirect you must first open the test workbook using the file open function of the file menu. Then select Conditional Logistic from the Regression and Correlation section of the analysis menu. Select the column marked "PAIRID" when asked for the stratum (match group) indicator. Then select "LBWT" when asked for the case-control indicator. Then select "RACE (b)", "SMOKE", "HT", "UI", "PTD", and "LWT" in one action when you are asked for predictors.

For this example:

Conditional logistic regression

Deviance (-2 log likelihood) = 51.589852

Deviance (likelihood ratio) chi-square = 26.042632, P = 0.0002

Pseudo (McFadden) R-square = 0.33546

Label     Parameter estimate  Standard error  z          P
RACE (b)  0.582272            0.620708        0.938078   0.3482
SMOKE     1.410799            0.562177        2.509528   0.0121
HT        2.351335            1.05135         2.236492   0.0253
UI        1.399261            0.692244        2.021341   0.0432
PTD       1.807481            0.788952        2.290989   0.022
LWT       -0.018222           0.00913         -1.995807  0.046

Label     Odds ratio  95% confidence interval
RACE (b)  1.790102    0.53031 to 6.042622
SMOKE     4.099229    1.361997 to 12.337527
HT        10.499579   1.3374 to 82.429442
UI        4.052205    1.043404 to 15.737307
PTD       6.095073    1.298439 to 28.611218
LWT       0.981943    0.964529 to 0.999673

You may infer from the results above that hypertension, smoking status and previous pre-term delivery are convincing predictors of low birth weight in the population studied.

Note that the selection of predictors for regression models such as this can be complex and is best done with the help of a Statistician. Hosmer and Lemeshow (1989) give a good discussion of the example above, but with non-standard dummy variables (StatsDirect uses a standard dummy/design variable coding scheme adopted by most other statistical software). The optimal selection of predictors depends not only upon their numerical performance in the model, with or without appropriate transformations or study of interactions, but also upon their biophysical importance in the study.

See also: P values, confidence intervals.


Poisson regression

Menu location: Analysis_Regression & Correlation_Poisson

This function fits a Poisson regression model for multivariate analysis of numbers of uncommon events in cohort studies.

The multiplicative Poisson regression model is fitted as a log-linear regression (i.e. a log link and a Poisson error distribution), with an offset equal to the natural logarithm of person-time if person-time is specified (McCullagh and Nelder, 1989; Frome, 1983; Agresti, 2002). With the multiplicative Poisson model, the exponents of coefficients are equal to the incidence rate ratio (relative risk). These baseline relative risks give values relative to named covariates for the whole population. You can define relative risks for a sub-population by multiplying that sub-population's baseline relative risk with the relative risks due to other covariate groupings, for example the relative risk of dying from lung cancer if you are a smoker who has lived in a high radon area. StatsDirect offers sub-population relative risks for dichotomous covariates.
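
A minimal statsmodels sketch of such a model, with the natural log of person-time as the offset and a single dichotomous covariate, is shown below; the counts and person-years are invented for illustration and this is not the StatsDirect implementation.

import numpy as np
import statsmodels.api as sm

events = np.array([5, 12, 20, 40])               # event counts per group
person_years = np.array([1000.0, 1100.0, 900.0, 800.0])
exposed = np.array([0, 0, 1, 1])                 # dichotomous covariate

X = sm.add_constant(exposed)
fit = sm.GLM(events, X,
             family=sm.families.Poisson(),
             offset=np.log(person_years)).fit()

print(fit.params)
print("incidence rate ratio for exposure:", np.exp(fit.params[1]))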

The outcome/response variable is assumed to come from a Poisson distribution. Note that a Poisson distribution is the distribution of the number of events in a fixed time interval, provided that the events occur at random, independently in time and at a constant rate. Poisson distributions are used for modelling events per unit space as well as time, for example the number of particles per square centimetre.

Poisson regression can also be used for log-linear modelling of contingency table data, and for multinomial modelling. For contingency table counts you would create r + c indicator/dummy variables as the covariates, representing the r rows and c columns of the contingency table:

Response  x_r1  x_r2  x_r3  x_c1  x_c2  x_c3
r1c1      1     0     0     1     0     0
r1c2      1     0     0     0     1     0
r1c3      1     0     0     0     0     1
r2c1      0     1     0     1     0     0
r2c2      0     1     0     0     1     0
r2c3      0     1     0     0     0     1
r3c1      0     0     1     1     0     0
r3c2      0     0     1     0     1     0
r3c3      0     0     1     0     0     1

Adequacy of the model

In order to assess the adequacy of the Poisson regression model you should first look at the basic descriptive statistics for the event count data. If the count mean and variance are very different (they are equal in a Poisson distribution) then the model is likely to be over-dispersed.

The model analysis option gives a scale parameter (sp) as a measure of over-dispersion; this is equal to the Pearson chi-square statistic divided by the number of observations minus the number of parameters (covariates and intercept). The variances of the coefficients can be adjusted by multiplying by sp. The goodness of fit test statistics and residuals can be adjusted by dividing by sp. Using a quasi-likelihood approach sp could be integrated with the regression, but this would assume a known fixed value for sp, which is seldom the case. A better approach to over-dispersed Poisson models is to use a parametric alternative model, the negative binomial.
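
The sketch below shows that scale parameter calculation on simulated over-dispersed (negative binomial) counts, again with statsmodels rather than StatsDirect; sp well above 1 flags over-dispersion, and the coefficient standard errors can be inflated by sqrt(sp) as described above.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=200)
mu = np.exp(0.5 + 0.3 * x)
y = rng.negative_binomial(2, 2.0 / (2.0 + mu))   # counts with extra-Poisson variation

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()

sp = fit.pearson_chi2 / fit.df_resid             # Pearson chi-square / (n - parameters)
print("scale parameter =", sp)
print("sp-adjusted standard errors:", fit.bse * np.sqrt(sp))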

The deviance (likelihood ratio) test statistic, G², is the most useful summary of the adequacy of the fitted model. It represents the change in deviance between the fitted model and the model with a constant term and no covariates; therefore G² is not calculated if no constant is specified. If this test is significant then the covariates contribute significantly to the model.

The deviance goodness of fit test reflects the fit of the data to a Poisson distribution in the regression. If this test is significant then a red asterisk is shown by the P value, and you should consider other covariates and/or other error distributions such as negative binomial.

StatsDirect does not exclude/drop covariates from its Poisson regression if they are highly correlated with one another. Models that are not of full rank (rank = number of parameters) are fully estimated in most circumstances, but you should usually consider combining or excluding variables, or possibly excluding the constant term. You should seek expert statistical advice if you find yourself in this situation.

Technical validation

The deviance function is:

D = 2 * sum over i of [ yi * ln(yi / mi) - (yi - mi) ]

- where y is the number of events, n is the number of observations and m is the fitted Poisson mean.

The log-likelihood function is:

ln L = sum over i of [ yi * ln(mi) - mi - ln(yi!) ]

The maximum likelihood regression proceeds by iteratively re-weighted least squares, using singular value decomposition to solve the linear system at each iteration, until the change in deviance is within the specified accuracy.

The Pearson chi-square residual is:

ri = (yi - mi) / sqrt(mi)

The Pearson goodness of fit test statistic is:

chi-square = sum over i of (yi - mi)² / mi

The deviance residual is (Cook and Weisberg, 1982):

di = sign(yi - mi) * sqrt(2 * [ yi * ln(yi / mi) - (yi - mi) ])

The Freeman-Tukey, variance stabilized, residual is (Freeman and Tukey, 1950):

fi = sqrt(yi) + sqrt(yi + 1) - sqrt(4 * mi + 1)

The standardized residual is:

rsi = (yi - mi) / sqrt(mi * (1 - hi))

- where h is the leverage (diagonal of the Hat matrix).

Example

From Armitage et al. (2001):

Test workbook (Regression worksheet: Cancers, Subject-years, Veterans, Age group).

To analyse these data using StatsDirect you must first open the test workbook using the file open function of the file menu. Next generate a set of dummy variables to represent the levels of the "Age group" variable using the Dummy Variables function of the Data menu. Then select Poisson from the Regression and Correlation section of the Analysis menu. Select the column marked "Cancers" when asked for the response. Then select "Subject-years" when asked for person-time. Then select "Veterans", "Age group (25-29)", "Age group (30-34)" etc. in one action when you are asked for predictors.

For this example:

Poisson regression

Deviance (likelihood ratio) chi-square = 2067.700372, df = 11, P < 0.0001

Intercept

b0 = -9.324832

z = -45.596773

P < 0.0001

Veterans

b1 = -0.003528

z = -0.063587

P = 0.9493

Age group (25-29)

b2 = 0.679314

z = 2.921869

P = 0.0035

Age group (30-34)

b3 = 1.371085

z = 6.297824

P < 0.0001

Age group (35-39)

b4 = 1.939619

z = 9.14648

P < 0.0001

Age group (40-44)

b5 = 2.034323

z = 9.413835

P < 0.0001

Age group (45-49)

b6 = 2.726551

z = 12.269534

P < 0.0001

Age group (50-54)

b7 = 3.202873

z = 14.515926

P < 0.0001

Age group (55-59)

b8 = 3.716187

z = 17.064363

P < 0.0001

Age group (60-64)

b9 = 4.092676

z = 18.801188

P < 0.0001

Age group (65-69)

b10 = 4.23621

z = 18.892791

P < 0.0001

Age group (70+)

b11 = 4.363717

z = 19.19183

P < 0.0001

log Cancers [offset log(Subject-years)] = -9.324832 -0.003528 Veterans +0.679314 Age group (25-29) +1.371085 Age group (30-34) +1.939619 Age group (35-39) +2.034323 Age group (40-44) +2.726551 Age group (45-49) +3.202873 Age group (50-54) +3.716187 Age group (55-59) +4.092676 Age group (60-64) +4.23621 Age group (65-69) +4.363717 Age group (70+)

Poisson regression - incidence rate ratios

Inference population: whole study (baseline risk)

Parameter          Estimate   IRR        95% CI
Veterans           -0.003528  0.996479   0.89381 to 1.11094
Age group (25-29)  0.679314   1.972524   1.250616 to 3.111147
Age group (30-34)  1.371085   3.939622   2.571233 to 6.036256
Age group (35-39)  1.939619   6.956098   4.590483 to 10.540786
Age group (40-44)  2.034323   7.647073   5.006696 to 11.679905
Age group (45-49)  2.726551   15.280093  9.884869 to 23.620062
Age group (50-54)  3.202873   24.60311   15.96527 to 37.914362
Age group (55-59)  3.716187   41.107367  26.825601 to 62.992647
Age group (60-64)  4.092676   59.899957  39.096281 to 91.773558
Age group (65-69)  4.23621    69.145275  44.555675 to 107.305502
Age group (70+)    4.363717   78.54856   50.303407 to 122.653248

Poisson regression - model analysis

Accuracy = 1.00E-07

Log likelihood with all covariates = -66.006668

Deviance with all covariates = 5.217124, df = 10, rank = 12

Akaike information criterion = 29.217124

Schwartz information criterion = 45.400676

Deviance with no covariates = 2072.917496

Deviance (likelihood ratio, G²) = 2067.700372, df = 11, P < 0.0001

Pseudo (McFadden) R-square = 0.997483

Pseudo (likelihood ratio index) R-square = 0.939986

Pearson goodness of fit = 5.086063, df = 10, P = 0.8854

Deviance goodness of fit = 5.217124, df = 10, P = 0.8762

Over-dispersion scale parameter = 0.508606

Scaled G² = 4065.424363, df = 11, P < 0.0001

Scaled Pearson goodness of fit = 10, df = 10, P = 0.4405

Scaled Deviance goodness of fit = 10.257687, df = 10, P = 0.4182

Parameter          Coefficient  Standard Error
Constant           -9.324832    0.204506
Veterans           -0.003528    0.055478
Age group (25-29)  0.679314     0.232493
Age group (30-34)  1.371085     0.217708
Age group (35-39)  1.939619     0.212062
Age group (40-44)  2.034323     0.216099
Age group (45-49)  2.726551     0.222221
Age group (50-54)  3.202873     0.220645
Age group (55-59)  3.716187     0.217775
Age group (60-64)  4.092676     0.217682
Age group (65-69)  4.23621      0.224224
Age group (70+)    4.363717     0.227374

Parameter          Scaled Standard Error  Scaled Wald z  P
Constant           0.145847               -63.935674     < 0.0001
Veterans           0.039565               -0.089162      0.929
Age group (25-29)  0.165806               4.097037       < 0.0001
Age group (30-34)  0.155262               8.830792       < 0.0001
Age group (35-39)  0.151235               12.825169      < 0.0001
Age group (40-44)  0.154115               13.200054      < 0.0001
Age group (45-49)  0.158481               17.204308      < 0.0001
Age group (50-54)  0.157357               20.354193      < 0.0001
Age group (55-59)  0.15531                23.927605      < 0.0001
Age group (60-64)  0.155243               26.362975      < 0.0001
Age group (65-69)  0.159909               26.491421      < 0.0001
Age group (70+)    0.162155               26.910733      < 0.0001

With 95% confidence you can infer that the risk of cancer in these veterans compared with non-veterans lies between 0.89 and 1.11, i.e. a statistically non-significant effect.

See also: P values, confidence intervals.


Kendall's rank correlation

Menu location: Analysis_Non-parametric_Kendall Rank Correlation.

Kendall's rank correlation provides a distribution free test of independence and a measure of the strength of dependence between two variables.

Spearman's rank correlation is satisfactory for testing a null hypothesis of independence between two variables but it is difficult to interpret when the null hypothesis is rejected. Kendall's rank correlation improves upon this by reflecting the strength of the dependence between the variables being compared.

Consider two samples, x and y, each of size n. The total number of possible pairings of x with y observations is n(n-1)/2. For each possible pair of observations (xi, yi) and (xj, yj): if the observation with the larger x value also has the larger y value, the pair is concordant; if the observation with the larger x value has the smaller y value, the pair is discordant. S is the difference between the number of concordant (ordered in the same way, nc) and discordant (ordered differently, nd) pairs.

Tau (t) is related to S by:

t = 2S / (n(n-1))

If there are tied (same value) observations then tb is used:

tb = S / sqrt( [n(n-1)/2 - T] * [n(n-1)/2 - U] )

- where T is the sum of ti(ti-1)/2 over the ranks of x, ti being the number of observations tied at a particular rank of x, and U is the corresponding sum for y, ui being the number tied at a rank of y.

In the presence of ties the statistic tb is given as a variant of t adjusted for ties (Kendall, 1970). When there are no ties tb = t. An approximate confidence interval is given for tb or t. Please note that the confidence interval does not correspond exactly to the P values of the tests because slightly different assumptions are made (Samra and Randles, 1988).

The gamma coefficient is given as a measure of association that is highly resistant to tied data (Goodman and Kruskal, 1963):

gamma = (nc - nd) / (nc + nd)

Tests for Kendall's test statistic being zero are calculated in exact form when there are no tied data, and in approximate form through a normalised statistic with and without a continuity correction (Kendall's score reduced by 1).

Technical Validation

An asymptotically distribution-free confidence interval is constructed for tb or t using the variant of the method of Samra and Randles (1988) described by Hollander and Wolfe (1999).

In the presence of ties, the normalised statistic is calculated using the extended variance formula given by Hollander and Wolfe (1999). In the absence of ties, the probability of null S (and thus t) is evaluated using a recurrence formula when n < 9 and an Edgeworth series expansion when n ≥ 9 (Best and Gipps, 1974). In the presence of ties you are guided to make inferences from the normal approximation (Kendall and Gibbons, 1990; Conover, 1999; Hollander and Wolfe, 1999). Note that StatsDirect uses more accurate methods for calculating the P values associated with t than some other statistical software; therefore, there may be differences in results.

Example

From Armitage and Berry (1994, p. 466).

Test workbook (Nonparametric worksheet: Career, Psychology).

The following data represent a tutor's ranking of ten clinical psychology students as to their suitability for their career and their knowledge of psychology:

Career  Psychology
4       5
10      8
3       6
1       2
9       10
2       3
6       9
7       4
8       7
5       1

To analyse these data in StatsDirect you must first enter them into two columns in the workbook. Alternatively, open the test workbook using the file open function of the file menu. Then select Kendall Rank Correlation from the Non-parametric section of the analysis menu. Select the columns marked "Career" and "Psychology" when prompted for data.

For this example:

Kendall's tau = 0.5111

Approximate 95% CI = 0.1352 to 0.8870

Upper side (H1 concordance) P = .0233

Lower side (H1 discordance) P = .9767

Two sided (H1 dependence) P = .0466

From these results we reject the null hypothesis of mutual independence between the career suitability and psychology knowledge rankings for the students. With a two sided test we are considering the possibility of concordance or discordance (akin to positive or negative correlation). A one sided test would have been restricted to either discordance or concordance, which would be an unusual assumption. In our example we can conclude that there is a statistically significant lack of independence between career suitability and psychology knowledge rankings of the students by the tutor. The tutor tended to rank students with apparently greater knowledge as more suitable to their career than those with apparently less knowledge and vice versa.
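
You can reproduce this tau and the two sided P value outside StatsDirect; the minimal SciPy sketch below uses the ten tutor rankings from the table above. As noted in the technical validation section, other software may use slightly different P value methods, so small differences are possible.

from scipy import stats

career     = [4, 10, 3, 1, 9, 2, 6, 7, 8, 5]
psychology = [5, 8, 6, 2, 10, 3, 9, 4, 7, 1]

tau, p_two_sided = stats.kendalltau(career, psychology)
print("tau =", tau)                    # about 0.51
print("two sided P =", p_two_sided)    # about 0.047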

See also: P values, reference list.


Non-parametric linear regression

Menu location: Analysis_Non-parametric_Non-parametric Linear Regression.

This is a distribution free method for investigating a linear relationship between two variables Y (dependent, outcome) and X (predictor, independent).

The slope b of the regression (Y = bX + a) is calculated as the median of the gradients from all possible pairwise contrasts of your data. A confidence interval based upon Kendall's t is constructed for the slope.

Non-parametric linear regression is much less sensitive to extreme observations (outliers) than is simple linear regression based upon the least squares method. If your data contain extreme observations which may be erroneous but you do not have sufficient reason to exclude them from the analysis then non-parametric linear regression may be appropriate.

Assumptions:

·The sample is random (X can be non-random provided that Ys are independent with identical conditional distributions).

·The regression of Y on X is linear (this implies an interval measurement scale for both X and Y).

This function also provides you with an approximate two sided Kendall's rank correlation test for independence between the variables.

Technical Validation

Note that the two sided confidence interval for the slope is the inversion of the two sided Kendall's test. The approximate two sided P value for Kendall's t or tb is given, but the exact quantile from Kendall's distribution is used to construct the confidence interval; therefore, there may be slight disagreement between the P value and confidence interval. If there are many ties then this situation is compounded (Conover, 1999).

Example

From Conover (1999, p. 338).

Test workbook (Nonparametric worksheet: GPA, GMAT).

The following data represent test scores for 12 graduates:

GPA  GMAT
4.0  710
4.0  610
3.9  640
3.8  580
3.7  545
3.6  560
3.5  610
3.5  530
3.5  560
3.3  540
3.2  570
3.2  560

To analyse these data in StatsDirect you must first enter them into two columns in the workbook. Alternatively, open the test workbook using the file open function of the file menu. Then select Non-parametric Linear Regression from the Non-parametric section of the analysis menu. Select the columns marked "GPA" and "GMAT" when prompted for Y and X variables respectively.

For this example:

GPA vs. GMAT

Observations per sample = 12

Median slope (95% CI) = 0.003485 (0 to 0.0075)

Y-intercept = 1.581061

Kendall's rank correlation coefficient tau b = 0.439039

Two sided (on continuity corrected z) P = .0678

If you plot GPA against GMAT scores using the scatter plot function in the graphics menu, you will see that there is a reasonably straight line relationship between GPA and GMAT. Here we can infer with 95% confidence that the true population value of the slope of a linear regression line for these two variables lies between 0 and 0.008. The regression equation is estimated at Y = 1.5811 + 0.0035X.

From the two sided Kendall's rank correlation test, we cannot reject the null hypothesis of mutual independence between the pairs of results for the twelve graduates. Note that the zero lower confidence interval is a marginal result and we may have rejected the null hypothesis had we used a different method for testing independence.
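
The slope-as-median-of-pairwise-gradients idea can be checked directly; the sketch below does this for the GPA/GMAT data in plain Python and also calls scipy.stats.theilslopes, whose confidence interval is likewise based on Kendall's statistic. The intercept convention in theilslopes differs slightly from StatsDirect, so small differences in the intercept are expected.

import itertools
import numpy as np
from scipy import stats

gpa  = np.array([4.0, 4.0, 3.9, 3.8, 3.7, 3.6, 3.5, 3.5, 3.5, 3.3, 3.2, 3.2])
gmat = np.array([710, 610, 640, 580, 545, 560, 610, 530, 560, 540, 570, 560])

# median of the gradients from all pairwise contrasts with distinct x values
gradients = [(gpa[j] - gpa[i]) / (gmat[j] - gmat[i])
             for i, j in itertools.combinations(range(len(gpa)), 2)
             if gmat[i] != gmat[j]]
print("median pairwise slope =", np.median(gradients))     # about 0.0035

slope, intercept, low, high = stats.theilslopes(gpa, gmat)
print("theilslopes slope =", slope, "95% CI =", (low, high))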

See also: P values, confidence intervals.


Cox regression

Menu location: Analysis_Survival_Cox Regression.

This function fits Cox's proportional hazards model for survival-time (time-to-event) outcomes on one or more predictors.

Cox regression (or proportional hazards regression) is a method for investigating the effect of several variables upon the time a specified event takes to happen. In the context of an outcome such as death this is known as Cox regression for survival analysis. The method does not assume any particular "survival model" but it is not truly non-parametric because it does assume that the effects of the predictor variables upon survival are constant over time and are additive in one scale. You should not use Cox regression without the guidance of a Statistician.

Provided that the assumptions of Cox regression are met, this function will provide better estimates of survival probabilities and cumulative hazard than those provided by the Kaplan-Meier function.

Hazard and hazard-ratios

Cumulative hazard at a time t is the risk of dying between time 0 and time t, and the survivor function at time t is the probability of surviving to time t (see also Kaplan-Meier estimates).

The coefficients in a Cox regression relate to hazard; a positive coefficient indicates a worse prognosis and a negative coefficient indicates a protective effect of the variable with which it is associated.

The hazards ratio associated with a predictor variable is given by the exponent of its coefficient; this is given with a confidence interval under the "coefficient details" option in StatsDirect. The hazards ratio may also be thought of as the relative death rate, see Armitage and Berry (1994). The interpretation of the hazards ratio depends upon the measurement scale of the predictor variable in question, see Sahai and Kurshid (1996) for further information on relative risk of hazards.

Time-dependent and fixed covariates

In prospective studies, when individuals are followed over time, the values of covariates may change with time. Covariates can thus be divided into fixed and time-dependent. A covariate is time dependent if the difference between its values for two different subjects changes with time; e.g. serum cholesterol. A covariate is fixed if its values cannot change with time, e.g. sex or race. Lifestyle factors and physiological measurements such as blood pressure are usually time-dependent. Cumulative exposures such as smoking are also time-dependent but are often forced into an imprecise dichotomy, i.e. "exposed" vs. "not-exposed" instead of the more meaningful "time of exposure". There are no hard and fast rules about the handling of time dependent covariates. If you are considering using Cox regression you should seek the help of a Statistician, preferably at the design stage of the investigation.

Model analysis and deviance

A test of the overall statistical significance of the model is given under the "model analysis" option. Here the likelihood chi-square statistic is calculated by comparing the deviance (-2 * log likelihood) of your model, with all of the covariates you have specified, against the model with all covariates dropped. The individual contribution of covariates to the model can be assessed from the significance test given with each coefficient in the main output; this assumes a reasonably large sample size.

Deviance is minus twice the log of the likelihood ratio for models fitted by maximum likelihood (Hosmer and Lemeshow, 1989 and 1999; Cox and Snell, 1989; Pregibon, 1981). The value of adding a parameter to a Cox model is tested by subtracting the deviance of the model with the new parameter from the deviance of the model without the new parameter; the difference is then tested against a chi-square distribution with degrees of freedom equal to the difference between the degrees of freedom of the old and new models. The model analysis option tests the model you specify against a model with only one parameter, the intercept; this tests the combined value of the specified predictors/covariates in the model.

Some statistical packages offer stepwise Cox regression that performs systematic tests for different combinations of predictors/covariates. Automatic model building procedures such as these can be misleading as they do not consider the real-world importance of each predictor; for this reason StatsDirect does not include stepwise selection.

Survival and cumulative hazard rates

The survival/survivorship function and the cumulative hazard function (as discussed under Kaplan-Meier) are calculated relative to the baseline (lowest value of covariates) at each time point. Cox regression provides a better estimate of these functions than the Kaplan-Meier method when the assumptions of the Cox model are met and the fit of the model is strong.

You are given the option to 'centre continuous covariates'; this makes the survival and hazard functions relative to the mean of continuous variables rather than to their minimum, which is usually the more meaningful comparison.

If you have binary/dichotomous predictors in your model you are given the option to calculate survival and cumulative hazards for each variable separately.

Data preparation

·Time-to-event, e.g. time a subject in a trial survived.

·Event / censor code - this must be ≥ 1 (event(s) happened) or 0 (no event at the end of the study, i.e. "right censored").

·Strata - e.g. centre code for a multi-centre trial. Be careful with your choice of strata; seek the advice of a Statistician.

·Predictors - these are also referred to as covariates, which can be a number of variables that are thought to be related to the event under study. If a predictor is a classifier variable with more than two classes (i.e. ordinal or nominal) then you must first use the dummy variable function to convert it to a series of binary classes.

Technical validation

StatsDirect optimises the log likelihood associated with a Cox regression model until the change in log likelihood with iterations is less than the accuracy that you specify in the dialog box that is displayed just before the calculation takes place (Lawless, 1982; Kalbfleisch and Prentice, 1980; Harris, 1991; Cox and Oakes, 1984; Le, 1997; Hosmer and Lemeshow, 1999).

The calculation options dialog box sets a value (default is 10000) for "SPLITTING RATIO"; this is the ratio in proportionality constant at a time t above which StatsDirect will split your data into more strata and calculate an extended likelihood solution, see Bryson and Johnson (1981).

Ties are handled by Breslow's approximation (Breslow, 1974).

Cox-Snell residuals are calculated as specified by Cox and Oakes (1984). Cox-Snell, Martingale and deviance residuals are calculated as specified by Collett (1994).

Baseline survival and cumulative hazard rates are calculated at each time. Maximum likelihood methods are used, which are iterative when there is more than one death/event at an observed time (Kalbfleisch and Prentice, 1973). Other software may use the less precise Breslow estimates for these functions.

Example

From Armitage and Berry (1994, p. 479).

Test workbook (Survival worksheet: Stage Group, Time, Censor).

The following data represent the survival in days since entry to the trial of patients with diffuse histiocytic lymphoma. Two different groups of patients, those with stage III and those with stage IV disease, are compared.

Stage 3: 6, 19, 32, 42, 42, 43*, 94, 126*, 169*, 207, 211*, 227*, 253, 255*, 270*, 310*, 316*, 335*, 346*

Stage 4: 4, 6, 10, 11, 11, 11, 13, 17, 20, 20, 21, 22, 24, 24, 29, 30, 30, 31, 33, 34, 35, 39, 40, 41*, 43*, 45, 46, 50, 56, 61*, 61*, 63, 68, 82, 85, 88, 89, 90, 93, 104, 110, 134, 137, 160*, 169, 171, 173, 175, 184, 201, 222, 235*, 247*, 260*, 284*, 290*, 291*, 302*, 304*, 341*, 345*

* = censored data (patient still alive or died from an unrelated cause)

To analyse these data in StatsDirect you must first prepare them in three workbook columns as shown below:

Stage group  Time  Censor
1  6  1
1  19  1
1  32  1
1  42  1
1  42  1
1  43  0
1  94  1
1  126  0
1  169  0
1  207  1
1  211  0
1  227  0
1  253  1
1  255  0
1  270  0
1  310  0
1  316  0
1  335  0
1  346  0
2  4  1
2  6  1
2  10  1
2  11  1
2  11  1
2  11  1
2  13  1
2  17  1
2  20  1
2  20  1
2  21  1
2  22  1
2  24  1
2  24  1
2  29  1
2  30  1
2  30  1
2  31  1
2  33  1
2  34  1
2  35  1
2  39  1
2  40  1
2  41  0
2  43  0
2  45  1
2  46  1
2  50  1
2  56  1
2  61  0
2  61  0
2  63  1
2  68  1
2  82  1
2  85  1
2  88  1
2  89  1
2  90  1
2  93  1
2  104  1
2  110  1
2  134  1
2  137  1
2  160  0
2  169  1
2  171  1
2  173  1
2  175  1
2  184  1
2  201  1
2  222  1
2  235  0
2  247  0
2  260  0
2  284  0
2  290  0
2  291  0
2  302  0
2  304  0
2  341  0
2  345  0

Alternatively, open the test workbook using the file open function of the file menu. Then select Cox regression from the survival analysis section of the analysis menu. Select the column marked "Time" when asked for the times, select "Censor" when asked for death/censorship, click on the cancel button when asked about strata, and when asked about predictors select the column marked "Stage group".

For this example:

Cox (proportional hazards) regression

80 subjects with 54 events

Deviance (likelihood ratio) chi-square = 7.634383, df = 1, P = 0.0057

Stage group b1 = 0.96102, z = 2.492043, P = 0.0127

Cox regression - hazard ratios

Parameter    Hazard ratio  95% CI
Stage group  2.614362      1.227756 to 5.566976

Parameter    Coefficient  Standard Error
Stage group  0.96102      0.385636

Cox regression - model analysis

Log likelihood with no covariates = -207.554801

Log likelihood with all model covariates = -203.737609

Deviance (likelihood ratio) chi-square = 7.634383, df = 1, P = 0.0057

The significance test for the coefficient b1 tests the null hypothesis that it equals zero and thus that its exponent equals one. The confidence interval for exp(b1) is therefore the confidence interval for the relative death rate or hazard ratio; we may therefore infer with 95% confidence that the death rate from stage 4 cancers is approximately 3 times, and at least 1.2 times, the risk from stage 3 cancers.
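
If you want to check this fit outside StatsDirect, the minimal sketch below assumes the lifelines package is installed and refits the same 80 observations. Because the handling of tied event times may differ from the Breslow approximation described above, the coefficient and hazard ratio may differ very slightly from the output shown.

import pandas as pd
from lifelines import CoxPHFitter

stage3 = [(6, 1), (19, 1), (32, 1), (42, 1), (42, 1), (43, 0), (94, 1), (126, 0),
          (169, 0), (207, 1), (211, 0), (227, 0), (253, 1), (255, 0), (270, 0),
          (310, 0), (316, 0), (335, 0), (346, 0)]
stage4_times  = [4, 6, 10, 11, 11, 11, 13, 17, 20, 20, 21, 22, 24, 24, 29, 30, 30,
                 31, 33, 34, 35, 39, 40, 41, 43, 45, 46, 50, 56, 61, 61, 63, 68, 82,
                 85, 88, 89, 90, 93, 104, 110, 134, 137, 160, 169, 171, 173, 175,
                 184, 201, 222, 235, 247, 260, 284, 290, 291, 302, 304, 341, 345]
stage4_events = [1]*23 + [0]*2 + [1]*4 + [0]*2 + [1]*12 + [0] + [1]*7 + [0]*10

rows = [(1, t, e) for t, e in stage3] + \
       [(2, t, e) for t, e in zip(stage4_times, stage4_events)]
df = pd.DataFrame(rows, columns=["stage_group", "time", "event"])

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()        # coefficient near 0.96, hazard ratio near 2.6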

...