Medical Statistics - Electronic Textbook: Basic Descriptive Statistics

Source: Southern Medical University (南方医科大学) Quality Courses website

Copyright 1990-2006 StatsDirect Limited

Content

Basic Descriptive Statistics

Quick univariate summary

Univariate summary

Central tendency

Variance

Standard deviation

Standard error

Skewness

Frequencies

Crosstabs

Download a free 10 day StatsDirect trial

Quick univariate summary

Menu location: Analysis_Descriptive_Quick Summary.

This function provides rapid access to descriptive statistics for a worksheet column of data.

Shortcut: click the right mouse button when the mouse cursor is over the column of data you want to describe and you will be given summary statistics for that column, provided the Edit_Options menu item is set to "Column summary".

The statistics calculated here are a subset of those available through the Analysis_Descriptive_Descriptive Report menu function. If you want to calculate summary statistics for more than one column at a time then you must use the Analysis_Descriptive_Descriptive Report menu function.

For definitions of the statistics calculated, please see descriptive report.

Univariate summary

Menu locations:

Analysis_Descriptive_Univariate Summary;

Analysis_Descriptive_Weighted Univariate Summary.

This function provides measures of location and dispersion which describe the data in a worksheet column. You are given the number, arithmetic mean, sum, variance, standard deviation, standard error of the arithmetic mean, coefficient of variation (reported as the variance coefficient), confidence interval for the arithmetic mean, geometric mean, coefficient of skewness, coefficient of kurtosis, maximum, upper quartile, median, lower quartile, minimum and range for each selected variable. You can also choose to calculate an additional quantile and this is appended to the results listed above. Incalculable results are displayed as missing data using an asterisk (*).

If you select more than one column of data to describe then you are given an option to save the results to worksheet columns. Saved columns of results represent the statistics (mean, median etc.) and their rows represent the variables/columns you selected to describe.

Confidence limits (boundaries of the confidence interval) are given for the arithmetic mean. Please see quantile confidence interval for confidence intervals for the median and other measures of location.

Some related topics:

· central tendency

· variance, standard deviation and spread

· skewness

· normal distribution

· quantiles

· quantile confidence intervals

· histogram

Please refer to one of the general textbooks listed in the reference section for discussion of the application and relative merits of individual descriptive statistics.

Definitions

Valid data and missing data:

For each worksheet column that you select, the number of valid data is the number of cells that can be interpreted as numbers; the remaining cells, which cannot be interpreted as numbers, are counted as missing (e.g. empty cell, asterisk or text label). The sample size used in the calculations below is the number of valid data.
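The valid/missing rule described above can be sketched in a few lines of Python. This is only an illustration of the counting rule, not StatsDirect's own parser; the function name `split_valid_missing` is hypothetical, and treating non-finite values such as "nan" as missing is an assumption.

```python
import math

def split_valid_missing(cells):
    """Return (valid_numbers, missing_count) for a column of raw cell values."""
    valid = []
    missing = 0
    for cell in cells:
        try:
            value = float(str(cell).strip())
        except ValueError:
            # Empty cells, asterisks and text labels cannot be read as numbers.
            missing += 1
            continue
        if math.isfinite(value):
            valid.append(value)
        else:
            # Assumption: "nan" and "inf" are treated as missing as well.
            missing += 1
    return valid, missing

column = ["299.85", "", "*", "label", "300.07", 299.62]
valid, missing = split_valid_missing(column)
print("valid:", len(valid), "missing:", missing)
```

The sample size n used by every later calculation would then be `len(valid)`.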

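Sum, mean, variance, standard deviation, standard error and variance coefficient:

$$\text{sum} = \sum_{i=1}^{n} x_i, \qquad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2}, \qquad sem = \frac{s}{\sqrt{n}}$$

$$\text{upper CL} = \bar{x} + t_{\alpha, n-1} \, sem, \qquad \text{lower CL} = \bar{x} - t_{\alpha, n-1} \, sem$$

$$vc = \frac{s}{\bar{x}}$$

- where Σ is the summation for all observations (xi) in a sample, x bar is the sample (arithmetic) mean, n is the sample size, s² is the sample variance, s is the sample standard deviation, sem is the standard error of the sample mean, upper and lower CL are the confidence limits of the confidence interval for the mean, tα, n-1 is the (100*α)% two tailed quantile from the Student t distribution with n-1 degrees of freedom, and vc is the variance coefficient.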
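As a rough sketch, the quantities defined above can be computed with the Python standard library. The function name `summarise` and the small sample are illustrative only; the t quantile is passed in by the caller (here a rounded table value for 7 degrees of freedom) because the standard library has no Student t distribution.

```python
import math

def summarise(xs, t_crit):
    """Summary statistics for a sample.

    t_crit is the two tailed Student t quantile for n - 1 degrees of
    freedom, supplied by the caller (e.g. read from a t table).
    """
    n = len(xs)
    total = math.fsum(xs)
    mean = total / n
    variance = math.fsum((x - mean) ** 2 for x in xs) / (n - 1)
    sd = math.sqrt(variance)
    sem = sd / math.sqrt(n)
    return {
        "n": n,
        "sum": total,
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "sem": sem,
        "lower_cl": mean - t_crit * sem,
        "upper_cl": mean + t_crit * sem,
        "vc": sd / mean,  # variance coefficient; undefined when mean is 0
    }

sample = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up illustrative data
result = summarise(sample, t_crit=2.3646)  # 95% two tailed t, 7 df (rounded)
for name, value in result.items():
    print(f"{name}: {value}")
```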

Skewness and kurtosis:

- where Σ is the summation for all observations (xi) in a sample, x bar is the sample mean and n is the sample size. Note that there are other definitions of these coefficients used by some other statistical software. StatsDirect uses the standard definitions for which critical values are published in standard statistical tables (Pearson and Hartley, 1970; Stuart and Ord, 1994).
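The formula images for these coefficients are not reproduced here. A minimal sketch, assuming the moment-ratio definitions tabulated by Pearson and Hartley (skewness as m3 / m2^1.5 and kurtosis as m4 / m2², where mk is the k-th central moment with divisor n), might look like this. The function names are hypothetical and the choice of definition is an assumption, not something confirmed by the text.

```python
import math

def central_moment(xs, k):
    """k-th central moment with divisor n (assumed definition)."""
    n = len(xs)
    mean = math.fsum(xs) / n
    return math.fsum((x - mean) ** k for x in xs) / n

def skewness(xs):
    """Moment coefficient of skewness, m3 / m2**1.5."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    """Moment coefficient of kurtosis, m4 / m2**2 (about 3 for normal data)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 10]
print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tailed))
```

Under this definition a normal distribution has kurtosis close to 3, which is at least consistent with the value of about 3.26 reported for the Michelson example.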

Geometric mean:

The geometric mean is a useful measure of central tendency for samples that are log-normally distributed (i.e. the logarithms of the observations are from an approximately normal distribution). The geometric mean is not calculated for samples that contain negative values.

$$gm = \exp\left(\frac{\sum_{i=1}^{n} \ln(x_i)}{n}\right)$$

- where Σ is the summation for all observations (xi) in a sample, ln is the natural (base e) logarithm, exp is the exponential function (anti-logarithm for base e), gm is the sample geometric mean and n is the sample size.
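The definition above translates directly into code. In this sketch the function name `geometric_mean` is illustrative, and returning `None` for zero as well as negative values is an assumption (the logarithm of zero is also undefined).

```python
import math

def geometric_mean(xs):
    """exp(mean of ln(x)); returns None when it cannot be calculated."""
    if not xs or any(x <= 0 for x in xs):
        # Assumption: zeros are rejected along with negative values.
        return None
    return math.exp(math.fsum(math.log(x) for x in xs) / len(xs))

print(geometric_mean([1, 10, 100]))
print(geometric_mean([3, -2, 5]))
```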

Weights:

If weights are selected then the weights that you supply are first normalised so that they sum to the total number of observations n:

$$w_i = \frac{n \, v_i}{\sum_{j=1}^{n} v_j}$$

- where vi is a user supplied weight and wi is the normalised weight.
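The normalisation step can be sketched as follows. The function names are hypothetical, and the weighted mean shown (sum of wi * xi divided by n) is only the usual textbook form, included as an assumption for illustration.

```python
import math

def normalise_weights(vs):
    """Rescale user supplied weights so that they sum to n."""
    n = len(vs)
    total = math.fsum(vs)
    return [n * v / total for v in vs]

def weighted_mean(xs, vs):
    """Assumed form of the weighted mean: sum(w_i * x_i) / n."""
    ws = normalise_weights(vs)
    return math.fsum(w * x for w, x in zip(ws, xs)) / len(xs)

weights = normalise_weights([1, 1, 2])
print(weights, sum(weights))
print(weighted_mean([10, 20, 30], [1, 1, 2]))
```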

The following formulae replace the mean, variance and moments calculations defined above when weights are used:

Median, quartiles and range:

For samples that are not from an approximately normal distribution, for example when data are censored to remove very large and/or very small values, the following nonparametric statistics should be used in place of the arithmetic mean, its variance and the other parametric measures above.

Median (50th centile, quantile 0.5), lower quartile (25th centile, quantile 0.25) and upper quartile (75th centile, quantile 0.75) are defined generally as quantiles:

Two different quantile definitions (Weisberg, 1992; Gleason, 1997; Stuart and Ord, 1994) are used in the summary statistics: the first allows for weights and the second is the conventional quantile that is also used in the quantile confidence interval function:

Type 1

- where p is a proportion, Q is the pth quantile (e.g. median is Q(0.5)), u is an observation from a sample after it has been ordered from smallest to largest value, n is the sample size, w is a weight normalised so that it sums to n and

Type 2

- where p is a proportion, Q is the pth quantile (e.g. median is Q(0.5)), fix is the integer part of a real number, h is the fractional part of order statistic i, u is an observation from a sample after it has been ordered from smallest to largest value and n is the sample size.
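The Type 2 formula itself is not reproduced here. One common "conventional" quantile that matches the symbols described (fix, h, ordered observations u) interpolates at position p(n + 1); the sketch below assumes that position rule, which may differ from the exact StatsDirect definition. The function name `quantile_type2` is hypothetical.

```python
def quantile_type2(xs, p):
    """Interpolated quantile at position p * (n + 1) (assumed rule)."""
    u = sorted(xs)
    n = len(u)
    position = p * (n + 1)
    i = int(position)      # fix: integer part of the position
    h = position - i       # fractional part of the position
    if i < 1:
        return u[0]        # below the first order statistic
    if i >= n:
        return u[-1]       # at or beyond the last order statistic
    # u[i - 1] is the i-th ordered observation (1-based indexing in the text).
    return u[i - 1] + h * (u[i] - u[i - 1])

data = [7, 1, 5, 3, 2, 6, 4]
print(quantile_type2(data, 0.25), quantile_type2(data, 0.5), quantile_type2(data, 0.75))
```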

Technical validation

The computational methods used in StatsDirect univariate summary statistics, including this function, provide 15 decimal places of precision. This is tested against known standards such as the reference data set used in the example below.

Example

Test workbook (Parametric worksheet: Michelson).

The data are 100 measurements of the speed (millions of meters per second) of light in air recorded by Michelson in 1879 (Dorsey, 1944). The American National Institute of Standards and Technology uses these data as part of the Statistical Reference Datasets for testing statistical software (McCullough and Wilson, 1999; http://www.nist.gov/itl/div898/strd).

Open the test workbook and select the "Michelson" column. Choose descriptive report from the descriptive section of the analysis menu and click on OK when you see a list of descriptive statistics options.

Results from StatsDirect (with decimal places in Analysis_Options set to 12 and centile type 2 selected):

Descriptive statistics

Variables	Michelson
Valid data	100
Missing data	0
Sum	29985.24
Mean	299.8524
Variance	0.006242666667
Standard deviation	0.079010547819
Variance coefficient	0.000263498134
Standard error of mean	0.007901054782
Upper 95% CL of mean	299.868077406834
Lower 95% CL of mean	299.836722593166
Geometric mean	299.852389694496
Skewness	-0.01825961396
Kurtosis	3.263530532311
Maximum	300.07
Upper quartile	299.895
Median	299.85
Lower quartile	299.805
Minimum	299.62
Range	0.45
Centile 95	299.98
Centile 5	299.73
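The reported figures can be cross-checked against one another without the raw data. The short script below only uses the numbers printed in the table above; the tolerances are arbitrary choices that allow for rounding to 12 decimal places.

```python
import math

# Values copied from the results table above.
n = 100
total = 29985.24
mean = 299.8524
variance = 0.006242666667
sd = 0.079010547819
vc = 0.000263498134
sem = 0.007901054782
upper_cl = 299.868077406834
lower_cl = 299.836722593166
maximum, minimum, value_range = 300.07, 299.62, 0.45

checks = {
    "mean = sum / n": math.isclose(total / n, mean, rel_tol=1e-12),
    "variance = sd squared": math.isclose(sd ** 2, variance, abs_tol=1e-11),
    "sem = sd / sqrt(n)": math.isclose(sd / math.sqrt(n), sem, abs_tol=1e-11),
    "vc = sd / mean": math.isclose(sd / mean, vc, abs_tol=1e-11),
    "CL centred on mean": math.isclose((upper_cl + lower_cl) / 2, mean, rel_tol=1e-12),
    "range = max - min": math.isclose(maximum - minimum, value_range, abs_tol=1e-9),
}
for label, ok in checks.items():
    print(f"{label}: {'ok' if ok else 'MISMATCH'}")

# Half-width of the interval divided by sem recovers the t quantile used.
implied_t = (upper_cl - mean) / sem
print("implied t quantile (99 df):", implied_t)
```

The implied t quantile should be close to the tabulated two tailed 95% value for 99 degrees of freedom (roughly 1.98).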

Central tendency

The three common measures of central tendency of a distribution are the arithmetic mean, the median and the mode. Think of a distribution in terms of a histogram with many bars; a large sample from a normal distribution would describe a bell shaped curve that is symmetrical. In a perfectly symmetrical, non-skewed distribution the mean, median and mode are equal. As distributions become more skewed the difference between these different measures of central tendency gets larger.

The mode is the most commonly occurring value in a distribution, population or sample.

The mean (arithmetic mean) is the average (sum of observations / number of observations) in a distribution, sample or population. The mean is more sensitive to outliers than the median or mode.

The median is the middle value in a sorted distribution, sample or population. When there is an even number of observations the median is the mean of the two central values.
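A small illustration of the three measures, and of the mean's sensitivity to an outlier, using Python's standard `statistics` module. The data are made up for the example.

```python
import statistics

without_outlier = [2, 3, 3, 4, 5, 6]
with_outlier = without_outlier + [40]   # one extreme value added

for label, sample in (("without outlier", without_outlier),
                      ("with outlier", with_outlier)):
    print(label,
          "mean:", statistics.mean(sample),
          "median:", statistics.median(sample),
          "mode:", statistics.mode(sample))
```

Adding the single extreme value moves the mean from about 3.8 to 9, while the median only moves from 3.5 to 4 and the mode stays at 3.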

Variance, standard deviation and spread

The standard deviation (SD) is the most commonly used measure of the spread of values in a distribution. SD is calculated as the square root of the variance (the average squared deviation from the mean).

Variance in a population is:

$$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$$

[x is a value from the population, μ is the mean of all x, n is the number of x in the population, Σ is the summation]

Variance is usually estimated from a sample drawn from a population. The unbiased estimate of population variance calculated from a sample is:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

[x is an observation from the sample, x-bar is the sample mean, n (sample size) - 1 is the degrees of freedom, Σ is the summation]
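The difference between the two divisors can be seen with the standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. The data are illustrative only.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # sum of squared deviations from the mean is 32

population_variance = statistics.pvariance(data)   # 32 / 8
sample_variance = statistics.variance(data)        # 32 / 7

print("population variance:", population_variance)
print("sample variance:", sample_variance)
```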

The spread of a distribution is also referred to as dispersion and variability. All three terms mean the extent to which values in a distribution differ from one another.

SD is the best measure of spread of an approximately normal distribution. This is not the case when there are extreme values in a distribution or when the distribution is skewed; in these situations interquartile range or semi-interquartile range are preferred measures of spread. Interquartile range is the difference between the 25th and 75th centiles. Semi-interquartile range is half of the difference between the 25th and 75th centiles. For any symmetrical (not skewed) distribution, half of its values will lie within one semi-interquartile range either side of the median, i.e. in the interquartile range. When distributions are approximately normal, SD is a better measure of spread because it is less susceptible to sampling fluctuation than (semi-)interquartile range.
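To illustrate why interquartile range is preferred when extreme values are present, the sketch below compares it with SD on two made-up samples that differ only in their largest value. It relies on `statistics.quantiles` (Python 3.8 or later), whose default quartile rule is an assumption here and need not match StatsDirect's centile definitions.

```python
import statistics

def iqr(xs):
    """Interquartile range: 75th centile minus 25th centile."""
    q1, _median, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

regular = [1, 2, 3, 4, 5, 6, 7]
with_outlier = [1, 2, 3, 4, 5, 6, 70]   # largest value replaced by an extreme one

for label, sample in (("regular", regular), ("with outlier", with_outlier)):
    print(label,
          "SD:", round(statistics.stdev(sample), 3),
          "IQR:", iqr(sample),
          "semi-IQR:", iqr(sample) / 2)
```

The interquartile range is the same for both samples, whereas the SD is inflated many times over by the single extreme value.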

If a variable y is a linear (y = a + bx) transformation of x then the variance of y is b² times the variance of x and the standard deviation of y is |b| times the standard deviation of x.
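The scaling rule for a linear transformation can be checked numerically. The values of a and b and the sample are arbitrary; a negative b is used to show that the standard deviation scales by the absolute value of b, and that the constant a has no effect on spread.

```python
import math
import statistics

a, b = 10, -3
x = [1, 2, 3, 4, 5]
y = [a + b * xi for xi in x]

var_x, var_y = statistics.variance(x), statistics.variance(y)
sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)

print("var(y) / var(x):", var_y / var_x)   # expected b squared
print("sd(y) / sd(x):", sd_y / sd_x)       # expected absolute value of b
```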

The standard error of the mean is the expected value of the standard deviation of the means of several samples; it is estimated from a single sample as:

$$sem = \frac{s}{\sqrt{n}}$$

[s is the standard deviation of the sample, n is the sample size]

See descriptive statistics.

Skewness

Skewness describes the asymmetry of a distribution. A skewed distribution therefore has one tail longer than the other.

A positively skewed distribution has a longer tail to the right.

A negatively skewed distribution has a longer tail to the left.

A distribution with no skew (e.g. a normal distribution) is symmetrical.

In a perfectly symmetrical, non-skewed distribution the mean, median and mode are equal. As distributions become more skewed the difference between these different measures of central tendency gets larger.

Positively skewed distributions are more common than negatively skewed ones.

A coefficient of skewness for a sample is calculated by StatsDirect as:

- where xi is a sample observation, x bar is the sample mean and n is the sample size.

Skewed distributions can sometimes be "normalized" by transformation.
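As an illustration of normalizing by transformation, the sketch below takes a made-up positively skewed sample (successive doublings) and shows that its logarithms are symmetrical. The skewness function assumes the moment-ratio form m3 / m2^1.5 with divisor n, which is an assumption and may not be the exact coefficient used by StatsDirect.

```python
import math

def skewness(xs):
    """Moment coefficient of skewness, m3 / m2**1.5 (assumed definition)."""
    n = len(xs)
    mean = math.fsum(xs) / n
    m2 = math.fsum((x - mean) ** 2 for x in xs) / n
    m3 = math.fsum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

raw = [1, 2, 4, 8, 16, 32, 64]        # long tail to the right
logged = [math.log(x) for x in raw]   # evenly spaced after log transformation

print("skewness before:", skewness(raw))
print("skewness after log:", skewness(logged))
```

A log transformation is only one option and only applies to positive data; whether it is appropriate depends on the variable being analysed.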

See descriptive statistics.

Frequencies

Menu location: Analysis_Frequencies.

This function gives the actual and relative values for frequency and cumulative frequency of observations in the samples you select. If you want the cumulative frequencies to represent order then sort the data before using this function.

Example

The following represent responses to an element of a questionnaire that used a Likert scale:

response

In order to analyse these data in StatsDirect, first enter them into a workbook column. Then select this column and choose the frequencies option of the analysis menu.

For this example:

N = 8

Value	Frequency	Relative %	Cumulative frequency	Cumulative relative %
1	2	25	2	25
2	1	12.5	3	37.5
3	3	37.5	6	75
4	1	12.5	7	87.5
5	1	12.5	8	100
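The table above can be reproduced with a short script. The individual responses are not listed in this copy of the page, so the list below is a hypothetical set of eight responses chosen only to match the counts in the table; their order is arbitrary.

```python
from collections import Counter

# Hypothetical responses consistent with the frequency table above.
responses = [3, 1, 5, 3, 2, 1, 4, 3]

def frequency_table(values):
    """Rows of (value, frequency, relative %, cumulative, cumulative %)."""
    n = len(values)
    counts = Counter(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):          # sorted so cumulation follows order
        cumulative += counts[value]
        rows.append((value,
                     counts[value],
                     100 * counts[value] / n,
                     cumulative,
                     100 * cumulative / n))
    return rows

print("N =", len(responses))
for row in frequency_table(responses):
    print(*row, sep="\t")
```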

Crosstabs

Menu location: Analysis_Crosstabs.

This is a two or three way cross tabulation function. If you have two columns of numbers that correspond to different classifications of the same individuals then you can use this function to give a two way frequency table for the cross classification. This can be stratified by a third classification variable.

For two way crosstabs, StatsDirect offers a range of analyses appropriate to the dimensions of the contingency table. For more information see chi-square tests and exact tests.

For three way crosstabs, StatsDirect offers either odds ratio (for case-control studies) or relative risk (for cohort studies) meta-analyses for 2 by 2 by k tables, and generalised Cochran-Mantel-Haenszel tests for r by c by k tables.

Example

A database of test scores contains two fields of interest, sex (M=1, F=0) and grade of skin reaction to an antigen (none = 0, weak + = 1, strong + = 2). Here is a list of those fields for 10 patients:

Sex	Reaction
0	0
1	1
1	2
0	2
1	2
0	1
0	0
0	1
1	2
1	0

In order to get a cross tabulation of these from StatsDirect you should enter these data in two workbook columns. Then choose crosstabs from the analysis menu.

For this example:

		Reaction
		0	1	2
Sex	0	2	2	1
	1	1	1	3
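The same two way table can be built from the ten patient records listed above by counting (sex, reaction) pairs. This is a plain Python sketch of the counting step, not the StatsDirect procedure.

```python
from collections import Counter

# The ten patient records from the example, in the order listed.
sex = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1]
reaction = [0, 1, 2, 2, 2, 1, 0, 1, 2, 0]

pair_counts = Counter(zip(sex, reaction))

# Rows are sex (0, 1); columns are reaction grade (0, 1, 2).
table = [[pair_counts[(s, r)] for r in (0, 1, 2)] for s in (0, 1)]

print("\t\tReaction")
print("\t\t0\t1\t2")
for s, row in zip((0, 1), table):
    print("Sex" if s == 0 else "", s, *row, sep="\t")
```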

We could then proceed to an r by c (2 by 3) contingency table analysis to look for association between sex and reaction to this antigen:

Contingency table analysis

Observed	2	2	1	5
% of row	40%	40%	20%
% of col	66.67%	66.67%	25%	50%

Observed	1	1	3	5
% of row	20%	20%	60%
% of col	33.33%	33.33%	75%	50%

Total	3	3	4	10
% of n	30%	30%	40%
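The row, column and overall percentages in the output above follow from the observed counts alone, as this short sketch shows. Variable names are illustrative.

```python
observed = [[2, 2, 1],   # sex = 0
            [1, 1, 3]]   # sex = 1

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Each cell as a percentage of its row total and of its column total.
row_pct = [[100 * cell / row_totals[i] for cell in row]
           for i, row in enumerate(observed)]
col_pct = [[100 * cell / col_totals[j] for j, cell in enumerate(row)]
           for row in observed]
pct_of_n = [100 * total / n for total in col_totals]

for i, row in enumerate(observed):
    print("Observed", *row, row_totals[i], sep="\t")
    print("% of row", *(f"{p:.4g}%" for p in row_pct[i]), sep="\t")
    print("% of col", *(f"{p:.4g}%" for p in col_pct[i]), sep="\t")
print("Total", *col_totals, n, sep="\t")
print("% of n", *(f"{p:.4g}%" for p in pct_of_n), sep="\t")
```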