Content
Distributions
Copyright © 1990-2006 StatsDirect Limited, all rights reserved
Probability distributions
Normal distribution
Chi-square distribution
Student's t distribution
F (variance ratio) distribution
Studentized range (Q) distribution
Spearman's rho & Hotelling-Pabst T distribution
Kendall's tau distribution
Binomial distribution
Poisson distribution
Non-central t distribution
Gamma distribution
This section covers common statistical probability distributions. Robust, reliable algorithms have been employed to provide a high level of accuracy. For practical purposes, however, the P values given with hypothesis tests throughout StatsDirect are displayed to four decimal places (or the number you specify in the Options section of the Analysis menu).
·Normal
·Chi-square
·Student's t
·F (variance ratio)
·Studentized range Q
·Spearman's rho
·Kendall's tau
·Binomial
·Poisson
·Non-central t
Menu location: Analysis_Distributions.
PROBABILITY DISTRIBUTIONS
Probability is a concept that helps us predict the chance of something happening (an outcome) based upon knowledge of how this type of outcome behaves mathematically. In mathematical language, an outcome is described in terms of a random variable.
A random variable can take on different values which represent different outcomes, e.g. blood pressure readings. Blood pressure can be thought of in infinitely small units of measurement where the steps between the units are so small that they become continuous; this is an example of a continuous random variable.
If a variable cannot take on an infinite number of sub-divisions of values then it is discrete. Discrete random variables take on discrete outcomes such as the number of times an asthmatic patient has been admitted to hospital.
Consider the pattern of values of an outcome measured many times in a population. If you plot all of the values of this outcome on a histogram chart then you are likely to see that the histogram takes on a similar shape each time you plot observations from a large random sample from the population. With a continuous random variable you can draw a curve around the histogram because it is possible to have values in-between any that are measured. With a discrete variable, however, there is a pre-defined number of values that can be measured. If there are only a few discrete values in the population then your histogram will have wide bars with definite steps between them.
Now comes the all-important linking concept: probability distribution. The peaks in histogram plots show that some values occur more frequently than others. The most commonly occurring values are those that have the highest probability of being observed when you take a random sample from the population of interest.
def. A probability distribution of a random variable is a table, graph or mathematical expression giving the probabilities with which the random variable takes different values.
Description of this concept in numbers involves more thought about populations and samples. Consider a graph of probability (P) plotted against the value of outcome (x):
·A probability distribution would include all possible values for x.
·The sum of P for all possible values of x is defined as 1.
·For discrete variables this is literally a simple summation, but for continuous variables the number of possible values of x is infinite so we use integration to calculate the area under the curve. This area is 1 for the total curve.
Now consider one value of x:
·You can use the probability distribution for x to estimate the chance of observing that x at random in the population.
·For discrete distributions we do literally calculate P.
·For continuous distributions we use a partial area under the curve of the probability density function, which represents the probability that x lies between 2 specified values.
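The points above can be illustrated with a minimal Python sketch. It is not StatsDirect code: the function names, the choice of a binomial and a normal distribution as examples, and the trapezoidal integration are all assumptions made for illustration.

```python
import math


def binomial_pmf(r, n, p):
    """P(exactly r successes in n trials): a discrete distribution."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution: a continuous distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def area_under(f, a, b, steps=10_000):
    """Trapezoidal estimate of the area under f between a and b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


# Discrete: the probabilities of all possible outcomes add up to 1.
discrete_total = sum(binomial_pmf(r, 10, 0.3) for r in range(11))

# Continuous: the total area under the density curve is (very nearly) 1.
continuous_total = area_under(normal_pdf, -10.0, 10.0)

# A partial area is the probability that x lies between two specified values.
p_within_one_sd = area_under(normal_pdf, -1.0, 1.0)

print(discrete_total, continuous_total, p_within_one_sd)
```

The last value is the familiar result that roughly 68% of a normal population lies within one standard deviation of the mean.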
Calculated probability (P) values are associated with statistical tests. A test statistic calculated in a statistical test is compared with its probability distribution. The P value derived from this comparison is then used to support the researcher's decision to accept or refute the test hypothesis with an accepted level of certainty.
P values and confidence intervals can give a false sense of security. P values say nothing about the assumptions of your test. Confidence intervals give a more realistic representation of a test result but they do NOT compensate for a test used with invalid assumptions. Please read the help text regarding assumptions when you are using any of the hypothesis tests in StatsDirect.
Discrete distributions: e.g. Binomial, Poisson
Continuous distributions: e.g. Normal, Chi-square, Student's t, F
If you need more information about probability and sampling theory then please consult one of the general texts listed in the reference section.
Menu location: Analysis_Distributions_Normal.
The standard normal distribution is the most important continuous probability distribution. It was first described by De Moivre in 1733 and subsequently by the German mathematician C. F. Gauss (1777 - 1855). StatsDirect gives you tail areas and percentage points for this distribution (Hill, 1973; Odeh and Evans, 1974; Wichura, 1988; Johnson and Kotz, 1970).
Normal distributions are a family of distributions with a symmetrical bell shape:
The area under each of the curves above is the same and most of the values occur in the middle of the curve. The mean and standard deviation of a normal distribution control how tall and wide it is.
The standard normal distribution (z distribution) is a normal distribution with a mean of 0 and a standard deviation of 1. Any point (x) from a normal distribution can be converted to the standard normal distribution (z) with the formula z = (x - mean) / standard deviation. z for any particular x value shows how many standard deviations x is away from the mean for all x values. For example, if 1.4m is the height of a school pupil where the mean for pupils of his age/sex/ethnicity is 1.2m with a standard deviation of 0.4m then z = (1.4 - 1.2) / 0.4 = 0.5, i.e. the pupil is half a standard deviation from the mean (value at centre of curve).
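The pupil height example can be checked with a short Python sketch. The helper name z_score is hypothetical, and the use of the standard library's statistics.NormalDist to read off the lower tail area is an assumption made for illustration, not the StatsDirect routine.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


# Pupil of height 1.4m where the reference mean is 1.2m and the SD is 0.4m.
z = z_score(1.4, 1.2, 0.4)

# Proportion of the reference population expected to be shorter than the pupil.
proportion_below = NormalDist().cdf(z)

print(round(z, 6), round(proportion_below, 4))
```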
The diagram above shows the bell shaped curve of a normal (Gaussian) distribution superimposed on a histogram of a sample from a normal distribution. Many populations display normal or near normal distributions. There are also many mathematical relationships between normal and other distributions. Most statistical methods make "normal approximations" when samples are sufficiently large.
Central Limit Theorem
In order to understand why "normal approximations" can be made, consider the central limit theorem. The central limit theorem may be explained as follows: If you take a sample from a population with some arbitrary distribution, the sample mean will, in the limit, tend to be normally distributed with the same mean as the population and with a variance equal to the population variance divided by the sample size. A histogram plot of the means of many samples drawn from one population will therefore form a normal (bell shaped) curve regardless of the distribution of the population values.
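A small simulation can make the central limit theorem concrete. The sketch below is illustrative only: the exponential population (mean 1, variance 1), the sample size of 30, the number of samples and the fixed seed are all arbitrary assumptions.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the sketch is repeatable

SAMPLE_SIZE = 30
N_SAMPLES = 5000

# A strongly skewed population: exponential with mean 1 and variance 1.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

# The theorem predicts a mean close to 1 (the population mean) and a
# variance close to 1/30 (the population variance over the sample size).
mean_of_means = statistics.fmean(sample_means)
variance_of_means = statistics.variance(sample_means)

print(mean_of_means, variance_of_means, 1 / SAMPLE_SIZE)
```

A histogram of sample_means would look close to bell shaped even though the individual exponential values are far from normal.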
Technical Validation
The tail area of the normal distribution is evaluated to 15 decimal places of accuracy using the complement of the error function (Abramowitz and Stegun, 1964; Johnson and Kotz, 1970). The quantiles of the normal distribution are calculated to 15 decimal places using a method based upon AS 241 (Wichura, 1988).
z0.001 = -3.09023230616781
Lower tail P(z=-3.09023230616781) = 0.001
z0.25 = -0.674489750196082
Lower tail P(z=-0.674489750196082) = 0.25
z1E-20 = -9.26234008979841
Lower tail P(z=-9.26234008979841) = 9.99999999999962E-21
The first two StatsDirect results above agree to 15 decimal places with the reference data of Wichura (1988). The extreme value (lower tail P of 1E-20) evaluates correctly to 14 decimal places.
Function Definition
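Distribution function, F(z), of a standard normal variable z:
$$F(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-u^{2}/2}\,du$$
StatsDirect calculates F(z) from the complement of the error function (erfc):
$$F(z) = \frac{1}{2}\,\mathrm{erfc}\!\left(-\frac{z}{\sqrt{2}}\right)$$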
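The relationship between F(z) and the complementary error function can be sketched in Python as follows. This is an illustration rather than the StatsDirect implementation: it assumes the standard identity F(z) = erfc(-z/sqrt(2))/2, and it assumes that the standard library's NormalDist.inv_cdf is accurate enough to compare against the validation values quoted above.

```python
import math
from statistics import NormalDist


def normal_lower_tail(z):
    """Lower tail area F(z) of the standard normal distribution via erfc."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


# Tail area for the first validation value (expected to be close to 0.001).
p_lower = normal_lower_tail(-3.09023230616781)

# Quantile for a lower tail P of 0.25 (expected close to -0.674489750196082).
z_quarter = NormalDist().inv_cdf(0.25)

print(p_lower, z_quarter)
```

Using erfc for the lower tail avoids the loss of precision that comes from computing 1 minus a number very close to 1 when z is strongly negative.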
Menu location: Analysis_Distributions_Chi-Square.
A variable from a chi-square distribution with n degrees of freedom is the sum of the squares of n independent standard normal variables (z).
[chi (Greek χ) is pronounced ki as in kind]
A chi-square variable with one degree of freedom is equal to the square of the standard normal variable. A chi-square variable with many degrees of freedom is approximately normally distributed (with mean n and variance 2n), as the central limit theorem dictates.
The so called "linear constraint" property of chi-square explains its application in many statistical methods: Suppose we consider one sub-set of all possible outcomes of n random variables (z). The sub-set is defined by a linear constraint:
$$\sum_{i=1}^{n} a_i z_i = k$$
- where a and k are constants. Here the sum of the squares of z follows a chi-square distribution with n-1 degrees of freedom. If there are m linear constraints then the total degrees of freedom is n-m. The number of linear constraints associated with the design of contingency tables explains the number of degrees of freedom used in contingency table tests (Bland, 2000).
Another important relationship of chi-square is as follows: the sum of squares about the mean for a normal sample of size n will follow the distribution of the population variance times chi-square with n-1 degrees of freedom. As the expected value of chi-square is n-1 here, the variance is estimated as the sum of squares about the mean divided by n-1.
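The statement that a chi-square variable with one degree of freedom is the square of a standard normal variable can be turned into a small calculation. The sketch below is an illustration under that relationship only; the function names are hypothetical and it does not reproduce the incomplete gamma method that StatsDirect uses.

```python
import math


def normal_lower_tail(z):
    """Lower tail area of the standard normal distribution."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def chi_square_1df_lower_tail(x):
    """P(chi-square with 1 degree of freedom <= x), using chi-square = z*z."""
    if x <= 0:
        return 0.0
    # z*z <= x is the same event as -sqrt(x) <= z <= sqrt(x).
    root = math.sqrt(x)
    return normal_lower_tail(root) - normal_lower_tail(-root)


# 1.96 squared (about 3.84) is the familiar 5% critical value for 1 d.f.
p_lower = chi_square_1df_lower_tail(1.96**2)

print(p_lower)
```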
Technical Validation
StatsDirect calculates the probability associated with a chi-square random variable with n degrees of freedom; for this, a reliable approach to the incomplete gamma integral is used (Shea, 1988). Chi-square quantiles are calculated for n degrees of freedom and a given probability using the Taylor series expansion of Best and Roberts (1975) when P ≤ 0.999998 and P ≥ 0.000002, otherwise a root finding algorithm is applied to the incomplete gamma integral.
StatsDirect agrees fully with all of the double precision reference values quoted by Shea (1988).
Function Definition
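The distribution function F(x) of a chi-square random variable x with n degrees of freedom is:
$$F(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)} \int_{0}^{x} u^{n/2-1} e^{-u/2}\,du$$
Γ(*) is the gamma function:
$$\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}\,du, \quad x>0$$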
Menu location: Analysis_Distributions_Student's t.
Student's t is the distribution with n degrees of freedom of
$$t = \frac{z}{\sqrt{\chi^{2}/n}}$$
- where z is the standard normal variable and χ² is a chi-square random variable with n degrees of freedom.
When n is large the distribution of t is close to normal. The largest differences between standard normal and t distributions occur in the tails, which are the most important areas in statistical tests. You can see from the diagram below that a t distribution with fewer degrees of freedom has more of its values in the tails:
For samples from a normal distribution, the ratio of the mean to its standard error follows a t distribution. The number of degrees of freedom used should be equal to the sample size minus the number of estimated parameters. In t tests the estimated parameter is the standard deviation about the mean, therefore, degrees of freedom are n-1.
This family of distributions is associated with W. S. Gosset who, at the turn of the twentieth century, published his work under the pseudonym Student.
Technical Validation
StatsDirect uses the relationship between Student's t and the beta distribution in its calculation of tail areas and percentage points for t distributions. Soper's reduction method is used to integrate the incomplete beta function (Majumder and Bhattacharjee, 1973a, 1973b; Morris, 1992). A hybrid of conventional root finding methods is used to invert the function. Conventional root finding proved more reliable than some other methods such as Newton-Raphson iteration on Wilson-Hilferty starting estimates (Berry et al., 1990; Cran et al., 1977). StatsDirect does not use the beta distribution for the calculation of t percentage points in three cases: when there are more than 60 degrees of freedom (n) (Hill, 1970); when n is 2, where t = sqr(2/(P*(2-P))-2); and when n is 1, where t = cos((P*π)/2)/sin((P*π)/2).
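The two closed forms quoted for 1 and 2 degrees of freedom can be checked with a short Python sketch. It assumes that P in those formulas is the two-sided tail probability, which is an assumption made here because it reproduces the familiar table values; the function names are hypothetical.

```python
import math


def t_two_sided_1df(p):
    """t with P(|T| >= t) = p for 1 degree of freedom."""
    half_angle = p * math.pi / 2.0
    return math.cos(half_angle) / math.sin(half_angle)


def t_two_sided_2df(p):
    """t with P(|T| >= t) = p for 2 degrees of freedom."""
    return math.sqrt(2.0 / (p * (2.0 - p)) - 2.0)


def two_sided_p_1df(t):
    """Two-sided tail area for 1 d.f. (t with 1 d.f. is a Cauchy variable)."""
    return 1.0 - (2.0 / math.pi) * math.atan(t)


def two_sided_p_2df(t):
    """Two-sided tail area for 2 d.f. from its closed-form distribution."""
    return 1.0 - t / math.sqrt(2.0 + t * t)


t1 = t_two_sided_1df(0.05)  # about 12.706
t2 = t_two_sided_2df(0.05)  # about 4.303

print(t1, two_sided_p_1df(t1))
print(t2, two_sided_p_2df(t2))
```

Feeding each percentage point back through the matching tail area function returns the original P of 0.05, which is a simple round-trip check on the formulas.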
Function Definition
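The distribution function of a t distribution with n degrees of freedom is:
$$F(t) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\!\left(\frac{n}{2}\right)} \int_{-\infty}^{t} \left(1+\frac{u^{2}}{n}\right)^{-(n+1)/2} du$$
Γ(*) is the gamma function:
$$\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}\,du, \quad x>0$$
A t variable with n degrees of freedom can be transformed to an F variable with 1 and n degrees of freedom as t² = F. An F variable with n1 and n2 degrees of freedom can be transformed to a beta variable with parameters p = n1/2 and q = n2/2 as beta = n1F/(n2 + n1F). The beta distribution with parameters p and q is:
$$I_{x}(p,q) = \frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)} \int_{0}^{x} u^{p-1}(1-u)^{q-1}\,du, \quad 0 \le x \le 1$$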
Menu location: Analysis_Distributions_F (Variance Ratio).
Fisher-Snedecor F is the distribution of the ratio of two independent estimates of variance. The variance estimates should be made from two samples from a normal distribution. The size of these two samples is reflected in two degrees of freedom.
F has two degrees of freedom, n (numerator) and d (denominator), because it represents the distribution of the ratio of two independent chi-square variables each divided by its degrees of freedom:
$$F = \frac{\chi_{n}^{2}/n}{\chi_{d}^{2}/d}$$
F can be used to compare two estimates of variance but it is mainly used to compare groups of means and to examine the combined effect of several factors (ways of grouping data) in analysis of variance. For a simple one way analysis of variance (one factor between subjects) there is one grouping of observations into k groups, here n = k-1 and d = N-k where N is the total number of subjects observed. The denominator degrees of freedom, d, is sometimes referred to as the error degrees of freedom. When degrees of freedom are small, larger F values are required to reach significance:
When there is one numerator degree of freedom and d denominator degrees of freedom then F is equal to the square of Student's t with d degrees of freedom. Both F and t are related mathematically by the beta function.
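The degrees of freedom for a one way analysis of variance, n = k-1 and d = N-k, can be seen in a minimal Python sketch. The function name one_way_f and the toy data are assumptions made for illustration; the sketch computes only the F ratio and its degrees of freedom, not the tail area.

```python
def one_way_f(groups):
    """Return (F, numerator d.f., denominator d.f.) for k groups of values."""
    k = len(groups)
    total_n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / total_n

    # Between-groups and within-groups (error) sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_numerator = k - 1            # n = k - 1
    df_denominator = total_n - k    # d = N - k
    f_ratio = (ss_between / df_numerator) / (ss_within / df_denominator)
    return f_ratio, df_numerator, df_denominator


# Three groups of three subjects: k = 3 and N = 9.
groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [8.0, 9.0, 10.0]]
f_ratio, df_numerator, df_denominator = one_way_f(groups)

print(f_ratio, df_numerator, df_denominator)
```

For these data the between-groups mean square is 24/2 and the error mean square is 6/6, so F is 12 with 2 and 6 degrees of freedom.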
Technical Validation
StatsDirect calculates tail areas and percentage points for given numerator and denominator degrees of freedom. Soper's reduction method is used to integrate the incomplete beta function (Majumder and Bhattacharjee, 1973a, 1973b; Morris, 1992). A hybrid of conventional root finding methods is used to invert the function. Conventional root finding proved more reliable than some other methods such as Newton-Raphson iteration on Wilson-Hilferty starting estimates (Berry et al., 1990; Cran et al., 1977).
Function Definition
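An F variable with n1 and n2 degrees of freedom can be transformed to a beta variable with parameters p = n1/2 and q = n2/2 as beta = n1F/(n2 + n1F). The beta distribution with parameters p and q is:
$$I_{x}(p,q) = \frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)} \int_{0}^{x} u^{p-1}(1-u)^{q-1}\,du, \quad 0 \le x \le 1$$
Γ(*) is the gamma function:
$$\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}\,du, \quad x>0$$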
Menu location: Analysis_Distributions_Studentized Range (Q).
The Studentized range, Q, is a statistic due to Newman (1939) and Keuls (1952) that is used in multiple comparison methods. Q is defined as the range of means divided by the estimated standard error of the mean for a set of samples being compared. The estimated standard error of the mean for a group of samples is usually derived from analysis of variance.
Technical Validation
StatsDirect calculates tail areas and percentage points for a given number of samples and sample sizes (Copenhaver and Holland, 1988). These calculations are highly iterative, therefore you may notice a delay during their computation. Other software and web sites might produce different results to StatsDirect; this is likely to be due to their use of less precise or more restricted algorithms (Gleason, 1999; Lund and Lund, 1983; Royston, 1987).
Menu location: Analysis_Distributions_Spearman's rho.
Given a value for the Hotelling-Pabst test statistic (T) or Spearman's rho (ρ) this function calculates the probability of obtaining a value greater than or equal to T.
For two rankings (x1,x2...xn and y1,y2...yn) of n objects without ties:
$$T = \sum_{i=1}^{n} (x_i - y_i)^{2}$$
T is related to the Spearman rank correlation coefficient (ρ) by:
$$\rho = 1 - \frac{6T}{n^{3} - n}$$
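The link between T and Spearman's rho can be sketched in Python. The sketch assumes the usual definitions, namely that T is the sum of squared differences between the two rankings and that rho = 1 - 6T/(n³ - n) when there are no ties; the function names and the example rankings are hypothetical.

```python
def hotelling_pabst_t(x_ranks, y_ranks):
    """Sum of squared differences between two rankings without ties."""
    return sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))


def spearman_rho_from_t(t, n):
    """Spearman's rho from the Hotelling-Pabst statistic for n objects."""
    return 1.0 - 6.0 * t / (n**3 - n)


# Two rankings of five objects that disagree on two adjacent pairs.
x_ranks = [1, 2, 3, 4, 5]
y_ranks = [2, 1, 4, 3, 5]

t_stat = hotelling_pabst_t(x_ranks, y_ranks)
rho = spearman_rho_from_t(t_stat, len(x_ranks))

print(t_stat, rho)
```

Identical rankings give T = 0 and rho = 1, while completely reversed rankings give the largest possible T and rho = -1.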
Technical Validation
Probabilities are calculated by summation across all permutations when n ≤ 10 and by an Edgeworth series approximation when n > 10 (Best and Roberts, 1975). The exact calculation employs a corrected version of the Best and Roberts (1975) algorithm. The Edgeworth series results for n > 10 are accurate to at least four decimal places.
The inverse is calculated by finding the largest value of T that gives a calculated probability (using the method above) closest to but not less than the P value entered.
Menu location: Analysis_Distributions_Kendall's tau.
Given a value for the test statistic (S) associated with Kendall's tau (τ) this function calculates the probability of obtaining a value greater than or equal to S for a given sample size.
Consider two samples, x and y, each of size n, giving n paired observations. The total number of possible pairings of these observations is n(n-1)/2. A pair of observations i and j is concordant if x and y order them in the same way (xi > xj and yi > yj, or xi < xj and yi < yj), otherwise the pair is discordant. S is the difference between the number of concordant (ordered in the same way, nc) and discordant (ordered differently, nd) pairs.
Tau (τ) is related to S by:
$$\tau = \frac{S}{n(n-1)/2} = \frac{n_c - n_d}{n(n-1)/2}$$
If there are tied (same value) observations then τb is used:
$$\tau_b = \frac{S}{\sqrt{\left[\frac{n(n-1)}{2} - \sum_i \frac{t_i(t_i-1)}{2}\right]\left[\frac{n(n-1)}{2} - \sum_i \frac{u_i(u_i-1)}{2}\right]}}$$
- where ti is the number of observations tied at a particular rank of x and ui is the number tied at a rank of y. When there are no ties τb = τ.
This function does not calculate probabilities for τb.
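Counting concordant and discordant pairs is easy to sketch in Python. The example assumes no ties and uses the common definition that a pair of observations is concordant when x and y order the two observations in the same way; the function names and data are hypothetical, and no probabilities are calculated.

```python
def kendall_s(x, y):
    """S = concordant pairs minus discordant pairs (tied pairs count as zero)."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                s += 1  # concordant: ordered the same way on x and y
            elif product < 0:
                s -= 1  # discordant: ordered differently on x and y
    return s


def kendall_tau(x, y):
    """Kendall's tau for samples without ties: S divided by n(n-1)/2."""
    n = len(x)
    return 2.0 * kendall_s(x, y) / (n * (n - 1))


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

s_stat = kendall_s(x, y)
tau = kendall_tau(x, y)

print(s_stat, tau)
```

Here 8 of the 10 possible pairs are concordant and 2 are discordant, so S = 6 and tau = 0.6.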
Technical Validation
Probabilities are calculated by summation across all permutations when n ≤ 50 and by an Edgeworth series approximation when n > 50 (Best, 1974). The two samples are assumed to have been ranked without ties.
The inverse is calculated by finding the largest value of S that gives a calculated upper tail probability (using the method above) closest to but not less than the P value entered. Please note that the results may differ slightly from tables in textbooks because StatsDirect calculates the inverse of Kendall's statistic more accurately than the routines used to calculate Best's widely quoted 1974 table.
Menu location: Analysis_Distributions_Binomial.
A binomial distribution occurs when there are only two mutually exclusive possible outcomes, for example the outcome of tossing a coin is heads or tails. It is usual to refer to one outcome as "success" and the other outcome as "failure".
If a coin is tossed n times then a binomial distribution can be used to determine the probability, P(r), of exactly r successes:
$$P(r) = \frac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{n-r}$$
Here p is the probability of success on each trial; in many situations this will be 0.5, for example the chance of a coin coming up heads is 50:50 (p = 0.5). The assumptions of the above calculation are that the n events are mutually exclusive, independent and randomly selected from a binomial population. Note that ! denotes a factorial and that 0! is defined as 1; likewise anything to the power of 0 is 1.
In many situations the probability of interest is not that associated with exactly r successes but instead it is the probability of r or more (≥ r) or at most r (≤ r) successes. Here the cumulative probability is calculated:
$$P(\le r) = \sum_{i=0}^{r} P(i), \qquad P(\ge r) = \sum_{i=r}^{n} P(i)$$
The mean of a binomial proportion (r/n) is p and its standard deviation is sqr(p(1-p)/n); for the number of successes, r, the mean is np and the standard deviation is sqr(np(1-p)). The shape of a binomial distribution is symmetrical when p = 0.5 and approximately symmetrical when n is large.
When n is large and p is close to 0.5, the binomial distribution can be approximated from the standard normal distribution; this is a special case of the central limit theorem:
$$z = \frac{r - np}{\sqrt{np(1-p)}}$$
Please note that confidence intervals for binomial proportions with p = 0.5 are given with the sign test.
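The exact and cumulative binomial probabilities can be sketched with the Python standard library. This is a minimal illustration assuming the usual binomial formula, with hypothetical function names; it uses math.comb rather than the log gamma approach that StatsDirect describes.

```python
import math


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n trials."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)


def binomial_at_most(r, n, p):
    """Cumulative probability of at most r successes."""
    return sum(binomial_pmf(i, n, p) for i in range(r + 1))


def binomial_at_least(r, n, p):
    """Cumulative probability of r or more successes."""
    return sum(binomial_pmf(i, n, p) for i in range(r, n + 1))


# Ten tosses of a fair coin.
p_exactly_7 = binomial_pmf(7, 10, 0.5)        # 120/1024
p_at_least_7 = binomial_at_least(7, 10, 0.5)  # 176/1024
p_at_most_6 = binomial_at_most(6, 10, 0.5)    # the complement of the above

print(p_exactly_7, p_at_least_7, p_at_most_6)
```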
Technical Validation
StatsDirect calculates the probability for exactly r and the cumulative probabilities for (≥, ≤) r successes in n trials. The gamma function is a generalised factorial function and it is used to calculate each binomial probability. The core algorithm evaluates the logarithm of the gamma function (Cody and Hillstrom, 1967; Abramowitz and Stegun, 1972; Macleod, 1989) to the limit of 64 bit precision.
Γ(*) is the gamma function:
$$\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}\,du, \quad x>0$$
Γ(1) = 1
Γ(x+1) = xΓ(x)
Γ(n) = (n-1)!
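The reason for working with the logarithm of the gamma function can be shown with a short Python sketch. It is an illustration of the general technique, not the StatsDirect algorithm: it assumes 0 < p < 1, uses the standard library's math.lgamma, and the function names are hypothetical.

```python
import math


def log_binomial_pmf(r, n, p):
    """Logarithm of the binomial probability, assuming 0 < p < 1."""
    # n! / (r! (n-r)!) written with the identity Gamma(n + 1) = n!
    log_coefficient = (
        math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
    )
    return log_coefficient + r * math.log(p) + (n - r) * math.log(1 - p)


def binomial_pmf_via_lgamma(r, n, p):
    """Binomial probability computed on the log scale, then exponentiated."""
    return math.exp(log_binomial_pmf(r, n, p))


small_case = binomial_pmf_via_lgamma(7, 10, 0.5)

# 10000! is far too large for a floating point number, but its logarithm
# is not, so the probability can still be evaluated.
large_case = binomial_pmf_via_lgamma(5000, 10000, 0.5)

print(small_case, large_case, math.gamma(5))
```

Summing logarithms and exponentiating once at the end keeps every intermediate value within floating point range, which a direct product of large factorials and tiny powers would not.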
Menu location: Analysis_Distributions_Poisson.
A Poisson distribution is the distribution of the number of events in a fixed time interval, provided that the events occur at random, independently in time and at a constant rate.
The event rate, µ, is the number of events per unit time. When µ is large, the shape of a Poisson distribution is very similar to that of a normal distribution. The change in shape of a Poisson distribution with increasing µ is very similar to that of the equivalent binomial distribution with increasing n. Convergence of distributions in this way can be explained by the central limit theorem.
Consider a time interval divided into many sub-intervals of equal length such that the probability of an event in a sub-interval is small and the probability of more than one event is negligible. If the probability of an event in each sub-interval is the same as and independent of that probability for other sub-intervals then n sub-intervals can be thought of as n independent trials. This is why Poisson distributions are closely related to binomial distributions.
Both the mean and variance of a Poisson distribution are equal to µ. The probability of r events happening in unit time with an event rate of µ is:
$$P(r) = \frac{e^{-\mu}\mu^{r}}{r!}$$
The summation of this Poisson frequency function from zero to infinity will always be equal to one as:
$$\sum_{r=0}^{\infty} \frac{e^{-\mu}\mu^{r}}{r!} = e^{-\mu} e^{\mu} = 1$$
Analysis of mortality statistics often employs Poisson distributions on the assumption that deaths from most diseases occur independently and at random in populations (see Poisson rate confidence interval). Other common uses of Poisson are in physics to model radioactive particle emission and in insurance companies to model accident rates.
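The Poisson frequency function and its running total can be sketched in Python. The sketch assumes the usual form P(r) = exp(-µ)µ^r/r!, and the function names and the event rate of 2 are hypothetical choices for illustration.

```python
import math


def poisson_pmf(r, mu):
    """Probability of exactly r events when the event rate is mu."""
    return math.exp(-mu) * mu**r / math.factorial(r)


def poisson_at_most(r, mu):
    """Cumulative probability of at most r events."""
    return sum(poisson_pmf(i, mu) for i in range(r + 1))


mu = 2.0
p_exactly_3 = poisson_pmf(3, mu)

# The running total approaches 1 as more terms are included.
partial_total = poisson_at_most(5, mu)
near_complete_total = poisson_at_most(50, mu)

# The mean of the distribution, summed over enough terms, is close to mu.
mean_estimate = sum(r * poisson_pmf(r, mu) for r in range(51))

print(p_exactly_3, partial_total, near_complete_total, mean_estimate)
```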
Technical Validation
StatsDirect calculates cumulative probabilities that (≤, ≥, =) r random events are contained in an interval when the average number of such events per interval is µ. The gamma function is a generalised factorial function and it is used to calculate each Poisson probability (Knusel, 1986). The core algorithm evaluates the logarithm of the gamma function (Cody and Hillstrom, 1967; Abramowitz and Stegun, 1972; Macleod, 1989) to the limit of 64-bit precision. The inverse is found using a bisection algorithm to one order of magnitude larger than the limit of 64-bit precision.
Γ(*) is the gamma function:
$$\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}\,du, \quad x>0$$
Γ(1) = 1
Γ(x+1) = xΓ(x)
Γ(n) = (n-1)!
Menu location: Analysis_Distributions_Non-Central t.
Non-central t (T) represents a family of distributions which are shaped by n degrees of freedom and a non-centrality parameter (δ).
Non-central t may be expressed in terms of a normal and a chi-square distribution:
$$T = \frac{z}{\sqrt{\chi^{2}/n}}$$
- where z is a normal variable with mean δ and variance 1 and χ² is a chi-square random variable with n degrees of freedom (Owen, 1965).
In the field of meta-analysis some effect size statistics display a non-central t distribution. This function may therefore be useful in hypothesis testing and confidence interval construction for effect sizes (Greenland and Robins, 1985).
Technical Validation
StatsDirect evaluates the cumulative probability that a t random variable is less than or equal to a given value of T with n degrees of freedom and non-centrality parameter δ (Lenth, 1989; Owen, 1965; Young and Minder, 1974; Thomas, 1979; Chou, 1985; Boys, 1989; Goedhart and Jansen, 1992). The inverse of T is found by conventional root finding methods to six decimal places of accuracy.
Menu location: Analysis_Distributions_Gamma.
The gamma distribution depends upon two parameters: A (the shaping parameter) and B (the scaling parameter).
The gamma density function is:
$$f(t) = \frac{t^{A-1} e^{-t/B}}{B^{A}\,\Gamma(A)}, \quad t \ge 0$$
The gamma function Γ(*) is:
$$\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}\,du, \quad x > 0$$
If A is an integer then Γ(A) = (A-1)! where x! is factorial x.
When A = 1 this gives an exponential distribution. If you vary A then the shape of the distribution will change. Changing B does not affect the shape of the distribution, just its scale on the x axis.
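The effect of the two parameters can be checked with a minimal Python sketch. It assumes the common parameterisation in which B is a scale parameter, so that the density is t^(A-1) exp(-t/B) / (B^A Γ(A)); that form and the function names are assumptions made for illustration.

```python
import math


def gamma_density(t, a, b):
    """Gamma density at t > 0 for shape a and scale b (assumed form)."""
    return t ** (a - 1) * math.exp(-t / b) / (b**a * math.gamma(a))


def exponential_density(t, b):
    """Exponential density at t >= 0 with mean b."""
    return math.exp(-t / b) / b


# With A = 1 the gamma density reduces to the exponential density.
gamma_value = gamma_density(1.5, 1.0, 2.0)
exponential_value = exponential_density(1.5, 2.0)

# Changing B only rescales the x axis: f(t; A, B) = f(t/B; A, 1) / B.
scaled_value = gamma_density(3.0, 2.5, 2.0)
rescaled_unit_value = gamma_density(3.0 / 2.0, 2.5, 1.0) / 2.0

print(gamma_value, exponential_value, scaled_value, rescaled_unit_value)
```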
See random number fill.