# BASIC STATISTICS

Descriptive Statistics are used to summarize or describe a set of data. This page covers the basic definitions (mean, mode, median, etc.), as well as standard deviation, the normal distribution, and conversion to percentile.

## Descriptive Statistics

Most statistics is either descriptive or inductive (inferential). Descriptive statistics are calculations based on the data that describe or summarize that data, for example the mean (arithmetic average). It is assumed that all the data in the sample are related (e.g. height of 20 year olds, maximum day temperature in Sydney for March, etc).

Continuous data (e.g. tonnes per hour of coal on a conveyor) are more often used in a trend analysis than in descriptive statistics. Trend analysis is usually used to pick up whether the readings are in danger of drifting out of spec, so it only works if the readings are in consecutive order.

So for descriptive statistics, the data is a set of readings or measurements taken on a group of related items in no particular order, like a large barrel of parts produced in one batch (where we don't know the order they were made).


## Basic Terms

Examples below are based on this small set of observations/readings/samples/values: {34, 27, 45, 55, 22, 27}

The set can also be called the population/group/sample/data.

Description | Formula | Example | Excel |
---|---|---|---|
The ith value | x_{i} | x_{3} = 45 | =A3 |
Count = Number of values | n | 6 | count(A1:A6) |
Maximum = highest value | x_{max} | 55 | max(A1:A6) |
Minimum = lowest value | x_{min} | 22 | min(A1:A6) |
Range = Maximum - Minimum | x_{max} - x_{min} | 55 - 22 = 33 | |
Mean = Sum of all values / Count. Common symbols for mean are x̄ and μ | Σx_{i} / n | (34 + 27 + 45 + 55 + 22 + 27) / 6 = 35 | average(A1:A6) |
Median = Middle number when listed in order (or average of middle 2) | | 22, 27, 27, 34, 45, 55: (27 + 34) / 2 = 30.5 | median(A1:A6) |
Mode = most frequent value or range of values (frequency diagram) | | mode(22, 27, 27, 34, 45, 55) = 27 | mode(A1:A6) |
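As a rough cross-check of the basic terms, the same quantities can be computed with Python's standard `statistics` module. This is only a sketch using the six example values; the variable names are illustrative.

```python
import statistics

# The six example readings used throughout this page
values = [34, 27, 45, 55, 22, 27]

count = len(values)                    # n
maximum = max(values)                  # x_max
minimum = min(values)                  # x_min
value_range = maximum - minimum        # x_max - x_min
mean = statistics.mean(values)         # sum of all values / count
median = statistics.median(values)     # average of the middle two (even count)
mode = statistics.mode(values)         # most frequent value

print(count, maximum, minimum, value_range)   # 6 55 22 33
print(mean, median, mode)
```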

## Histograms (Frequency Distribution)

A histogram is a graph of frequencies shown as bars. Each bar or "bin" is a certain range. These intervals (or bands, or bins) are generally of the same size, adjacent and non-overlapping.

The choice of bin size is important. A histogram needs at least 20 or 30 measurements or the bars will be too crude to make sense. Ideally, a large number of measurements will allow the bins to be fairly small (narrow) which will give a smoother and more accurate distribution. With fewer measurements the bins must be larger, otherwise every bin might be just 0, 1 or 2!
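To illustrate the effect of bin size, here is a minimal sketch that counts values into equal-width, adjacent, non-overlapping bins. The function name `histogram_counts` and its parameters are made up for this example. It deliberately uses only the six example readings, which is far too few for a real histogram, to show how narrow bins end up holding just 1 or 2 values.

```python
def histogram_counts(values, bin_width, start):
    """Count values into equal-width bins; keys are each bin's lower edge."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)   # which bin the value falls in
        lower = start + index * bin_width       # lower edge of that bin
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

readings = [34, 27, 45, 55, 22, 27]

# Narrow bins: almost every occupied bin holds a single reading
narrow = histogram_counts(readings, bin_width=2, start=20)
print(narrow)   # {22: 1, 26: 2, 34: 1, 44: 1, 54: 1}

# Wide bins: fewer bars, but each one holds more readings
wide = histogram_counts(readings, bin_width=20, start=20)
print(wide)     # {20: 4, 40: 2}
```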

### Example

Heights of 31 Black Cherry trees. For practical reasons we chose this relatively small sample.

Max = 87, min = 63, range = 24, average = 76, standard deviation = 6.268 feet

If this sample is reliable (which it probably isn't) and the heights are roughly normally distributed, we would expect about 68% of Black Cherry trees to be within 76 ft +/- 6.3 ft tall.

## Standard Deviation

Standard Deviation is a measure of the spread or dispersion of the values. It tells you how tightly the values in a set of data are clustered around the mean.

The calculation of standard deviation is actually the root mean square (RMS) of the deviation of the values from the mean.

This can be calculated either for the whole population (population standard deviation), or just a sample (sample standard deviation). Sample SD is commonly used when there are too many items to measure, such as a small sample from a large batch of parts. It is also suitable for polls, market research and experiments where it is not possible to measure the whole population. Sample SD is the default.

Standard Deviation (Sample) | Standard Deviation (Population) |
---|---|
$S = \sqrt{\sum (X - M)^{2} / (N - 1)}$ | $\sigma = \sqrt{\sum (X - M)^{2} / N}$ |
Excel: stdev() | Excel: stdevp() |

X = Individual value

M = Mean of all values

N = Sample size (Number of values)

Common symbols for Standard Deviation are: SD, S or σ.

Variance is the square of Standard Deviation.

Variance = S^{2}

### Examples

Standard Deviation (Population)

Example: To find the Standard deviation of {34, 27, 45, 55, 22, 27}.

1. Calculate the mean

2. Find deviation from mean.

3. Square the deviation

4. Total these squares

5. Divide by N

6. Take square root

X | M | X-M | (X-M)^{2} |
---|---|---|---|
34 | 35 | -1 | 1 |
27 | 35 | -8 | 64 |
45 | 35 | 10 | 100 |
55 | 35 | 20 | 400 |
22 | 35 | -13 | 169 |
27 | 35 | -8 | 64 |
TOTAL | | | 798 |
divide n | | | 133 |
sq root | | | 11.53 |
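The six steps of the population calculation can be followed in code. This is a sketch that mirrors the worked table one step at a time; the variable names are illustrative.

```python
import math

values = [34, 27, 45, 55, 22, 27]
n = len(values)

mean = sum(values) / n                        # 1. calculate the mean
deviations = [x - mean for x in values]       # 2. deviation from the mean
squares = [d ** 2 for d in deviations]        # 3. square each deviation
total = sum(squares)                          # 4. total the squares
variance = total / n                          # 5. divide by N (population)
sd_population = math.sqrt(variance)           # 6. take the square root

print(mean, total, variance)                  # 35.0 798.0 133.0
print(round(sd_population, 2))                # 11.53
```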

Standard Deviation (Sample)

Example: To find the Standard deviation of {34, 27, 45, 55, 22, 27}.

1. Calculate the mean

2. Find deviation from mean.

3. Square the deviation

4. Total these squares

5. Divide by (n-1)

6. Take square root

X | M | X-M | (X-M)^{2} |
---|---|---|---|
34 | 35 | -1 | 1 |
27 | 35 | -8 | 64 |
45 | 35 | 10 | 100 |
55 | 35 | 20 | 400 |
22 | 35 | -13 | 169 |
27 | 35 | -8 | 64 |
TOTAL | | | 798 |
divide n-1 | | | 159.6 |
sq root | | | 12.63 |

This sample SD means these values are 6 readings from a batch, and we are trying to get a rough idea about the whole batch.
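For comparison, Python's standard `statistics` module provides both versions directly: `stdev()` divides by n-1 (sample) and `pstdev()` divides by n (population), much like Excel's stdev() and stdevp(). A short sketch with the same six values:

```python
import statistics

values = [34, 27, 45, 55, 22, 27]

sample_sd = statistics.stdev(values)            # divides by n - 1
population_sd = statistics.pstdev(values)       # divides by n
sample_variance = statistics.variance(values)   # square of the sample SD

print(round(sample_sd, 2))        # 12.63
print(round(population_sd, 2))    # 11.53
```

The sample SD is always a little larger than the population SD for the same data, because the total of squares is divided by a smaller number.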

How to do Standard Deviation on calculator (Casio FX82 AU)

Page 13 of http://www.casio.edu.shriro.com.au/produ...

**Example**

1. Set statistics (MODE then press 2) and press 1 for variable statistics.

*If the calculator has a frequency column showing, turn the frequency off by pressing SHIFT, MODE (SET UP), REPLAY down, 3 for statistics then 2 for 'off'.*

2. Enter the data. Press 34 then = 27 = 45 = ... = 27.

*If you make a mistake, don't worry! Simply scroll up to the wrong score and type the correct value over it. If you've left a value out just put it at the end. If you've put in an additional score that you don't require, highlight the score and press o. The incorrect score will be removed and the following scores moved up one position.*

3. Press AC to finish data entering.

*Don't panic when the scores disappear! The data entering screen will disappear but can be brought back if required.*

4. Press SHIFT 1 to get menu 1:Type 2:Data 3:Sum 4:Var 5:MinMax... Press **4** for Var...

5. Another menu appears: 1:n 2:x̄ 3:σx 4:sx. Press 3 to get the Population Standard Deviation (11.53), or 4 for the Sample Standard Deviation (12.63).

## Normal (Gaussian) Distribution

A perfectly smooth histogram is usually just called a frequency distribution, or simply a distribution curve. It can either be generated by a very large number of measurements, or by approximating (smoothing) out a rougher histogram.

The graph above is called the Normal Distribution (or Gaussian distribution). It is perfectly balanced - where the mean is exactly in the middle (median) and it is also the highest or most common value (mode). The curve has a specific bell shape that might be wider (more spread out) or narrower (closer to the mean). This type of frequency distribution (or very similar) occurs often in real life measurements of large populations. e.g. measurement of stature.

One standard deviation is a specific distance from the mean μ. By including every value within 1σ of the mean (μ ± σ, where μ is the arithmetic mean) you will have about 68.3% of the population. About 95% of the values are within two standard deviations (μ ± 2σ), and about 99.7% lie within 3 standard deviations (μ ± 3σ). So common-ness is measured in standard deviations.

The fraction of values within n standard deviations of the mean is erf(n / √2). The percentile of a value that lies n standard deviations above the mean is erf(n / √2) * 50% + 50%.
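The 68 / 95 / 99.7 figures can be checked with the error function in Python's `math` module. This is a sketch assuming a perfectly normal distribution; the function names `fraction_within` and `percentile_at` are made up for this example.

```python
import math

def fraction_within(n):
    """Fraction of a normal population within n standard deviations of the mean."""
    return math.erf(n / math.sqrt(2))

def percentile_at(n):
    """Fraction of a normal population below a value n standard deviations above the mean."""
    return 0.5 + 0.5 * math.erf(n / math.sqrt(2))

for n in (1, 2, 3):
    print(n, round(fraction_within(n) * 100, 2))
# 1 68.27
# 2 95.45
# 3 99.73
```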

### Z-Score

A Z-score is how many standard deviations a particular score is from the mean: z = (x - μ) / σ. So a z-score of 1 is 1 standard deviation above the mean μ.
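A z-score is a one-line calculation. The sketch below applies it to the six example values used earlier on this page (mean 35, population SD about 11.53); the function name `z_score` is illustrative.

```python
import statistics

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

values = [34, 27, 45, 55, 22, 27]
mean = statistics.mean(values)       # 35
sd = statistics.pstdev(values)       # population SD, about 11.53

z_high = z_score(55, mean, sd)       # highest reading, above the mean
z_low = z_score(22, mean, sd)        # lowest reading, below the mean

print(round(z_high, 2), round(z_low, 2))   # 1.73 -1.13
```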

## Conversion to Percentile

A percentile is the value of a variable below which a certain percent of observations fall. So the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

The 25th percentile is also known as the first quartile (Q1); the 50th percentile as the median or second quartile (Q2); the 75th percentile as the third quartile (Q3).

In a normal distribution the average (mean) is the 50th percentile, and so are the median and mode, because the curve is symmetric about the mean.

### Cumulative Probability

This table shows 310 intervals (pink area to the right of mean) which are the cumulative probability between the Mean and the Z-Score.

For positive z scores, Percentile = 0.5 + p(z)

For negative z scores, Percentile = 0.5 - p(z)
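When a printed table is not to hand, p(z) can be computed instead. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper names `p` and `percentile` are made up to match the notation above.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def p(z):
    """Cumulative probability between the mean and the z-score (the table value)."""
    return standard_normal.cdf(abs(z)) - 0.5

def percentile(z):
    """Fraction of the population below the given z-score."""
    return 0.5 + p(z) if z >= 0 else 0.5 - p(z)

print(round(p(2.22) * 100, 2))            # 48.68
print(round(percentile(-2.22) * 100, 2))  # 1.32
```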

Example:

In a normal distribution of weights of filled cement bags, the sample average (mean) is 20.04 kg and the sample standard deviation is 18 g. If the bags are sold as 20 kg, what percentage of bags are expected to be underweight?

Working in grams: μ = 20040, σ = 18

To be underweight, a bag must be more than (20040 g - 20000 g) = 40 g below the mean.

Number of σ's = 40/18 = 2.222

This is a negative z score (z = -2.222) because the 20 kg limit is BELOW the mean by 2.222 σ

Reading from the table above, to 2 decimal places, p(2.22) = 48.68%

So percentile = 50% - 48.68% = 1.32%

This says 1.32% of bags will be underweight, and 98.68% will be at or above the labelled weight.
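The cement bag example can also be checked without a table. This sketch assumes the weights are normally distributed and works in grams (20.04 kg = 20040 g, 20 kg = 20000 g). Because it does not round z to 2.22 first, the answer differs slightly in the third significant figure from the table lookup.

```python
from statistics import NormalDist

# Cement bag weights in grams: mean 20040 g, standard deviation 18 g
bags = NormalDist(mu=20040, sigma=18)

z = (20000 - 20040) / 18           # about -2.222
underweight = bags.cdf(20000)      # fraction of bags below 20 kg

print(round(z, 3))
print(f"{underweight:.2%} underweight")
```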

Using this calculator, enter the top 3 values then click the "Compute x" button to find the value x.

The percentile is the probability that gives the x you are looking for.

Example (Cement bags)

μ = 20040, σ = 18, Probability = ?, x = 20000

Keep entering Probability (0 = 0% to 1 = 100%) until you get an x of 20000

Solution: Probability = 0.013171 = 1.3171%

This says 1.32% of bags will be underweight, and 98.68% will be at or above the labelled weight.
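The "enter a probability, get x" direction is the inverse of the cumulative distribution. As a sketch, `NormalDist.inv_cdf` in the Python standard library does the same job, so trial and error is not needed. The weights are again taken in grams (mean 20040 g, SD 18 g), and the probabilities tried are only illustrative.

```python
from statistics import NormalDist

bags = NormalDist(mu=20040, sigma=18)   # cement bag weights in grams

# Probability -> x: the weight below which the given fraction of bags falls
x_at_1_percent = bags.inv_cdf(0.01)
x_at_50_percent = bags.inv_cdf(0.50)    # the mean, for a normal distribution

# x -> probability -> x: the round trip returns the starting weight
probability = bags.cdf(20000)
x_back = bags.inv_cdf(probability)

print(round(x_at_1_percent, 1), round(x_at_50_percent, 1), round(x_back, 1))
```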

This calculator is not quite matching the Excel function NORMSDIST() which gives 1.320933881% , but rounding to the accuracy of the table (2 decimal places) it works fine.

## Six Sigma

Six Sigma is a quality improvement strategy.

Sigma means Standard Deviation, so Six Sigma is 6 standard deviations from the mean. Well, sort of. The term "six sigma process" comes from the notion that if one has six standard deviations between the process mean and the nearest specification limit, as shown in the graphic, there will be practically no items that fail to meet specifications. (LSL = Lower Spec. Limit, USL = Upper...)

Using the Normal (Gaussian) distribution, calculation of 6 standard deviations from the mean actually gives only 1 part in 507 million outside the limits!

zσ | percentage within CI | percentage outside CI | ratio outside CI |
---|---|---|---|
1σ | 68.2689492% | 31.7310508% | 1 / 3.1514871 |
1.645σ | 90% | 10% | 1 / 10 |
1.960σ | 95% | 5% | 1 / 20 |
2σ | 95.4499736% | 4.5500264% | 1 / 21.977894 |
2.576σ | 99% | 1% | 1 / 100 |
3σ | 99.7300204% | 0.2699796% | 1 / 370.398 |
3.2906σ | 99.9% | 0.1% | 1 / 1000 |
4σ | 99.993666% | 0.006334% | 1 / 15,788 |
5σ | 99.9999426697% | 0.0000573303% | 1 / 1,744,278 |
6σ | 99.9999998027% | 0.0000001973% | 1 / 506,800,000 |
7σ | 99.9999999997440% | 0.0000000002560% | 1 / 390,600,000,000 |
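The "outside" column is the two-sided tail of the normal distribution, which can be reproduced with the complementary error function. This is a sketch assuming a perfectly normal, centered process; `fraction_outside` is an illustrative name. `math.erfc` is used rather than `1 - math.erf(...)` because it keeps its precision for the very small tails at 6σ and beyond.

```python
import math

def fraction_outside(z):
    """Fraction of a normal population more than z standard deviations from the mean (both tails)."""
    return math.erfc(z / math.sqrt(2))

for z in (1, 2, 3, 4, 5, 6):
    outside = fraction_outside(z)
    print(f"{z} sigma: {outside:.10%} outside, about 1 in {1 / outside:,.0f}")
```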

Why?

Sigma level | DPMO | Percent defective | Percentage yield |
---|---|---|---|
1 | 691,462 | 69% | 31% |
2 | 308,538 | 31% | 69% |
3 | 66,807 | 6.7% | 93.3% |
4 | 6,210 | 0.62% | 99.38% |
5 | 233 | 0.023% | 99.977% |
6 | 3.4 | 0.00034% | 99.99966% |
7 | 0.019 | 0.0000019% | 99.9999981% |

Experience has shown that in the long term, processes deteriorate for a number of reasons. The mean can drift, and the short term standard deviations can expand over time. To account for this real-life increase in process variation over time, a 1.5 sigma shift is introduced into the calculation. So setting up a 6 Sigma process at the start should provide at least a 4.5 Sigma process in the long term.
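The DPMO figures (defects per million opportunities) follow from the 1.5 sigma shift. The sketch below assumes, as the table values suggest, that only the tail beyond the nearer specification limit is counted, so a "6 sigma" process is evaluated at 6 - 1.5 = 4.5 standard deviations. The function name `dpmo` and its `shift` parameter are made up for this example.

```python
import math

def dpmo(sigma_level, shift=1.5):
    """Defects per million: one-sided normal tail beyond (sigma_level - shift) standard deviations."""
    z = sigma_level - shift
    tail = 0.5 * math.erfc(z / math.sqrt(2))   # fraction beyond z, one side only
    return tail * 1_000_000

for level in range(1, 8):
    print(level, f"{dpmo(level):,.3f}")

# Without the shift, a 6 sigma process has a far smaller one-sided tail
print(dpmo(6, shift=0))
```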

Six-Sigma Process with +1.5s Shift vs. Centered Three-Sigma Process

Despite a shift of 1.5σ in the long term mean (target), a 6σ process has only 3.4 ppm defective (4.5σ), compared to a more typical 3σ process, with a failure rate of 2700 ppm.

The Six Sigma strategy uses standard statistical tools, but they are employed in a systematic project-oriented fashion through the define, measure, analyze, improve and control (DMAIC) cycle. Plus a bunch of other acronyms we don't have time for.

Criticisms

- Nothing new, and a risk of creating a cottage industry of training and certification (yet another industrial management fad).
- Some claim that a strict Six Sigma implementation can stifle creativity by encouraging incremental rather than large innovations.
- Six Sigma is arbitrary. Why not 5 or 7 Sigma? The 3.4 ppm (which is really 4.5 Sigma) is industry specific. A pacemaker process might need higher standards, a direct mail process lower.
- 1.5 Sigma shift is arbitrary. Why not 0.5 or 2 Sigma shift? It also gives an overstated appearance (6 Sigma) when it is really only a 4.5 Sigma.