As mentioned previously, a frequency distribution contains all of the observations for a particular
sample, which we refer to as the raw data. Only on rare occasions do we present the raw data. The
purpose of descriptive statistics is to provide a means of summarizing the information contained
within a frequency distribution. The two most important pieces of information that need to be
provided for any distribution are the central tendency of the distribution, and the dispersion of
the distribution. Measures of central tendency essentially describe the position of the distribution
on the X-axis (the value of the variable being measured), whereas measures of dispersion describe
how spread out the observations are along the X-axis. In a number of cases the shape of the
distribution, specifically the degree of symmetry, will also be important to describe.
(Chapter 3 in Zar, 2010)
Where a particular distribution of data is located on the X-axis (which represents the values of the
variable being measured) is summarized by reference to some value associated with the
approximate center of the distribution. The 3 standard measures of central tendency are mean,
median, and mode.
The mean simply is the arithmetic average of the observations, and is a summary statistic that we
all are familiar with. For this reason, we will use the formula for calculation of the sample mean to
indicate some of the notation that we will be using throughout the semester:
$$\bar{Y} = \frac{\sum Y}{n}$$
We will use Y to denote the value of an observation. The total number of observations, i.e., the
sample size, will be denoted as n. When we wish to identify a specific observation, we will use a
subscript. For example, Y4 would indicate the 4th observation. ∑Y is the notation that we will use
for the sum of all the observations (Y1+Y2+Y3+…+Yn). In this case a subscript for the Y would
indicate a particular group of observations, e.g., ∑Ycontrol would indicate the sum of all the
observations for the control group. The Y with the bar over it is the sample mean, and it is an
estimate, based on our sample, of the actual mean of the statistical population. We can express this as:
$$\hat{\mu} = \bar{Y}$$
The population mean is represented by the symbol μ (lower case Mu in the Greek alphabet), and
the caret (^) over the top of it indicates that it is an estimate. Thus, the preceding formula can be
read as “the sample mean is an estimate of the population mean”.
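As a minimal sketch of the notation above (the observations are hypothetical, and the function name is ours, not part of the course materials), the sample mean can be computed in Python as the sum of the observations divided by the sample size:

```python
def sample_mean(observations):
    """Return the sample mean: the sum of the observations (sum of Y) divided by n."""
    n = len(observations)        # sample size, n
    total = sum(observations)    # sum of all observations, Y1 + Y2 + ... + Yn
    return total / n


# Hypothetical observations, used only to illustrate the notation
Y = [4.0, 7.0, 5.0, 8.0, 6.0]
y_bar = sample_mean(Y)           # 30 / 5
print(y_bar)
```

Here `y_bar` plays the role of the Y with the bar over it, and serves as our estimate of μ.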
When making estimates, we are far more concerned with accuracy (proximity to the actual value)
than we are with precision (proximity of estimates to one another). Of critical importance to
obtaining accuracy in our estimates is the use of estimators that are unbiased. An unbiased
estimator is as likely to overestimate as it is to underestimate, whereas a biased estimator will tend
to consistently overestimate, or consistently underestimate. Write that down…you are about to be
asked to evaluate bias.
We can use our newfound skills at reading frequency distributions (if you are feeling less than
skillful, review last week’s material) to examine the behavior of the sample mean as an estimate of
the population mean. The following graph was produced by drawing (at random) 1000 samples of
50 observations each from a statistical population where μ=10. This population mean of 10 was
subtracted from each of the sample means (such that a value of 0 would indicate that the sample
mean and population mean were identical) calculated from the 1000 samples to produce the values shown below:
Note: These data were produced as the “smean” object in this R program
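The R program itself is not reproduced here. As a rough Python analogue of the procedure just described, the sketch below draws 1000 random samples of 50 observations and subtracts μ from each sample mean. It assumes a normally distributed population with a standard deviation of 2; the text gives only μ=10 for this graph, so both of those choices are assumptions, as are the variable names.

```python
import random
import statistics

random.seed(1)        # fixed seed so the sketch is repeatable

MU = 10.0             # population mean, as stated in the text
SIGMA = 2.0           # ASSUMED population standard deviation
N_SAMPLES = 1000      # number of random samples, as stated in the text
SAMPLE_SIZE = 50      # observations per sample, as stated in the text

differences = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(MU, SIGMA) for _ in range(SAMPLE_SIZE)]
    # A value of 0 would mean the sample mean matched the population mean
    differences.append(statistics.mean(sample) - MU)

overestimates = sum(1 for d in differences if d > 0)
print("average difference:", statistics.mean(differences))
print("overestimates:", overestimates)
print("underestimates (or exact matches):", N_SAMPLES - overestimates)
```

Comparing the counts of overestimates and underestimates is one way to approach the question of bias with your own simulated values.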
Question 1: From these data, does the sample mean appear to be an unbiased estimate of the
population mean? Justify your answer.
The other 2 measures of central tendency, the median and the mode, will return values similar to
the mean for distributions that are symmetrical, like the one above, but can convey different, and
sometimes important, information when applied to asymmetrical distributions. The median is the
middle observation when the observations are aligned in ascending (or descending) order by the
magnitude of their values. This can be a useful measure, because 50% of your observations are
above that value, and 50% of your observations fall below that value. The mode is the value
that occurs most frequently, i.e., the peak of the frequency distribution.
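A short sketch using Python's standard `statistics` module illustrates both measures. The observations are hypothetical; the second sample has an even number of observations, so there is no single middle value and the median falls halfway between the 2 middle observations.

```python
import statistics

# Hypothetical observations, already sorted in ascending order
odd_sample = [2, 3, 3, 5, 9]          # n = 5, so the 3rd value is the middle one
even_sample = [2, 3, 3, 5, 9, 10]     # n = 6, so the middle falls between 3 and 5

print(statistics.median(odd_sample))   # the middle observation
print(statistics.median(even_sample))  # halfway between the 2 middle observations
print(statistics.mode(odd_sample))     # the most frequently occurring value
```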
For distributions that are symmetrical, the mean, median, and mode should converge on the same
value. The distribution that follows displays observations of feeding rates of fruit fly (Drosophila
melanogaster) larvae, measured by counting the number of times the feeding apparatus
(cephalopharyngeal sclerites) contracted over the period of a minute.
The distribution of feeding rates is (more or less) symmetrical, resulting in the sample mean,
sample median, and sample mode all being approximately 85 contractions per minute. For
symmetrical distributions, all 3 measures convey basically the same information. When
distributions are asymmetrical, you have to carefully consider what information you wish to
convey when choosing a measure of central tendency. The following distribution was created by
examining the distribution of Vicia sp. (vetch), a twining legume growing in the lawn outside of the
Pacer Commons dorm on campus, by counting the number of individuals present in a series of 0.5
As you can see, this distribution exhibits a positive (right) skew, resulting in different values for
the sample mean, sample median, and sample mode. Reporting the sample mean will give a value
that does not occur frequently as an observation, and so you would have to weigh whether
frequency is a more important piece of information than the position of the distribution for the
question that you are addressing.
In some instances, a distribution may be suggestive of more than one coherent group of
observations, such as the distribution of exam grades shown below:
In such cases, the sample mean and median are poor indications of the pattern, and one should
report both modes (this type of distribution is referred to as “bimodal”).
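To illustrate with made-up numbers (these are not the exam grades from the course), `statistics.multimode` in Python 3.8 or later returns every value tied for most frequent, which makes it a convenient way to report both modes:

```python
import statistics

# Hypothetical exam grades forming two clusters, one near 60 and one near 90
grades = [55, 60, 60, 60, 65, 70, 85, 90, 90, 90, 95]

modes = statistics.multimode(grades)   # all values tied for the highest frequency
print("modes:", modes)
print("mean:", statistics.mean(grades))
print("median:", statistics.median(grades))
```

In this hypothetical sample the mean and median both land between the two clusters, where few grades actually occur.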
While it is important to recognize the existence and potential uses for other measures of central
tendency, it will be a rare occasion when a measure of central tendency other than the sample mean
(Chapter 4 in Zar, 2010)
While the position of a distribution on the X-axis is a critical piece of information to convey, the
relevance of that measure depends on how wide that distribution is, i.e., the amount of variation in
that variable, especially when making comparisons between or among distributions. Measures of
dispersion are indices of how spread out the observations are along the X-axis.
The simplest measure of dispersion is the range, which involves reporting the lowest and highest
observation, or the difference between them. This measure is very sensitive to outliers, which are
values that are unusually high or low relative to the other observations. While it is not difficult to
find recommendations for excluding outliers from a set of data, unless it is clear that the
observation is impossible, e.g., a human body temperature of 183 degrees C, or it is known that an
error in measurement occurred, one should always be hesitant to remove such observations (see
section 2.5 in chapter 2 of your text).
The reason that range is sensitive to outliers is that it relies on only 2 of your observations. Clearly
a measure of dispersion that relied on all of your observations would be of more value, and better
justify all the hard work that went into collecting those observations. Our newfound, and in-depth,
understanding of central tendency suggests one possible measure: the average distance of the
observations from the center of the distribution.
The distance of an observation from the sample mean can be calculated by subtracting the sample
mean from the observation as follows:
$$y = Y - \bar{Y}$$
This value, indicated by a lowercase y, is called a deviate. Intuitively then, the average distance
would be the sum of the deviates, ∑y, divided by the number of observations, n. The problem with
this can be illustrated by examining the following table of quiz scores from 2 separate sections of a
Because the sample mean is the mathematical center of the observations, the sum of the deviates
will always (within rounding error) be equal to zero. The two distributions of quiz scores are
clearly different, but the average deviations will provide no information about these differences.
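The same point can be checked numerically. The sketch below uses two hypothetical sets of quiz scores (not the values from the course table) with the same mean but different spread; the function name is ours.

```python
def deviates(observations):
    """Return each observation minus the sample mean (y = Y - Y-bar)."""
    y_bar = sum(observations) / len(observations)
    return [value - y_bar for value in observations]


# Hypothetical quiz scores: same mean (8), different spread
section_1 = [6, 7, 8, 9, 10]
section_2 = [2, 5, 8, 11, 14]

for scores in (section_1, section_2):
    y = deviates(scores)
    print(y, "sum of deviates:", sum(y))
```

Both sums come out to zero, so the average deviate cannot distinguish the two sections.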
The solution that we will apply is to square the deviates, making all of the differences positive. The
notation that we will use for a squared deviate will be y², such that ∑y² will indicate the sum of
the squared deviates. The sum of the squared deviates is generally referred to as the sum of
squares, and is a value that will figure prominently in virtually all of the analyses that we will
address, so make sure that you are familiar with how to calculate it, and what it represents.
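As a sketch of the calculation (using hypothetical quiz scores rather than the course table, and a function name of our own choosing), the sum of squares can be computed by squaring each deviate and adding the results:

```python
def sum_of_squares(observations):
    """Return the sum of the squared deviates from the sample mean."""
    y_bar = sum(observations) / len(observations)
    return sum((value - y_bar) ** 2 for value in observations)


# Hypothetical quiz scores: same mean (8), different spread
section_1 = [6, 7, 8, 9, 10]     # squared deviates: 4, 1, 0, 1, 4
section_2 = [2, 5, 8, 11, 14]    # squared deviates: 36, 9, 0, 9, 36

print(sum_of_squares(section_1))
print(sum_of_squares(section_2))
```

Unlike the sum of the deviates, these two values differ, with the more spread-out section producing the larger sum of squares.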
Applying this to the quiz score data, we can see that the sum of squares (∑y²) better reflects the
differences between the two distributions:
Dividing the sum of the squared deviates by the number of observations (∑y²/n) will give us the
average squared distance of the observations from the mean of the observations. While it should be
intuitive that this is a good measure of the spread of the observations (apart from using squared
distances, which we will address shortly), we cannot lose sight of the fact that the purpose of
deriving this value from a sample is to estimate the same parameter for the statistical population.
Thus, it is important to establish whether calculating this value as described will introduce a bias in
the estimation of the same population parameter.
Calculation of the average squared distance of the observations from the mean for a statistical
population, i.e., using every observation that exists, is a parameter that we call the population
variance, and denote using the symbol: σ². Unfortunately, using the same calculation from sample
data produces a biased estimate of σ². The following distribution was produced by taking 1000
random samples from a statistical population with μ=10, and σ²=4, and calculating the average
squared distance of the observations from the mean of the observations for each sample. For each
sample, the population variance (σ²) was subtracted from the average squared distance of the
observations from the sample mean ((∑y²/n)-σ²) to produce the values shown below, such that an
estimate matching the population variance would result in a value of 0:
Note: These data were produced as the “pvd” object in this R program
Question 2: In what direction is the bias demonstrated for the average squared distance of
the observations from the sample mean as an estimate of σ²?
The distribution above suggests that a different calculation must be used to produce an unbiased
estimate of σ² from sample data. In this instance the correction is a simple one, involving the use
of n-1 in the denominator instead of n. The resulting formula calculates a statistic we call
sample variance, denoted as s²:
$$s^2 = \frac{\sum y^2}{n - 1}$$
In the following graph, the sample variance (s²) calculated from the same series of 1000 random
draws has been plotted as a second series (SS/(n-1)):
Note: The additional series was produced as the “svd” object in this R program
From this distribution, we can see that the correction for sample variance removes the bias from
the estimate. Thus, we will use sample variance (s²) as our best estimate of population variance:
$$\hat{\sigma}^2 = s^2$$
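The comparison can be approximated in Python. This is a sketch under stated assumptions, not the R program referred to above: the population is assumed to be normally distributed, and the sample size of 10 is an assumption because the text does not give one. For each of 1000 samples it records both the sum of squares divided by n and the sum of squares divided by n-1, each with σ² subtracted.

```python
import random
import statistics

random.seed(2)        # fixed seed so the sketch is repeatable

MU = 10.0             # population mean, as stated in the text
SIGMA_SQ = 4.0        # population variance, as stated in the text
SAMPLE_SIZE = 10      # ASSUMED; the text does not give the sample size
N_SAMPLES = 1000      # number of random samples, as stated in the text

divide_by_n = []            # (sum of squares / n) - population variance
divide_by_n_minus_1 = []    # (sum of squares / (n - 1)) - population variance
for _ in range(N_SAMPLES):
    sample = [random.gauss(MU, SIGMA_SQ ** 0.5) for _ in range(SAMPLE_SIZE)]
    y_bar = sum(sample) / SAMPLE_SIZE
    ss = sum((value - y_bar) ** 2 for value in sample)
    divide_by_n.append(ss / SAMPLE_SIZE - SIGMA_SQ)
    divide_by_n_minus_1.append(ss / (SAMPLE_SIZE - 1) - SIGMA_SQ)

print("average error using n:  ", statistics.mean(divide_by_n))
print("average error using n-1:", statistics.mean(divide_by_n_minus_1))
```

An average error near 0 indicates an unbiased calculation; an average error that sits consistently away from 0 indicates bias.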
The only issue one may take with variance as an indication of the spread of the data, is that the
units are squared relative to the values of the observations and, therefore, the mean. The solution to
this, as you might imagine, is a simple one: simply take the square root of the variance. This
produces a value referred to as the standard deviation, which, for a sample, we denote as s, and
for a population, we denote as σ. Obviously (at least I hope that it is obvious), the square root of a
sample variance (calculated with n-1 as the denominator) will produce a sample standard deviation
(s), and the square root of a population variance (calculated using n as the denominator) will
produce a population standard deviation (σ). Given that we will almost always be working with
samples, we will use sample standard deviation as our estimate of population standard deviation:
$$\hat{\sigma} = s$$
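For reference, Python's standard `statistics` module distinguishes the two calculations in the same way: `variance` and `stdev` divide by n-1, whereas `pstdev` divides by n. A brief sketch with hypothetical observations:

```python
import math
import statistics

# Hypothetical observations (sum of squares = 90, n = 5)
Y = [2, 5, 8, 11, 14]

s_squared = statistics.variance(Y)           # sample variance: 90 / (5 - 1)
s = statistics.stdev(Y)                      # sample standard deviation
sigma_if_population = statistics.pstdev(Y)   # divides by n: sqrt(90 / 5)

print(s_squared, s, sigma_if_population)
print(math.isclose(s, math.sqrt(s_squared)))  # s is the square root of s squared
```

The value from `pstdev` is only appropriate if the observations make up the entire statistical population, which will rarely be the case for us.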
Now let’s practice calculating some descriptive statistics for some actual data. Download the Excel
workbook for this week’s exercise HERE.
The first worksheet (birds) contains the data from Example 3.3 in your textbook (p. 25). This will
allow you to double-check your calculations, and the ones Excel does for you.
In cell F15, type the formula to calculate the sample mean for species B as:
Type “mean” in the cell immediately adjacent to the cell containing the sample mean (G15), so that
you don’t become confused later (and so that I am not confused when I review your spreadsheet).
Excel has a function to calculate the median that we will use in cell F16:
Add a label for the median in the adjacent cell as you did for the mean. Note that the value for the
median does not occur among the list of observations. The reason for this is that when there are an
even number of observations, we interpolate between the 2 middle observations to get the median.
Now highlight the 2 cells containing the formulae for mean and median, use “Ctrl+c” to copy the
cells, click on cell A15, and use “Ctrl+v” to paste. That feeling of anxiety that you are experiencing
is the result of your conscious (or subconscious) recognition that the sample sizes for the 2 groups
of observations differ. Pasting the formulas results in calculations for species A that include a
blank cell. Use “F2” to verify this.
Remember the words inscribed in friendly letters upon each copy of The Hitchhiker’s Guide to the
Galaxy: “Don’t Panic” (if you have yet to read any of the 5 books in this trilogy, please correct this
alarming oversight at your earliest convenience). For now, let’s take an objective and analytical
approach to examining the consequences of our actions.
Because there are an odd number of observations for the life span of species A, and because these
observations have been sorted in order of ascending value, we can see at a glance that there is, in
fact, a middle observation, and that the value of that observation matches the value of the median
as calculated by Excel. It would appear that the “MEDIAN” function ignores blank cells. We can
verify that the same is true for both the “SUM” and “COUNT” functions by recalculating the mean
using the “AVERAGE” function. Type the following into cell A17:
Not only have we verified that several important functions ignore blank cells, which makes life a
little easier (because we can paste formulas) when dealing with unequal sample sizes, but we also
have verified that the “AVERAGE” function follows the formula that we learned (or more likely
were reminded of) for the sample mean. Feel the tension draining away?
We now are going to work on calculating the variance for both samples. In cell G3, type the
formula to calculate the deviate as:
=F3-F$15
Having the anchor ($) for the row number allows you to copy the formula down for the remaining
observations while referencing the same cell for the mean. Anchoring the column is not necessary
when the formula is only being copied down, and leaving the column unanchored will allow you to
copy the column in its entirety to calculate the deviates for the observations for species B, because
the reference will match the location of the sample mean. You will be doing yourself a favor if you
take the time to verify this…
In the next column, type the formula to square the deviate as:
=G3^2
Copy the formula down the column. We could have eliminated a step by using a single formula
(=(F3-F$15)^2) in column G, but this is a good reminder of the steps that we discussed (and besides,
I made you put labels where we would need to calculate sums).
It’s time to take the training wheels off. Let’s remind ourselves of the formula for sample variance:
$$s^2 = \frac{\sum y^2}{n - 1}$$
You should be able to calculate the sample variance using the “SUM” function, and the “COUNT”
function. Presumably you can count the observations on your own, but this will be good practice
for when we use larger sample sizes. Just make sure that you use parentheses in your formula to
get the correct order of operations when subtracting 1 from the count, or you will be subtracting 1
from the population variance! You also should be able to repeat these calculations for Species A by
cutting and pasting if you have been careful with your cell references.
Lastly, calculate the sample standard deviation for the two samples. To find the square root of a
value in Excel, the “SQRT” function is used as:
=SQRT(value)
The value can be an actual number, or the cell location for a value. For example, if your sample
variance was located in cell C15, the sample standard deviation could be calculated as:
=SQRT(C15)
Make sure to label both sample variance and sample standard deviation clearly on your worksheet,
and remember to save your work!
The second worksheet (fish) contains mass and standard length measurements for bluegill sunfish
(Lepomis macrochirus) and hybrids of bluegill sunfish and green sunfish (Lepomis cyanellus),
collected from a constructed pond in Sedgwick County, Kansas. These measurements have been
used to calculate a “condition factor” (K), which is a ratio of the mass to the cube of the length (in
cm). Fish with a larger value for K will have more mass for a given length. Because green sunfish
have a larger gape, and tend to be more aggressive, there was some question as to whether the
introduction of the hybrids might have a negative effect on the condition of the bluegill sunfishes.
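As a sketch of the ratio described above, following only that description (mass divided by the cube of the length in cm), the calculation might look like this in Python. The mass units (grams), the fish measurements, and the function name are all assumptions for illustration, and the workbook may apply a scaling constant that is not shown here.

```python
def condition_factor(mass_g, standard_length_cm):
    """Return K as mass divided by the cube of standard length (ASSUMED units: g, cm)."""
    return mass_g / standard_length_cm ** 3


# Hypothetical fish as (mass in g, standard length in cm); not data from the worksheet
fish = [(40.0, 10.0), (54.0, 12.0)]

for mass, length in fish:
    print(mass, length, condition_factor(mass, length))
```

The first hypothetical fish has the larger K, meaning more mass for its length, even though the second fish is heavier overall.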
The following graph shows the frequency distributions for the condition factor for both species:
It should be immediately evident that the distributions are similar in terms of their central
tendencies, but differ in the degree of dispersion of the data.
Question 3: Calculate the mean, variance, and standard deviation of both sets of condition
factor data and determine whether these summary statistics reflect the similarities and
differences that can be observed between the two distributions.
Let’s move on to examining symmetry and standard error…
Send comments, suggestions, and corrections to: Derek Zelmer
Symmetry and Kurtosis
As has been mentioned previously, distributions that have an equal number of observations spread
similarly on either side of the mode, are said to be symmetrical. For such distributions, the mean
and median will have close to the same value. One example that we have seen of a symmetrical
distribution was for the difference of sample means from the population mean:
For this distribution the mean is 0.00368, and the median is 0.00342. The 4 distributions we
worked with in our Excel workbook also are symmetrical, as you can see by comparing the sample
means to the sample medians. The degree to which a distribution deviates from symmetry is
referred to as the skewness. We saw an example of a skewed distribution with the vetch data:
In this example, the data show a positive skew (or right skew), with the tail stretching to the right.
There is a rule of thumb that suggests that the position of the mean relative to the median will give
you the direction of the skew. In this instance, that appears to be the case, as the mean is to the
right of the median, but this rule of thumb is not a good one to apply, especially for multimodal
distributions, because it is not consistent enou…