Exploratory Data Analysis
A first look at the data.
EDA methods can be cross-classified along two dimensions. The first dimension is whether the method is graphical or non-graphical: graphical methods summarize the data in a diagrammatic or pictorial way, while non-graphical methods involve the calculation of summary statistics. The second dimension is whether the method looks at one variable at a time (univariate) or at two or more variables together (multivariate). Crossing the two dimensions gives four types of EDA: univariate non-graphical, univariate graphical, multivariate non-graphical, and multivariate graphical. Alongside this classification of methods, each variable can be classified by its role (outcome or explanatory) and by its type (categorical or quantitative).
Univariate Non-Graphical EDA
Categorical Data
The characteristics of interest
The characteristics of interest for a categorical variable are simply the range of values and the frequency (or relative frequency) of occurrence for each value.
Univariate non-graphical techniques
Therefore, the only useful univariate non-graphical technique for categorical variables is some form of tabulation of the frequencies, usually along with calculation of the fraction (or percent) of data that falls in each category.
Example
For example, if we categorize grades as Matric, Fsc, Bsc and “other”, then there is a true population of all students enrolled in the 2007 Fall semester. If we take a random sample of 20 students, the EDA is simply a tabulation of how many of the sampled students fall in each category, along with the corresponding fractions or percents.
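As a sketch of what such a tabulation might look like in practice, here is a minimal Python example; the 20 students and their per-category counts are made up for illustration and are not the data from the original example.

```python
# Hypothetical sample of 20 students; the counts are invented for illustration.
import pandas as pd

grades = pd.Series(
    ["Matric"] * 5 + ["Fsc"] * 7 + ["Bsc"] * 6 + ["other"] * 2,
    name="grade",
)

# Frequency and relative frequency (percent) of each category.
counts = grades.value_counts()
percents = grades.value_counts(normalize=True) * 100
print(pd.DataFrame({"count": counts, "percent": percents.round(1)}))
```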
Univariate Non-Graphical EDA
Quantitative Data
The characteristics of interest
The characteristics of the population distribution of a quantitative variable are its center, spread, modality (number of peaks in the pdf), shape (including “heaviness of the tails”), and outliers.
Univariate EDA for a quantitative variable is a way to make preliminary assessments about the population distribution of the variable using the data of the observed sample.
For quantitative variables (and possibly for ordinal variables) it is worthwhile looking at the central tendency, spread, skewness, and kurtosis of the data for a particular variable from an experiment. But for categorical variables, none of these make any sense.
Univariate non-graphical techniques
If the quantitative variable does not have too many distinct values, a tabulation, as we used for categorical data, will be a worthwhile univariate non-graphical technique.
Mostly, though, for quantitative variables we are concerned with quantitative numeric (non-graphical) measures, namely the various sample statistics. In fact, sample statistics are generally thought of as estimates of the corresponding population parameters.
We can calculate “sample statistics” from the data, such as the sample mean, sample variance, sample standard deviation, sample skewness and sample kurtosis.
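Here is a minimal sketch of computing these sample statistics in Python; the measurements are made up, and note that scipy reports excess kurtosis (0 for a Gaussian) by default.

```python
import numpy as np
from scipy import stats

x = np.array([2.1, 3.4, 3.9, 4.0, 4.2, 5.1, 5.8, 9.7])  # made-up sample

print("mean:    ", np.mean(x))
print("variance:", np.var(x, ddof=1))   # ddof=1 gives the sample (n - 1) version
print("std dev: ", np.std(x, ddof=1))
print("skewness:", stats.skew(x))
print("kurtosis:", stats.kurtosis(x))   # excess kurtosis, relative to a Gaussian
```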
Sample Distribution
The sample of measurements of a particular variable that we select for observation in our particular experiment is called the “sample distribution”.
Central Tendency
The central tendency or “location” of a distribution has to do with typical or middle values. The common, useful measures of central tendency are the statistics called the (arithmetic) mean, the median, and sometimes the mode.
Arithmetic Mean
The arithmetic mean is simply the sum of all of the data values divided by the number of values. It can be thought of as how much each subject gets in a “fair” re-division of whatever the data are measuring.
For any symmetrically shaped distribution (i.e., one with a symmetric histogram or pdf or pmf) the mean is the point around which the symmetry holds. For non-symmetric distributions, the mean is the “balance point”: the value at which a histogram of the data would balance if it were a solid object.
Median
The median is another measure of central tendency. The sample median is the middle value after all of the values are put in an ordered list. If there are an even number of values, take the average of the two middle values.
For symmetric distributions, the mean and the median coincide. For unimodal skewed (asymmetric) distributions, the mean is farther in the direction of the “pulled-out tail” of the distribution than the median is. Therefore, for many cases of skewed distributions, the median is preferred as a measure of central tendency.
The median has a very special property called robustness. A sample statistic is “robust” if moving some data tends not to change the value of the statistic. The median is highly robust, because you can move nearly all of the upper half and/or lower half of the data values any distance away from the median without changing the median. More practically, a few very high values or very low values usually have no effect on the median.
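A quick sketch of this robustness in Python (with made-up numbers): corrupting a single value with a wild outlier drags the mean far away but leaves the median untouched.

```python
import numpy as np

x = np.array([3, 4, 5, 6, 7])
y = x.copy()
y[-1] = 700  # replace the largest value with an extreme outlier

print(np.mean(x), np.median(x))  # 5.0 5.0
print(np.mean(y), np.median(y))  # 143.6 5.0 -- the median does not move
```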
Mode
A rarely used measure of central tendency is the mode, which is the most likely or frequently occurring value.
More commonly we simply use the term “mode” when describing whether a distribution has a single peak (unimodal) or two or more peaks (bimodal or multi-modal).
In symmetric, unimodal distributions, the mode equals both the mean and the median.
In unimodal, skewed distributions the mode is on the other side of the median from the mean.
In multi-modal distributions there is either no unique highest mode, or the highest mode may well be unrepresentative of the central tendency.
Spread
Spread is an indicator of how far away from the center we are still likely to find data values.
Variance
The variance is a standard measure of spread. It is (roughly) the mean of the squares of the individual deviations from the mean; for a sample, the sum of squared deviations is conventionally divided by n − 1 rather than n. The standard deviation is the square root of the variance. For Normally distributed data, approximately 95% of the values lie within 2 sd of the mean.
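The “95% within 2 sd” rule is easy to check empirically with simulated Normal draws; this sketch uses an arbitrary mean and sd (the exact coverage for ±2 sd is about 95.45%).

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=10.0, scale=3.0, size=100_000)  # simulated Normal data

# Fraction of values within 2 sample standard deviations of the sample mean.
within = np.mean(np.abs(z - z.mean()) <= 2 * z.std())
print(f"fraction within 2 sd: {within:.4f}")  # ~0.954
```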
Standard Deviation
The standard deviation is simply the square root of the variance. Therefore it has the same units as the original data, which helps make it more interpretable.
Interquartile Range (IQR)
The quartiles of a population or a sample are the three values which divide the distribution or observed data into even fourths.
So one quarter of the data fall below the first quartile, usually written Q1; one half fall below the second quartile (Q2); and three fourths fall below the third quartile (Q3).
The astute reader will realize that half of the values fall above Q2, one quarter fall above Q3, and also that Q2 is a synonym for the median.
Once the quartiles are defined, it is easy to define the IQR as IQR = Q3 − Q1. By definition, half of the values (and specifically the middle half) fall within an interval whose width equals the IQR. If the data are more spread out, then the IQR tends to increase, and vice versa.
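A minimal sketch of computing the quartiles and IQR in Python (on made-up data); note that different software uses slightly different quantile conventions, so small samples can give slightly different answers.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])  # made-up sample

q1, q2, q3 = np.percentile(x, [25, 50, 75])
print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", q3 - q1)  # width of the interval holding the middle half
```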
The IQR is a more robust measure of spread than the variance or standard deviation. Any number of values in the top or bottom quarters of the data can be moved any distance from the median without affecting the IQR at all. More practically, a few extreme outliers have little or no effect on the IQR.
Range
The range of a sample is the distance from the minimum value to the maximum value: range = maximum − minimum.
In contrast to the IQR, the range of the data is not very robust at all. If you collect repeated samples from a population, the minimum, maximum and range tend to change drastically from sample to sample, while the variance and standard deviation change less, and the IQR least of all.
The minimum and maximum of a sample may be useful for detecting outliers, especially if you know something about the possible reasonable values for your variable. They often (but certainly not always) can detect data entry errors such as typing a digit twice or transposing digits (e.g., entering 211 instead of 21, or 19 instead of 91, for data that represent the ages of senior citizens).
The IQR has one more property worth knowing: for normally distributed data only, the IQR approximately equals 4/3 times the standard deviation. This means that for Gaussian distributions, you can approximate the sd from the IQR by calculating 3/4 of the IQR.
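This constant is easy to verify: for a standard Normal the IQR is the distance between the 25th and 75th percentile points, which comes out to about 1.349, i.e. roughly 4/3 of the (unit) standard deviation.

```python
from scipy.stats import norm

# IQR of the standard Normal: Phi^{-1}(0.75) - Phi^{-1}(0.25) ~= 1.349.
iqr = norm.ppf(0.75) - norm.ppf(0.25)
print(iqr)          # ~1.3490, roughly 4/3 of the sd (which is 1 here)
print(0.75 * iqr)   # ~1.0117, the "3/4 of the IQR" approximation to the sd
```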
Skewness and Kurtosis
Two additional useful univariate descriptors are the skewness and kurtosis of a distribution. Skewness is a measure of asymmetry, and kurtosis is a more subtle measure of “peakedness” relative to a Gaussian shape.
Univariate graphical EDA
Categorical Data
While the non-graphical methods are quantitative and objective, they do not give a full picture of the data; therefore, graphical methods, which are more qualitative and involve a degree of subjective analysis, are also required.
Histograms
The only one of these techniques that makes sense for categorical data is the histogram (basically just a barplot of the tabulation of the data).
A pie chart is equivalent, but not often used. The concepts of central tendency, spread and skew have no meaning for nominal categorical data. For ordinal categorical data, it sometimes makes sense to treat the data as quantitative for EDA purposes; you need to use your judgment here.
The most basic graph is the histogram, which is a barplot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values. Typically the bars run vertically, with the count (or proportion) on the vertical axis. To manually construct a histogram, define the range of data for each bar (called a bin), count how many cases fall in each bin, and draw the bars high enough to indicate the count.
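Here is a minimal histogram sketch in Python with matplotlib; the simulated right-skewed sample and the choice of 15 bins are arbitrary, purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=2.0, size=500)  # made-up right-skewed sample

plt.hist(x, bins=15, edgecolor="black")  # each bar counts the cases in one bin
plt.xlabel("value")
plt.ylabel("count")
plt.title("Histogram")
plt.show()
```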
Histograms are one of the best ways to quickly learn a lot about your data, including central tendency, spread, modality, shape and outliers.
Stem and leaf plot
A stem and leaf plot shows all data values and the shape of the distribution.
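Stem and leaf plots are rarely built into plotting libraries, so here is a tiny hand-rolled sketch for two-digit integers; the data values are made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Print a stem and leaf plot: tens digit as stem, ones digit as leaf."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    for stem in sorted(stems):
        print(f"{stem:2d} | {' '.join(str(leaf) for leaf in stems[stem])}")

stem_and_leaf([12, 15, 21, 24, 24, 27, 31, 33, 38, 45])
#  1 | 2 5
#  2 | 1 4 4 7
#  3 | 1 3 8
#  4 | 5
```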
Boxplots
Boxplots are very good at presenting information about the central tendency, symmetry and skew, as well as outliers, although they can be misleading about aspects such as multimodality. One of the best uses of boxplots is in the form of side-by-side boxplots (see multivariate graphical analysis below).
The boxplot consists of a rectangular box bounded above and below by “hinges” that represent the quartiles Q3 and Q1 respectively, with a horizontal “median” line through it. A typical plot also shows upper and lower “whiskers”, and points marking “outliers”. The vertical axis is in the units of the quantitative variable.
Let’s assume that the subjects for this experiment are hens and the data represent the number of eggs that each hen laid during the experiment. We can read certain information directly off of the graph. The median (not mean!) is 4 eggs, so no more than half of the hens laid more than 4 eggs and no more than half of the hens laid fewer than 4 eggs. (This is based on the technical definition of median; we would usually claim that half of the hens lay more, and half fewer, than 4 eggs, knowing that this may be only approximately correct.) We can also state that one quarter of the hens lay fewer than 3 eggs and one quarter lay more than 5 eggs (again, this may not be exactly correct, particularly for small samples or a small number of different possible values). This leaves half of the hens, called the “central half”, laying between 3 and 5 eggs, so the interquartile range (IQR) is Q3 − Q1 = 5 − 3 = 2.
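For readers following along in code, here is a boxplot sketch with hypothetical egg counts chosen so the quartiles match the description above (Q1 = 3, median = 4, Q3 = 5, plus one high outlier); these are not the data from the original figure.

```python
import matplotlib.pyplot as plt

# Made-up egg counts: Q1 = 3, median = 4, Q3 = 5, and 9 plots as an outlier.
eggs = [2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 9]

plt.boxplot(eggs)
plt.ylabel("eggs laid per hen")
plt.show()
```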
Multivariate non-graphical EDA
Categorical Data
Multivariate non-graphical EDA techniques generally show the relationship between two or more variables in the form of either cross-tabulation or statistics.
Cross-tabulation
For categorical data (and quantitative data with only a few different values) an extension of tabulation called cross-tabulation is very useful.
For two variables, cross-tabulation is performed by making a two-way table with column headings that match the levels of one variable and row headings that match the levels of the other variable, then filling in the counts of all subjects that share a pair of levels. The two variables might be both explanatory, both outcome, or one of each. Depending on the goals, row percentages (which add to 100% for each row), column percentages (which add to 100% for each column) and/or cell percentages (which add to 100% over all cells) are also useful.
Table 1: Sample Data for Cross-tabulation
Table 2: Cross-tabulation of Sample Data
Here is an example of a cross-tabulation. Consider the data in Table 1. For each subject we observe sex and age as categorical variables. We can easily see that the total number of young females is 2, and we can calculate, e.g., the corresponding cell percentage as 2/11 × 100 = 18.2%, the row percentage as 2/5 × 100 = 40.0%, and the column percentage as 2/7 × 100 = 28.6%.
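As a runnable sketch, here is the same idea with pandas. The subjects below are hypothetical, constructed only so the marginal counts match the figures quoted above (2 young females, 5 young subjects, 7 females, 11 subjects in all); the middle/old split is invented.

```python
import pandas as pd

# Hypothetical subjects: sex and age group for 11 people.
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M",   # young
            "F", "F", "F", "M",        # middle
            "F", "F"],                 # old
    "age": ["young"] * 5 + ["middle"] * 4 + ["old"] * 2,
})

print(pd.crosstab(df["age"], df["sex"], margins=True))             # counts
print(pd.crosstab(df["age"], df["sex"], normalize="all") * 100)    # cell %
print(pd.crosstab(df["age"], df["sex"], normalize="index") * 100)  # row %
```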
Cross-tabulation can be extended to three (and sometimes more) variables by making separate two-way tables for two variables at each level of a third variable.
Correlation
Another statistic that can be calculated for two quantitative variables is their correlation.
The correlation between two random variables is a number that runs from -1 through 0 to +1, indicating a strong inverse relationship, no relationship, and a strong direct relationship, respectively.
For two quantitative variables, the basic statistics of interest are the sample covariance and/or sample correlation, which correspond to and are estimates of the corresponding population parameters.
Covariance
The sample covariance is a measure of how much two variables “co-vary”, i.e., how much (and in what direction) we should expect one variable to change when the other changes.
Positive covariance values suggest that when one measurement is above the mean the other will probably also be above the mean, and vice versa. Negative covariances suggest that when one variable is above its mean, the other is below its mean. And covariances near zero suggest that the two variables vary independently of each other.
Technically, independence implies zero correlation, but the reverse is not necessarily true.
Covariances tend to be hard to interpret, so we often use correlation instead. The correlation has the nice property that it is always between -1 and +1, with -1 being a “perfect” negative linear correlation, +1 being a perfect positive linear correlation, and 0 indicating that X and Y are uncorrelated. The symbol r or r_xy is often used for sample correlations.
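A minimal sketch of both statistics in Python, on made-up data; numpy’s cov and corrcoef return 2×2 matrices, so the statistic of interest is the off-diagonal entry.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

print("sample covariance: ", np.cov(x, y)[0, 1])       # n - 1 denominator by default
print("sample correlation:", np.corrcoef(x, y)[0, 1])  # r, always in [-1, +1]
```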
Multivariate graphical EDA
There are few useful techniques for graphical EDA of two categorical random variables. The only one used commonly is a grouped barplot, with each group representing one level of one of the variables and each bar within a group representing the levels of the other variable.
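Here is a grouped-barplot sketch with pandas and matplotlib, reusing the hypothetical sex-by-age counts from the cross-tabulation example above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts: rows are age levels (groups), columns are sex (bars).
counts = pd.DataFrame(
    {"F": [2, 3, 2], "M": [3, 1, 0]},
    index=["young", "middle", "old"],
)

counts.plot(kind="bar")  # one group of bars per age level, one bar per sex
plt.ylabel("count")
plt.show()
```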
In a nutshell:
You should always perform appropriate EDA before further analysis of your data. Perform whatever steps are necessary to become more familiar with your data, check for obvious mistakes, learn about variable distributions, and learn about relationships between variables. EDA is not an exact science – it is a very important art!