How to check if a distribution is normal?
Many models assume that the data follow a normal distribution, so it is
wise to check whether your distribution is normal before using it in a model.
In this post I want to explain some numerical and
visual methods you can use to check whether a distribution is normal.
1) Box-and-Whisker Plot and Histogram:
Plotting a box-and-whisker plot (box plot) and a histogram
of the distribution is a visual way to see whether the distribution
looks normal. The box plot lets us check the symmetry around the median,
and the histogram helps us visualize the overall shape of the distribution. Let’s
see it in an example:
In this example we create one sample from a normal distribution
and one sample from a non-normal distribution, and use the box plot and the
histogram to visualize them.
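The original code for this step is not available here, so the following is only a minimal sketch of how the normal sample could be created and plotted. The seed, the sample size of 500, the mean of 50 and the standard deviation of 10 are all assumptions made for illustration; only the name sample_a comes from the post.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed parameters (not from the original post): seed, size, mean, std.
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=50, scale=10, size=500)

# Box plot on the left, histogram on the right.
fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
ax_box.boxplot(sample_a)
ax_box.set_title("sample_a: box plot")
ax_hist.hist(sample_a, bins=30, edgecolor="black")
ax_hist.set_title("sample_a: histogram")
fig.tight_layout()
# plt.show()  # uncomment to display the figure when running as a script
```

For a sample like this, the median line should sit near the middle of the box and the two whiskers should have similar lengths.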
In the box plot, the bottom edge of the box, the middle line
and the top edge of the box are the quartiles, which divide the data into 4 equal
groups. The middle line inside the box is the median of the distribution. The bottom
edge of the box represents the first quartile (Q1), which means 25% of the data
points are less than Q1. The upper edge of the box is the third quartile (Q3):
75% of the data points are less than Q3. The box plot has small lines at the
ends; we call these lines whiskers. You can see how they can be calculated in the
image below; a common convention is to extend each whisker to the most extreme
data point that lies within 1.5 times the interquartile range (IQR = Q3 - Q1) of the box.
Points outside the whiskers can be considered outliers.
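To make the parts of the box plot concrete, here is a small sketch that computes them by hand. It assumes the common 1.5 × IQR rule for the whiskers (which is also the matplotlib default) and uses a hypothetical sample_a generated with assumed parameters; the post's own image may use a different rule.

```python
import numpy as np

# Hypothetical normal sample (seed and parameters are assumptions).
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=50, scale=10, size=500)

# Quartiles: Q1, median (Q2) and Q3.
q1, median, q3 = np.percentile(sample_a, [25, 50, 75])
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
lower_whisker = sample_a[sample_a >= lower_fence].min()
upper_whisker = sample_a[sample_a <= upper_fence].max()

# Anything beyond the fences is drawn as an outlier.
outliers = sample_a[(sample_a < lower_fence) | (sample_a > upper_fence)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"whiskers: {lower_whisker:.2f} to {upper_whisker:.2f}")
print(f"number of outliers: {len(outliers)}")
```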
When we look at the box plot we can see whether the data are symmetric
or not. If the data are clearly not symmetric, the distribution is not
normal. In our example you can see that sample-a is symmetric.
You can also check the histogram of the data to see whether the data are
skewed or not.
In the above histogram there is no apparent skewness in
the data.
Now let’s plot a non-normal sample.
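The post does not say which non-normal distribution was used, so this sketch assumes an exponential distribution purely for illustration. The name sample_b, the seed, the scale and the sample size are all hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumption: an exponential sample stands in for the non-normal sample.
rng = np.random.default_rng(42)
sample_b = rng.exponential(scale=10, size=500)

# Box plot on the left, histogram on the right.
fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
ax_box.boxplot(sample_b)
ax_box.set_title("sample_b: box plot")
ax_hist.hist(sample_b, bins=30, edgecolor="black")
ax_hist.set_title("sample_b: histogram")
fig.tight_layout()
# plt.show()  # uncomment to display the figure when running as a script
```

With a right-skewed sample like this one, the upper whisker is much longer than the lower one and the histogram has a long tail to the right.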
In the above plot you can clearly see that the data are not
symmetric and are skewed, which means this sample is not normal.
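As a numerical cross-check of what the plots show, the sample skewness can be computed with scipy.stats.skew. This is a supplementary sketch, not part of the original post; both samples below are hypothetical (a normal one and an assumed exponential one).

```python
import numpy as np
from scipy import stats

# Hypothetical samples (seed, sizes and parameters are assumptions).
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=50, scale=10, size=500)   # normal
sample_b = rng.exponential(scale=10, size=500)      # non-normal (assumed)

# Skewness is close to 0 for symmetric data and positive for a right tail.
skew_a = stats.skew(sample_a)
skew_b = stats.skew(sample_b)

print(f"skewness of sample_a: {skew_a:.3f}")
print(f"skewness of sample_b: {skew_b:.3f}")
```

A skewness near zero is consistent with symmetry, but on its own it does not show that the data are normal.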
2) QQ-Plot (Quantile-Quantile Plot):
A QQ-plot provides a more precise view
of our data. It plots the quantiles of our data against the quantiles of a
theoretical distribution (here, the normal distribution). In other words, a
QQ-plot lets us check whether two distributions have the same shape: if they do,
the points fall close to a straight line. Let’s see if
sample-a is normal:
In the code below we set “dist”
to ‘norm’, which means we want to compare our sample_a with the normal
distribution.
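The original code is not available here. One way to draw such a plot is scipy.stats.probplot, which accepts a dist argument and draws blue points with a red fitted line; whether the post used this exact function is an assumption. The sample below is also hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical normal sample (seed and parameters are assumptions).
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=50, scale=10, size=500)

# dist="norm" compares the sample quantiles with normal quantiles.
fig, ax = plt.subplots(figsize=(5, 5))
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    sample_a, dist="norm", plot=ax
)
ax.set_title("QQ-plot of sample_a against the normal distribution")

# r is the correlation between the points and the fitted line.
print(f"correlation with the fitted line: r = {r:.4f}")
# plt.show()  # uncomment to display the figure when running as a script
```

The closer r is to 1, the closer the points lie to the straight line.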
As you can see, the data points (blue
points) lie almost on the red line (the line expected for a normal distribution).
We can only see small differences at the two ends of the plot, which shows that our
sample is not perfectly normal but is still good enough to be
considered normally distributed. If you look at the QQ-plot of the non-normal
sample below, you will see the difference:
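The same kind of sketch can be applied to a non-normal sample. As before, the exponential distribution, the name sample_b and the use of scipy.stats.probplot are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Assumption: an exponential sample stands in for the non-normal sample.
rng = np.random.default_rng(42)
sample_b = rng.exponential(scale=10, size=500)

# Compare the sample quantiles with normal quantiles.
fig, ax = plt.subplots(figsize=(5, 5))
(theoretical_q, ordered_values), (slope, intercept, r_b) = stats.probplot(
    sample_b, dist="norm", plot=ax
)
ax.set_title("QQ-plot of sample_b against the normal distribution")

# A skewed sample bends away from the line, so r is noticeably below 1.
print(f"correlation with the fitted line: r = {r_b:.4f}")
# plt.show()  # uncomment to display the figure when running as a script
```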
As you can see, the plotted points form a
curve compared with the red line expected for a normal distribution, which
suggests the data are not normal.
3) Shapiro-Wilk Test:
The Shapiro-Wilk test is a hypothesis test
that gives a p-value. The null hypothesis is that the distribution is
normal. If the p-value is greater than the chosen significance level (alpha),
we fail to reject the null hypothesis and treat the data as normal. Otherwise,
we reject the null hypothesis and conclude that the data are not normal. We
usually set the significance level to 0.05.
Now it’s time to check the normality of the
samples using the Shapiro-Wilk test.
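A minimal sketch of the test with scipy.stats.shapiro follows. The two samples are hypothetical (a normal one and an assumed exponential one), and the 0.05 threshold is the significance level mentioned in the text.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (seed, sizes and parameters are assumptions).
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=50, scale=10, size=500)   # normal
sample_b = rng.exponential(scale=10, size=500)      # non-normal (assumed)

alpha = 0.05  # significance level

# shapiro returns the W statistic and the p-value.
stat_a, p_a = stats.shapiro(sample_a)
stat_b, p_b = stats.shapiro(sample_b)

for name, stat, p in [("sample_a", stat_a, p_a), ("sample_b", stat_b, p_b)]:
    verdict = "looks normal" if p > alpha else "does not look normal"
    print(f"{name}: W={stat:.4f}, p-value={p:.4g} -> {verdict}")
```

Keep in mind that a p-value above alpha only means the test found no evidence against normality; it does not prove that the data are normal.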
4) Kolmogorov-Smirnov Test:
The Kolmogorov-Smirnov test (K-S test) compares
the distribution of the data with any theoretical distribution. By choosing ‘norm’
(the normal distribution) as the theoretical distribution, we can test the data for normality.
In this method we need to specify the mean
and standard deviation of the theoretical distribution. We'll set them to
the mean and standard deviation of the data.
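A minimal sketch with scipy.stats.kstest follows, passing the sample mean and standard deviation through args. The samples are hypothetical, and using ddof=1 for the standard deviation is an assumption. Because the parameters are estimated from the same data, the p-value should be read as approximate.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (seed, sizes and parameters are assumptions).
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=50, scale=10, size=500)   # normal
sample_b = rng.exponential(scale=10, size=500)      # non-normal (assumed)

alpha = 0.05  # significance level


def ks_normal(sample):
    """K-S test against a normal with the sample's own mean and std."""
    mean = sample.mean()
    std = sample.std(ddof=1)  # assumption: sample standard deviation
    return stats.kstest(sample, "norm", args=(mean, std))


stat_a, p_a = ks_normal(sample_a)
stat_b, p_b = ks_normal(sample_b)

for name, stat, p in [("sample_a", stat_a, p_a), ("sample_b", stat_b, p_b)]:
    verdict = "looks normal" if p > alpha else "does not look normal"
    print(f"{name}: D={stat:.4f}, p-value={p:.4g} -> {verdict}")
```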