How to check if a distribution is normal?

Many models assume that the distribution is normal. It is a wise idea to check whether your distribution is normal before using it in a model.

In this post I want to explain some numerical and visual methods you can use to check whether a distribution is normal.

1)    Box-and-Whisker Plot and Histogram:

Plotting a box-and-whisker plot (boxplot) and a histogram of the data is a visual way to see whether the distribution looks normal. The boxplot lets us check the symmetry of the data around the median, and the histogram helps us visualize the overall shape of the distribution. Let's see this in an example:

In this example we create one normally distributed sample and one non-normally distributed sample, then use a boxplot and a histogram to visualize them.

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

# Sample A: Normal distribution
sample_a = stats.norm.rvs(loc=0.0, scale=1.0, size=(1000,))

# Sample B: Non-normal distribution
sample_b = stats.lognorm.rvs(s=0.5, loc=0.0, scale=1.0, size=(1000,))

 

# Plotting the normal distribution sample
fig, axes = plt.subplots(2, 1, figsize=(16, 9), sharex=True)
axes[0].boxplot(sample_a, vert=False)
axes[1].hist(sample_a, bins=50)
axes[0].set_title("Boxplot of a Normal Distribution");

 


 

In the boxplot, the bottom edge of the box, the middle line, and the top edge of the box are the quartiles that divide the data into four equal groups. The middle line inside the box is the median of the distribution. The bottom edge of the box is the first quartile (Q1), meaning 25% of the data points are less than Q1. The upper edge of the box is the third quartile (Q3), meaning 75% of the data points are less than Q3. The boxplot also has short lines at each end called whiskers; by default they extend to the most extreme points within 1.5 times the interquartile range (IQR = Q3 - Q1) beyond the box edges. Points outside the whiskers can be considered outliers.
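To make the whisker rule concrete, here is a minimal numeric sketch for sample_a; the variable names (q1, q3, iqr, lower_whisker, upper_whisker) are only illustrative:

q1, median, q3 = np.percentile(sample_a, [25, 50, 75])
iqr = q3 - q1  # interquartile range
# Tukey's rule (matplotlib's default): whiskers reach at most 1.5 * IQR beyond the box
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr
# Points outside the whisker bounds are flagged as potential outliers
outliers = sample_a[(sample_a < lower_whisker) | (sample_a > upper_whisker)]
print("Q1: {:.3f}, median: {:.3f}, Q3: {:.3f}".format(q1, median, q3))
print("Points outside the whiskers: {}".format(len(outliers)))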



When we look at the boxplot we can see whether the data are symmetric. If the data are not symmetric, the distribution is not normal. In our example you can see that sample_a is symmetric.

You can also check the histogram to see whether the data are skewed.

In the above histogram there is no apparent skewness in the data.
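Skewness can also be checked numerically: a roughly normal sample should have skewness close to zero. A minimal sketch using scipy.stats.skew (treating values near zero as "symmetric enough" is only a rule of thumb):

# Skewness close to 0 suggests a roughly symmetric distribution
print("Skewness of sample_a: {:.3f}".format(stats.skew(sample_a)))
print("Skewness of sample_b: {:.3f}".format(stats.skew(sample_b)))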

Now let's plot a non-normal sample.

# Plotting the non-normal distribution sample
fig, axes = plt.subplots(2, 1, figsize=(16, 9), sharex=True)
axes[0].boxplot(sample_b, vert=False)
axes[1].hist(sample_b, bins=50)
axes[0].set_title("Boxplot of a Lognormal Distribution");

 


In the above plot you can clearly see that the data are not symmetric and are skewed, so this sample is not normal.

2)    QQ-Plot (Quantile-Quantile Plot):

A QQ-plot gives a more precise picture of our data. It compares the quantiles of our data with the quantiles of a theoretical distribution (here, the normal distribution). In other words, a QQ-plot lets us check whether two distributions have the same shape. Let's see if sample_a is normal:

In the code below we set "dist" to 'norm', which means we want to compare sample_a with the normal distribution.

# Q-Q plot of normally-distributed sample
plt.figure(figsize=(10, 10)); plt.axis('equal')
stats.probplot(sample_a, dist='norm', plot=plt);

 


As you can see, the data points (blue) lie almost exactly on the red line (the theoretical normal quantiles). There are only small deviations at the two ends of the plot, which show that our sample is not perfectly normal but is still close enough to be considered normally distributed. The QQ-plot of the non-normal sample below looks quite different:


# Q-Q plot of non-normally-distributed sample
plt.figure(figsize=(10, 10)); plt.axis('equal')
stats.probplot(sample_b, dist='norm', plot=plt);

 


As you can see, the plotted points form a curve rather than following the straight red line, which suggests the data are not normal.
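The visual impression can also be backed with a number. With fit=True (the default), stats.probplot additionally returns the slope, intercept and correlation coefficient r of the least-squares line fitted through the QQ-plot points; an r very close to 1 corresponds to a nearly straight plot. A minimal sketch:

# probplot returns ((theoretical quantiles, ordered sample values),
#                   (slope, intercept, r)) when fit=True
(osm_a, osr_a), (slope_a, intercept_a, r_a) = stats.probplot(sample_a, dist='norm')
(osm_b, osr_b), (slope_b, intercept_b, r_b) = stats.probplot(sample_b, dist='norm')
print("QQ-plot correlation for sample_a: {:.4f}".format(r_a))
print("QQ-plot correlation for sample_b: {:.4f}".format(r_b))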

3)    Shapiro-Wilk Test:

The Shapiro-Wilk test is a hypothesis test that returns a p-value. The null hypothesis is that the data are normally distributed. If the p-value is greater than the chosen significance level, we assume the data are normal; otherwise, we assume they are not. The significance level is usually set to 0.05.

 


 

Now it's time to check the normality of the samples using the Shapiro-Wilk test.


def is_normal(sample, test=stats.shapiro, p_level=0.05, **kwargs):
    """Apply a normality test to check if sample is normally distributed."""
    t_stat, p_value = test(sample, **kwargs)
    print("Test statistic: {}, p-value: {}".format(t_stat, p_value))
    print("Is the distribution Likely Normal? {}".format(p_value > p_level))
    return p_value > p_level

# Using the Shapiro-Wilk test (default)
print("Sample A:-"); is_normal(sample_a);
print("Sample B:-"); is_normal(sample_b);

 

As you can see from the results, the p-value for sample_a is much greater than 0.05, so we assume it is normally distributed, while sample_b has a p-value much less than 0.05, which suggests it is not normally distributed.

4)    Kolmogorov-Smirnov Test:

The Kolmogorov-Smirnov test (K-S test) compares the data distribution with any theoretical distribution. By choosing 'norm' (the normal distribution) as the theoretical distribution, we can test for normality.

In this method we need to specify the mean and standard deviation of the theoretical distribution. We set them to the mean and standard deviation of the data.

 


def is_normal_ks(sample, test=stats.kstest, p_level=0.05):
    """
    p_level: if the test returns a p-value > p_level, assume normality
    """
    normal_args = (np.mean(sample), np.std(sample))

    t_stat, p_value = test(sample, 'norm', normal_args)
    print("Test statistic: {}, p-value: {}".format(t_stat, p_value))
    print("Is the distribution Likely Normal? {}".format(p_value > p_level))
    return p_value > p_level

print("Sample A:-"); is_normal_ks(sample_a);
print("Sample B:-"); is_normal_ks(sample_b);

