Understanding Data Variability: MAD, Variance, and MedAD in Python

Published: at 12:53 AM

Understanding the variability or spread of data is a critical part of exploratory data analysis. Measuring the variability provides insights into the distribution and spread of data points in a dataset. This allows us to understand the clustering or dispersion of values, detect outliers, and guide modeling or analysis approaches.

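In this comprehensive guide, we will examine three key statistical measures of variability in Python: the mean absolute deviation (MAD), the variance and standard deviation, and the median absolute deviation (MedAD).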

We will look at how these measures work, their mathematical calculation, code examples in Python, pros and cons of each method, and recommendations for usage. Real-world case studies are provided to illustrate practical applications.

Introduction

In data analysis, we often summarize datasets using measures of central tendency like the mean or median. However, these only provide information about the middle or typical value. To get a complete picture, we also need to examine the variability or spread of the data.

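Measuring variability matters because it shows how tightly values cluster or how widely they disperse, helps detect outliers, and guides the choice of modeling or analysis approach.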

Python provides easy access to statistical functions and packages to calculate different variability measures. Let’s examine some of the most widely used techniques.

Mean Absolute Deviation (MAD)

The mean absolute deviation (MAD) calculates the average absolute difference between each data point and the mean. It measures the variability relative to the mean.

How MAD is calculated

For a dataset X with N values, the MAD is calculated as:

mean = sum(X) / N

deviations = [abs(x - mean) for x in X]

MAD = sum(deviations) / N

Here, we:

  1. Calculate the mean of all data points
  2. Find the absolute deviation of each point from the mean
  3. Take the average of the absolute deviations

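The steps can be summarized as a small Python function:

def mean_absolute_deviation(X):
    # Step 1: mean of all data points
    mean = sum(X) / len(X)

    # Step 2: absolute deviation of each point from the mean
    deviations = [abs(x - mean) for x in X]

    # Step 3: average of the absolute deviations
    return sum(deviations) / len(deviations)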

MAD Example in Python

Let’s look at example code that calculates MAD on a dataset in Python:

import numpy as np

data = [1, 4, 5, 7, 10, 14]

mean = np.mean(data)
print("Mean:", mean)

deviations = []
for x in data:
    deviations.append(abs(x - mean))

MAD = np.mean(deviations)
print("MAD:", MAD)

This prints:

Mean: 6.833333333333333
MAD: 3.5000000000000004

We can also write this more concisely using list comprehensions:

import numpy as np

data = [1, 4, 5, 7, 10, 14]

mean = np.mean(data)

MAD = np.mean([abs(x - mean) for x in data])

print("MAD:", MAD)
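For larger datasets, the same calculation can be vectorized by converting the list to a NumPy array. The following is a minimal sketch; the function name `mean_abs_dev` is an illustrative choice, not a NumPy API.

```python
import numpy as np

def mean_abs_dev(values):
    """Mean absolute deviation around the mean (illustrative helper)."""
    arr = np.asarray(values, dtype=float)
    # Subtracting the scalar mean broadcasts across the whole array
    return np.mean(np.abs(arr - arr.mean()))

data = [1, 4, 5, 7, 10, 14]
print("MAD:", mean_abs_dev(data))
```

Working on an array avoids the Python-level loop, which tends to matter once the dataset grows beyond a few thousand values.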

Advantages of MAD

Some key advantages of using MAD include:

Limitations of MAD

Some limitations to note about MAD:

Variance and Standard Deviation

The variance and standard deviation are widely used in statistics to measure variability. Unlike MAD, they square the deviations from the mean, making them more sensitive to larger differences.

Calculating Variance

The (population) variance is the average of the squared differences from the mean.

The formula is:

mean = sum(X) / N

squared_diffs = [(x - mean)**2 for x in X]

variance = sum(squared_diffs) / N

For a sample of data, the sample variance formula divides by N-1 rather than N:

sample_variance = sum(squared_diffs) / (N - 1)

This provides an unbiased estimate of the population variance.

We can translate the population formula to Python code:

import numpy as np

data = [1, 4, 5, 7, 10, 14]

mean = np.mean(data)

squared_diffs = [(x - mean)**2 for x in data]
variance = np.mean(squared_diffs)

print("Variance:", variance)

This returns:

Variance: 17.805555555555554

Squaring the deviations makes large differences from the mean more pronounced.
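NumPy exposes both formulas through `np.var`: the default `ddof=0` divides by N, while `ddof=1` divides by N - 1. A brief sketch on the same data:

```python
import numpy as np

data = [1, 4, 5, 7, 10, 14]

# Population variance: divides by N (NumPy's default, ddof=0)
population_variance = np.var(data)

# Sample variance: divides by N - 1
sample_variance = np.var(data, ddof=1)

print("Population variance:", population_variance)
print("Sample variance:", sample_variance)
```

On this data the sample variance (about 21.37) is larger than the population variance (about 17.81) because the divisor is smaller.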

Standard Deviation

While variance measures the average squared differences, standard deviation is the square root of the variance.

So the standard deviation formula is:

std_dev = sqrt(variance)

And the sample standard deviation uses the sample variance:

sample_std_dev = sqrt(sample_variance)

Here is an example:

import numpy as np

data = [1, 4, 5, 7, 10, 14]

mean = np.mean(data)
squared_diffs = [(x - mean)**2 for x in data]
variance = np.mean(squared_diffs)

std_dev = np.sqrt(variance)

print("Standard Deviation:", std_dev)

This calculates the standard deviation as 4.2196629670573875.

Taking the square root puts the standard deviation back on the same scale as the original data, which makes it easier to interpret. Variance is expressed in squared units, so it is not on the original scale.
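The standard library's `statistics` module offers the same pair of measures as `pstdev` (population) and `stdev` (sample), and `np.std` accepts the same `ddof` argument as `np.var`. The sketch below checks that the two libraries agree on the example data.

```python
import statistics
import numpy as np

data = [1, 4, 5, 7, 10, 14]

# Population standard deviation (divides by N)
population_std = np.std(data)
print("np.std:", population_std)
print("statistics.pstdev:", statistics.pstdev(data))

# Sample standard deviation (divides by N - 1)
sample_std = np.std(data, ddof=1)
print("np.std, ddof=1:", sample_std)
print("statistics.stdev:", statistics.stdev(data))
```

Being explicit about `ddof` helps avoid mixing the two definitions when results from different libraries are compared.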

Advantages of Standard Deviation

Key benefits of using standard deviation:

Limitations of Standard Deviation

Some limitations to note:

Median Absolute Deviation (MedAD)

The median absolute deviation (MedAD) uses the median instead of the mean, both as the center from which deviations are measured and as the summary of those absolute deviations. This makes it a robust measure of variability.

Calculating MedAD

The MedAD is calculated as:

median = median(X)
# E.g. NumPy: median = np.median(X)

deviations = [abs(x - median) for x in X]

MedAD = median(deviations)

The steps involved are:

  1. Calculate the median of the dataset
  2. Find the absolute deviation of each point from the median
  3. Take the median of the absolute deviations

Here is sample Python code:

import numpy as np

data = [1, 4, 5, 7, 10, 14]

median = np.median(data)

deviations = [abs(x - median) for x in data]
MedAD = np.median(deviations)

print("Median Absolute Deviation:", MedAD)

This returns 3.0 as the MedAD.
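MedAD is often multiplied by a constant of about 1.4826 so that, for normally distributed data, it estimates the standard deviation. This scaling is a common convention rather than something the calculation above requires. The sketch below shows both the raw and scaled values; the name `NORMAL_SCALE` is chosen here for illustration. SciPy users can obtain the raw value from `scipy.stats.median_abs_deviation`.

```python
import numpy as np

# Approximately 1 / 0.6745; makes MedAD comparable to the standard
# deviation when the data are normally distributed (an assumption).
NORMAL_SCALE = 1.4826

data = np.array([1, 4, 5, 7, 10, 14], dtype=float)

median = np.median(data)
medad = np.median(np.abs(data - median))
scaled_medad = NORMAL_SCALE * medad

print("Raw MedAD:", medad)
print("Scaled MedAD:", scaled_medad)
```

The scaled value is only meaningful as a standard deviation estimate if the bulk of the data is roughly normal.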

MedAD as Robust Measure

Since the median is less affected by outliers than the mean, MedAD provides a robust measure of variability that is largely unaffected by a small number of anomalies. This makes it more reliable than the standard deviation when the data contain outliers or come from heavy-tailed distributions.
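To see this robustness concretely, the sketch below adds one extreme value (100, a made-up point used only for illustration) to the example data and recomputes both measures.

```python
import numpy as np

def medad(values):
    """Median absolute deviation around the median."""
    arr = np.asarray(values, dtype=float)
    return np.median(np.abs(arr - np.median(arr)))

clean = [1, 4, 5, 7, 10, 14]
contaminated = clean + [100]  # hypothetical outlier added for illustration

for label, values in [("clean", clean), ("with outlier", contaminated)]:
    std = float(np.std(values))
    print(f"{label}: std={std:.2f}, MedAD={medad(values):.2f}")
```

The standard deviation rises from about 4.22 to about 32.8, while MedAD stays at 3.0.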

MedAD is a helpful descriptive statistic and is useful for detecting outliers: points that lie many MedADs away from the median stand out as unusual.
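One common way to use MedAD for outlier detection is the modified z-score, which rescales each point's distance from the median by the MedAD. The constant 0.6745 and the cutoff of 3.5 are widely used conventions, not values taken from this guide, and the value 100 is a made-up point added for illustration.

```python
import numpy as np

def modified_z_scores(values):
    """Modified z-scores based on the median and MedAD (illustrative helper)."""
    arr = np.asarray(values, dtype=float)
    median = np.median(arr)
    medad = np.median(np.abs(arr - median))
    if medad == 0:
        raise ValueError("MedAD is zero; modified z-scores are undefined")
    return 0.6745 * (arr - median) / medad

data = [1, 4, 5, 7, 10, 14, 100]  # 100 is a hypothetical outlier
scores = modified_z_scores(data)

# Flag points whose modified z-score exceeds the conventional cutoff
outliers = [x for x, z in zip(data, scores) if abs(z) > 3.5]
print("Outliers:", outliers)
```

Here only the value 100 is flagged. The zero check guards against datasets where more than half the values are identical, in which case MedAD is 0 and the score cannot be computed.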

Limitations of MedAD

MedAD has some limitations to consider:

Comparing Variability Measures

Here is a summary comparing the key variability measures discussed:

| Measure | Basis | Calculation | Robustness | Interpretability |
| --- | --- | --- | --- | --- |
| MAD | Mean | Absolute deviations from mean | Moderate | Intuitive units |
| Standard Deviation | Mean | Square root of squared deviations from mean | Low | Natural scale but not intuitive units |
| MedAD | Median | Median of absolute deviations from median | High | Harder to interpret conceptually |

In general:

Recommendations for Usage

Based on their properties, here are some recommendations on when to use each measure:

In practice:

Real-World Examples and Case Studies

Let’s look at some real-world examples applying these measures of variability on different datasets.

Case Study 1: Evaluating Experiment Results

In a research study focused on gauging human reaction times to different stimuli, a group of researchers collected a dataset of reaction times, measured in milliseconds. The dataset comprises the following values:

[160, 301, 186, 213, 137, 211, 152, 191]

The primary objective of this case study is to evaluate the variability present in the observed reaction times using several statistical measures. Specifically, we will be calculating the Standard Deviation, Mean Absolute Deviation (MAD), Median Absolute Deviation (MedAD), and identifying any potential outliers within the dataset.

Standard Deviation:

The standard deviation is a statistical metric that quantifies the amount of variation or dispersion in a dataset. In simpler terms, it tells us how spread out the values are from the mean (average). A higher standard deviation indicates greater variability.

Mean Absolute Deviation (MAD):

MAD, a measure of statistical dispersion, provides insights into the average magnitude of the deviations (differences) between each data point and the mean. It is calculated by finding the average absolute difference between each data point and the mean.

Median Absolute Deviation (MedAD):

Similar to MAD, MedAD is based on absolute deviations, but it measures them from the median rather than the mean and then takes the median of those deviations rather than their average. The median is the middle value in a sorted list of numbers, which makes it resistant to extreme values or outliers.

Outliers:

Outliers are values in a dataset that significantly deviate from the rest of the data points. They can potentially distort statistical analyses and should be identified and assessed. One common method for identifying outliers is the Interquartile Range (IQR) method. It involves calculating the first quartile (Q1) and third quartile (Q3), taking their difference (IQR = Q3 - Q1), and flagging values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR as outliers.

Now, let’s proceed with the Python code to perform these calculations:

# Import necessary libraries
import numpy as np

# Define the reaction times dataset
reaction_times = [160, 301, 186, 213, 137, 211, 152, 191]

# Calculate the population standard deviation (np.std uses ddof=0 by default)
std_dev = np.std(reaction_times)

# Calculate the Mean Absolute Deviation (MAD)
mean = np.mean(reaction_times)
mad = np.mean(np.abs(np.array(reaction_times) - mean))

# Calculate the Median Absolute Deviation (MedAD)
med = np.median(reaction_times)
medad = np.median(np.abs(np.array(reaction_times) - med))

# Identify outliers using the IQR method
q1 = np.percentile(reaction_times, 25)
q3 = np.percentile(reaction_times, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [x for x in reaction_times if x < lower_bound or x > upper_bound]

# Print the results
print("Reaction Times:", reaction_times)
print("Standard Deviation:", std_dev)
print("Mean Absolute Deviation (MAD):", mad)
print("Median Absolute Deviation (MedAD):", medad)
print("Outliers:", outliers)

This prints:

Reaction Times: [160, 301, 186, 213, 137, 211, 152, 191]
Standard Deviation: 47.88120064284103
Mean Absolute Deviation (MAD): 35.84375
Median Absolute Deviation (MedAD): 26.5
Outliers: [301]

This code snippet helps us better understand the variability in the reaction times dataset. It calculates the standard deviation, MAD, and MedAD to measure dispersion and identifies any potential outliers using the IQR method. This analysis is crucial for assessing the reliability and significance of experimental results.

Analysis of Reaction Time Data:

The dataset containing reaction times, measured in milliseconds, was subjected to statistical analysis to understand the variability and potential outliers. Here are the key findings:

  1. Standard Deviation: The standard deviation, a measure of how spread out the data points are from the mean, is approximately 47.88 milliseconds. This value suggests that there is a moderate level of variability in the reaction times within the dataset.

  2. Mean Absolute Deviation (MAD): The Mean Absolute Deviation, which provides an average measure of how much each data point deviates from the mean, is calculated to be approximately 35.84 milliseconds. This indicates that, on average, individual reaction times deviate by around 35.84 milliseconds from the mean reaction time.

  3. Median Absolute Deviation (MedAD): The Median Absolute Deviation, which employs the median as a reference point to measure deviation, is 26.5 milliseconds. This metric is less sensitive to extreme values than the MAD. It means that half of the reaction times lie within 26.5 milliseconds of the dataset’s median value.

  4. Outliers: One outlier, with a reaction time of 301 milliseconds, was identified within the dataset. Outliers are data points that significantly deviate from the rest of the data. In this case, the outlier represents an unusually long reaction time compared to the others.

The statistical analysis reveals that the dataset exhibits a moderate level of variability in reaction times, with one notable outlier. Researchers should carefully consider the impact of this outlier when interpreting the experimental results, as it could potentially skew the overall findings. Further investigation or data validation may be necessary to ensure the accuracy and reliability of the results.
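One way to gauge the outlier's impact is to recompute each measure with and without it. The sketch below does this for illustration only; whether the 301 ms observation should actually be excluded is a judgment for the researchers.

```python
import numpy as np

reaction_times = np.array([160, 301, 186, 213, 137, 211, 152, 191], dtype=float)
without_outlier = reaction_times[reaction_times != 301]

def medad(arr):
    """Median absolute deviation around the median."""
    return np.median(np.abs(arr - np.median(arr)))

for label, arr in [("All data", reaction_times), ("Without 301", without_outlier)]:
    print(f"{label}: std={np.std(arr):.2f}, MedAD={medad(arr):.2f}")
```

Dropping the single outlier cuts the standard deviation from about 47.9 to roughly 27.3, while MedAD barely moves (26.5 to 26.0), which illustrates why the robust measure is the safer summary here.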

Case Study 2: Analyzing Housing Prices

In the realm of real estate analysis, understanding the variability in housing prices is of paramount importance. To shed light on this matter, a real estate analyst has gathered data on recent housing prices, all measured in thousands of US dollars (USD):

[225, 350, 319, 255, 116, 412, 178, 435]

The objective of this case study is to quantify the variability within this dataset using statistical measures. Specifically, we will compute the Standard Deviation, Mean Absolute Deviation (MAD), Median Absolute Deviation (MedAD), and identify any potential outliers.

Now, let’s proceed with the Python code to conduct these calculations:

# Import necessary libraries
import numpy as np

# Define the housing prices dataset
housing_prices = [225, 350, 319, 255, 116, 412, 178, 435]

# Calculate the population standard deviation (np.std uses ddof=0 by default)
std_dev = np.std(housing_prices)

# Calculate the Mean Absolute Deviation (MAD)
mean = np.mean(housing_prices)
mad = np.mean(np.abs(np.array(housing_prices) - mean))

# Calculate the Median Absolute Deviation (MedAD)
med = np.median(housing_prices)
medad = np.median(np.abs(np.array(housing_prices) - med))

# Identify outliers using the IQR method
q1 = np.percentile(housing_prices, 25)
q3 = np.percentile(housing_prices, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [x for x in housing_prices if x < lower_bound or x > upper_bound]

# Print the results
print("Housing Prices (in thousands of USD):", housing_prices)
print("Standard Deviation:", std_dev)
print("Mean Absolute Deviation (MAD):", mad)
print("Median Absolute Deviation (MedAD):", medad)
print("Outliers:", outliers)

This prints:

Housing Prices (in thousands of USD): [225, 350, 319, 255, 116, 412, 178, 435]
Standard Deviation: 105.18287645810035
Mean Absolute Deviation (MAD): 92.75
Median Absolute Deviation (MedAD): 86.0
Outliers: []

This code facilitates a comprehensive understanding of the variability in housing prices. It computes the standard deviation, MAD, and MedAD to quantify the dispersion and identifies any potential outliers, which are vital insights for real estate market analysis. Researchers and analysts can utilize this information to make informed decisions and gain a deeper understanding of market dynamics.

Analysis of Housing Price Variability:

The dataset comprising recent housing prices, measured in thousands of US dollars (USD), has undergone thorough statistical analysis to unveil the market’s price variability. Here are the key findings:

  1. Standard Deviation: The standard deviation, a metric that signifies the extent of price spread or dispersion, is approximately 105.18 thousand USD. This value suggests a notable degree of variability in housing prices within the dataset. It implies that prices are relatively spread out from the average.

  2. Mean Absolute Deviation (MAD): The Mean Absolute Deviation, a measure of the average magnitude of price deviations from the mean, calculates to approximately 92.75 thousand USD. On average, individual housing prices deviate by approximately 92.75 thousand USD from the dataset’s mean price.

  3. Median Absolute Deviation (MedAD): The Median Absolute Deviation, which considers the median price as a reference point for deviation, is 86.0 thousand USD. This metric is less sensitive to extreme price values than MAD. It means that half of the housing prices lie within 86.0 thousand USD of the dataset’s median price.

  4. Outliers: No outliers are identified within this dataset by the IQR method. Outliers are values that significantly deviate from the typical range of data points. Their absence means that no single price is extreme relative to the rest, although the prices are still widely spread.

The statistical analysis reveals notable variability in housing prices but no extreme outliers. With only eight prices this is a limited picture of the market, but it still gives real estate professionals and investors a useful baseline when assessing market conditions and making informed decisions.
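Because these eight prices are a sample of the market rather than every sale, the sample standard deviation (dividing by N - 1) may be the more appropriate estimate. `np.std` uses the population formula unless `ddof=1` is passed, as this short sketch shows:

```python
import numpy as np

housing_prices = [225, 350, 319, 255, 116, 412, 178, 435]

population_std = np.std(housing_prices)       # divides by N (default)
sample_std = np.std(housing_prices, ddof=1)   # divides by N - 1

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```

With only eight observations the difference is noticeable: about 105.2 for the population formula versus about 112.4 for the sample formula.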

Case Study 3: Evaluating Algorithm Performance

In the realm of data analysis and machine learning, assessing the performance of algorithms is crucial for making informed decisions and improving models. In this case study, an algorithm’s performance was tested on a dataset, and accuracy scores were recorded:

[0.81, 0.73, 0.90, 0.93, 0.55, 0.97, 0.79, 0.63]

The primary objective here is to evaluate the consistency and variability in these accuracy scores, which can provide valuable insights into the algorithm’s reliability.

Now, let’s proceed with the Python code to perform these calculations:

# Import necessary libraries
import numpy as np

# Define the accuracy scores dataset
accuracy_scores = [0.81, 0.73, 0.90, 0.93, 0.55, 0.97, 0.79, 0.63]

# Calculate the population standard deviation (np.std uses ddof=0 by default)
std_dev = np.std(accuracy_scores)

# Calculate the Mean Absolute Deviation (MAD)
mean = np.mean(accuracy_scores)
mad = np.mean(np.abs(np.array(accuracy_scores) - mean))

# Calculate the Median Absolute Deviation (MedAD)
med = np.median(accuracy_scores)
medad = np.median(np.abs(np.array(accuracy_scores) - med))

# Identify outliers using the IQR method
q1 = np.percentile(accuracy_scores, 25)
q3 = np.percentile(accuracy_scores, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [x for x in accuracy_scores if x < lower_bound or x > upper_bound]

# Print the results
print("Accuracy Scores:", accuracy_scores)
print("Standard Deviation:", std_dev)
print("Mean Absolute Deviation (MAD):", mad)
print("Median Absolute Deviation (MedAD):", medad)
print("Outliers:", outliers)

This prints:

Accuracy Scores: [0.81, 0.73, 0.9, 0.93, 0.55, 0.97, 0.79, 0.63]
Standard Deviation: 0.13751704439813997
Mean Absolute Deviation (MAD): 0.1140625
Median Absolute Deviation (MedAD): 0.11499999999999999
Outliers: []

This code facilitates a comprehensive evaluation of the algorithm’s performance by quantifying the variability in accuracy scores. It calculates the standard deviation, MAD, and MedAD to measure the dispersion and identifies any potential outliers using the IQR method. These insights can be invaluable for assessing the algorithm’s reliability and identifying areas for improvement.

Analysis of Algorithm Performance:

The dataset contains accuracy scores representing the performance of an algorithm on a given task, ranging from 0.55 to 0.97:

  1. Standard Deviation: The standard deviation, a measure of the variability in accuracy scores, is approximately 0.138. Relative to a mean accuracy of about 0.79, this indicates moderate variability: the scores are not tightly clustered around the mean.

  2. Mean Absolute Deviation (MAD): The Mean Absolute Deviation, measuring the average magnitude of deviations from the mean accuracy, is approximately 0.114. On average, individual accuracy scores deviate by about 0.114 from the mean accuracy.

  3. Median Absolute Deviation (MedAD): The Median Absolute Deviation, which employs the median accuracy as a reference point for deviation, is approximately 0.115, so half of the scores lie within 0.115 of the median accuracy of 0.80. The close agreement with MAD suggests that no single extreme score is driving the spread.

  4. Outliers: No outliers are identified within this dataset by the IQR method. The lowest score, 0.55, is noticeably below the rest but still falls inside the IQR bounds.

The statistical analysis reveals moderate variability in the algorithm’s accuracy, with scores ranging from 0.55 to 0.97 and no outliers under the IQR rule. Whether this level of consistency is acceptable depends on the application; where consistency is essential, such as machine learning model evaluations or quality control processes, a spread of this size may warrant further investigation.
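The three case studies repeat the same calculations, so the steps can be wrapped in one reusable function. This is a sketch under the same assumptions as the case-study code (population standard deviation, 1.5 × IQR rule); the names `iqr_outliers` and `summarize` are illustrative choices.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

def summarize(values):
    """Compute the measures used in the case studies for one dataset."""
    arr = np.asarray(values, dtype=float)
    return {
        "std": float(np.std(arr)),
        "mad": float(np.mean(np.abs(arr - arr.mean()))),
        "medad": float(np.median(np.abs(arr - np.median(arr)))),
        "outliers": iqr_outliers(values),
    }

datasets = {
    "reaction_times": [160, 301, 186, 213, 137, 211, 152, 191],
    "housing_prices": [225, 350, 319, 255, 116, 412, 178, 435],
    "accuracy_scores": [0.81, 0.73, 0.90, 0.93, 0.55, 0.97, 0.79, 0.63],
}

for name, values in datasets.items():
    print(name, summarize(values))
```

Keeping the calculations in one place makes it harder for the three analyses to drift apart, for example by using different `ddof` settings or outlier thresholds.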

Practical Applications

Measuring data variability has many practical applications, including:

Conclusion

Measuring and analyzing data variability provides critical insights that summary statistics alone cannot reveal. By calculating the mean absolute deviation, variance, standard deviation, and median absolute deviation, we can better understand the dispersion, distribution, outliers, patterns, and changes in data.

Python provides easy access to statistical functions and packages like NumPy, SciPy, pandas, scikit-learn, statsmodels, and Matplotlib to apply these variability measures on datasets. Calculating multiple measures provides a more complete picture than relying solely on standard deviation. The aim is to select the most suitable technique based on factors like data characteristics, presence of anomalies, and intended usage.

With this guide, you should have a solid grasp of calculating and interpreting key variability measures in Python. The examples, code snippets, and case studies hopefully illustrate how to apply these techniques to gain deeper insights from your data.