Outliers in Statistics

ishangottam

In statistics, outliers are values that differ considerably from the bulk of the data set. As such, they can skew statistical calculations, especially the mean and the standard deviation (the median is much less sensitive to them). Values that are considered outliers are therefore sometimes removed from data sets before any calculations are done. In this article, we’ll go over how to identify outliers in your data sets and what to do with them.


Examples of outliers

Take the following data set: 4, 6, 5, 3, 7, 31. In this data set, 31 is an outlier because it’s extremely high compared to the rest of the data.

Let’s take another example, this time in the context of a person’s weekly paycheck: $220, $245, $20, $230. The average of this data set is about $179. But this isn’t representative, because this person usually earns in the low to mid $200s every week. The $20 paycheck drags the average down, so it looks like the person makes less than they typically do. If we discard the $20 data point as an outlier and recalculate the average, we get about $232, which is much more representative of this person’s typical weekly pay.
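As a quick check of the paycheck arithmetic, here is a minimal Python sketch that computes the average with and without the $20 paycheck. The variable names are illustrative and not part of the original example.

```python
from statistics import mean

# Weekly paychecks from the example, in dollars
paychecks = [220, 245, 20, 230]

# Average of all four paychecks, including the $20 outlier
mean_all = mean(paychecks)

# Average after discarding the $20 paycheck
typical = [p for p in paychecks if p != 20]
mean_without_outlier = mean(typical)

print(f"With the $20 paycheck:    ${mean_all:.2f}")
print(f"Without the $20 paycheck: ${mean_without_outlier:.2f}")
```

A single low value pulls the average down by more than $50, which is why the mean is so sensitive to outliers in a small data set.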

Finding outliers using the interquartile range

To use this method, you’ll need a good understanding of how to find the median of a data set. As a review, once the data is sorted from lowest to highest, the median is the middle value of a data set with an odd number of data points, or the average of the middle two values in a data set with an even number of data points.

Find the median

Let’s use the following data set as our example set: 47, 50, 52, 53, 54, 56, 57, 60, 72

Since the data set has an odd number of data points, the median is the middle value, which in this case is 54.
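The median rule can be sketched in Python with the standard library’s statistics.median, which sorts the data and handles both the odd and even cases. The second list reuses the data set from the first example in this article to show the even case.

```python
from statistics import median

# Odd number of data points: the median is the middle value
data = [47, 50, 52, 53, 54, 56, 57, 60, 72]
median_odd = median(data)

# Even number of data points: the median is the average of the
# two middle values once the data is sorted (3, 4, 5, 6, 7, 31)
even_data = [4, 6, 5, 3, 7, 31]
median_even = median(even_data)

print(median_odd)
print(median_even)
```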

Find the quartiles and interquartile range

Now that we have the median, we can split the data into the values below and above it and find the median of each half. Since our median is 54, we take every number below 54 (47, 50, 52, 53) and find the median of those data points, which in this case is 51. This is the first quartile (Q1). We now do the same thing for the upper part of the data (56, 57, 60, 72) and get 58.5, the third quartile (Q3). Finally, we calculate the interquartile range (IQR) by subtracting the first quartile from the third quartile: 58.5 – 51 = 7.5.
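The quartile steps can be sketched in Python as follows. The helper name quartiles_median_of_halves is hypothetical, and the sketch assumes the convention used in this article: when the data set has an odd number of points, the median itself is left out of both halves. Statistical libraries often use other interpolation conventions, so their quartiles may differ slightly from the ones computed here.

```python
from statistics import median

def quartiles_median_of_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For an odd number of points, the overall median is excluded
    from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Skip the middle value when n is odd
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(upper_half)

data = [47, 50, 52, 53, 54, 56, 57, 60, 72]
q1, q3 = quartiles_median_of_halves(data)
iqr = q3 - q1  # third quartile minus first quartile

print(q1, q3, iqr)
```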

Check for outliers

Now that we have the interquartile range, we can determine the lower and upper bounds of our data. The formula for the lower bound is Q1 – 1.5(IQR). The formula for the upper bound is Q3 + 1.5(IQR). Plugging in our values, we find that our lower bound is 51 – 1.5(7.5) = 39.75, and our upper bound is 58.5 + 1.5(7.5) = 69.75. This means that any value below 39.75 is a low outlier, and any value above 69.75 is a high outlier. In this case, we have no low outliers, but we have one high outlier, 72.
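The bounds check can be expressed as a short Python sketch. The function name iqr_bounds and its parameter k are illustrative; k defaults to the 1.5 multiplier from the formulas above, and the quartile values are the ones worked out for the example data set.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower_bound, upper_bound) using the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [47, 50, 52, 53, 54, 56, 57, 60, 72]

# Quartiles from the worked example: Q1 = 51, Q3 = 58.5
lower_bound, upper_bound = iqr_bounds(51, 58.5)

low_outliers = [x for x in data if x < lower_bound]
high_outliers = [x for x in data if x > upper_bound]

print(lower_bound, upper_bound)
print(low_outliers, high_outliers)
```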

What to do with outliers

Now that you’ve identified your outliers, what do you do with them? There are a few different things you can do, but doing the correct thing is important to make sure that your results are not overconfident, misleading, or unintentionally biased.

Dropping outliers

In general, you should only drop an outlier from your data set if there is a non-statistical reason that it exists, such as:

  1. Data entry or measurement error: If you have a data point that is physically or logically impossible (such as a human being 200 years old), it is a clear error and should be removed from the data set. However, if it’s possible to correct the error, such as going back into your experimental records to find or reconstruct the correct value, you should do so.
  2. Abnormal experimental conditions: If a data point was collected during unusual circumstances, or during circumstances that do not reflect the target conditions of the study or experiment, you can remove the data. For example, data about normal consumer spending habits collected during a recession will likely not be representative, as consumers change their spending habits during periods of economic downturn.
  3. Not part of the target population: If the data comes from a different population than the one you’re studying, you can safely remove it. For example, if you’re studying a specific species of bird, and data from a different species makes it in, you should exclude that data.

Other methods of handling outliers

If you can’t drop the outlier, one way to keep it from skewing your results is to cap outliers at a certain value: change every value below the lower bound to be equal to the lower bound, and change every value above the upper bound to be equal to the upper bound. You can also use a regression model to predict a replacement value and substitute it for the outlier. The general rule is that unless there’s clear and convincing evidence that an outlier should be removed, it shouldn’t be.
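Capping at the bounds can be sketched in Python like this. The helper name cap_to_bounds is hypothetical, and the bounds are the ones computed for the example data set earlier in this article.

```python
def cap_to_bounds(values, lower_bound, upper_bound):
    """Replace values outside the bounds with the nearest bound."""
    return [min(max(x, lower_bound), upper_bound) for x in values]

data = [47, 50, 52, 53, 54, 56, 57, 60, 72]

# Bounds from the worked example: 39.75 and 69.75
capped = cap_to_bounds(data, 39.75, 69.75)

print(capped)
```

Only the high outlier changes: 72 becomes 69.75, and every value already inside the bounds is left as it was.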

Conclusion

Hopefully, this article has demystified the concept of outliers. While outliers can be confusing at first, with enough practice, anyone can deal with them like a pro. Good luck!

All rights reserved ©2016 - 2026 Achievable, Inc.
