In statistics, outliers are values that differ considerably from the bulk of the data set. As such, they can skew statistical calculations such as the mean and standard deviation (the median is much less sensitive to them). Values that are considered outliers are therefore often removed from data sets before any calculations are done. In this article, we’ll go over how to identify outliers in your data sets.

Take the following data set: 4, 6, 5, 3, 7, 31. In this data set, 31 is an outlier because it’s extremely high compared to the rest of the data.
Let’s take another example, this time in the context of a person’s weekly paycheck: $220, $245, $20, $230. The average of this data set is about $179. But this isn’t representative, because this person usually earns in the low to mid $200s every week. Because the $20 paycheck drags the average down, it looks like the person makes less than they do. If we discard the $20 data point as an outlier and recalculate the average, we get about $232, which is much more representative of this person’s weekly pay.
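To make the arithmetic concrete, here is a minimal Python sketch that recomputes the average with and without the $20 paycheck. The variable names are illustrative, and treating $20 as the outlier is simply the assumption made in the example above.

```python
from statistics import mean

# Weekly paychecks from the example, in dollars
paychecks = [220, 245, 20, 230]

# Average with every paycheck included
mean_all = mean(paychecks)

# Average after discarding the $20 paycheck, which the example treats as an outlier
mean_without_outlier = mean([p for p in paychecks if p != 20])

print(mean_all)                        # 178.75
print(round(mean_without_outlier, 2))  # 231.67
```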
A common way to identify outliers is the interquartile range (IQR) method. To use it, you’ll need a good understanding of how to find the median of a data set. As a review, the median is the middle value of a sorted data set with an odd number of data points, or the average of the middle two values in a sorted data set with an even number of data points.
Let’s use the following data set as our example set: 47, 50, 52, 53, 54, 56, 57, 60, 72
Since the data set has an odd number of data points, the median is the middle value, which in this case is 54.
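The median rule described above can be sketched in a few lines of Python. The function name `median_of` is just an illustrative choice; Python's built-in `statistics.median` does the same job.

```python
def median_of(values):
    """Return the median: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of data points: take the middle value
        return ordered[mid]
    # Even number of data points: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [47, 50, 52, 53, 54, 56, 57, 60, 72]
print(median_of(data))  # 54
```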
Now that we have the median, we can split the data into the values below it and the values above it, and find the median of each half. Since our median is 54, we take every number below 54 (47, 50, 52, 53) and find the median of those data points, which in this case is 51. This is the first quartile (Q1). We then do the same thing for the upper part of the data (56, 57, 60, 72) and get 58.5, the third quartile (Q3). Finally, we calculate the interquartile range (IQR) by subtracting the first quartile from the third quartile: 58.5 – 51 = 7.5.
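As a rough sketch, the quartile steps can be written in Python as follows. It assumes the median-of-halves convention used in this article, where the overall median is left out of both halves when the data set has an odd number of points. Other quartile conventions, such as the default interpolation in `numpy.percentile`, may give slightly different values.

```python
from statistics import median

data = sorted([47, 50, 52, 53, 54, 56, 57, 60, 72])
n = len(data)
half = n // 2

# Values below the median
lower_half = data[:half]
# Values above the median (skip the median itself when n is odd)
upper_half = data[half + 1:] if n % 2 == 1 else data[half:]

q1 = median(lower_half)  # first quartile
q3 = median(upper_half)  # third quartile
iqr = q3 - q1            # interquartile range: Q3 minus Q1

print(q1, q3, iqr)  # 51.0 58.5 7.5
```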
Now that we have the interquartile range, we can determine the lower and upper bounds for our data. The formula for the lower bound is Q1 – 1.5(IQR), and the formula for the upper bound is Q3 + 1.5(IQR). Plugging in our values, 1.5(IQR) is 1.5 × 7.5 = 11.25, so our lower bound is 51 – 11.25 = 39.75 and our upper bound is 58.5 + 11.25 = 69.75. This means that any value below 39.75 is a low outlier, and any value above 69.75 is a high outlier. In this case, we have no low outliers, but we have one high outlier, 72.
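The bound formulas translate directly into code. The sketch below plugs in the Q1 and Q3 values from the worked example (hard-coded here for brevity) and lists any values that fall outside the bounds.

```python
data = [47, 50, 52, 53, 54, 56, 57, 60, 72]

# Quartiles taken from the worked example
q1, q3 = 51.0, 58.5
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything strictly outside the bounds is flagged as an outlier
low_outliers = [x for x in data if x < lower_bound]
high_outliers = [x for x in data if x > upper_bound]

print(lower_bound, upper_bound)  # 39.75 69.75
print(low_outliers)              # []
print(high_outliers)             # [72]
```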
Now that you’ve identified your outliers, what do you do with them? There are a few different things you can do, but doing the correct thing is important to make sure that your results are not overconfident, misleading, or unintentionally biased.
In general, you should only drop an outlier from your data set if there is a non-statistical reason that it exists, such as:
If you can’t drop the outlier, one way to keep it from skewing your results is to cap the outliers at a certain value. One way to do this is to change every value below the lower bound to be equal to the lower bound and change every value above the upper bound to be equal to the upper bound. You can also use a regression model to predict a replacement value and substitute it for the outlier. The general rule is that unless there’s clear and convincing evidence that an outlier should be removed, it shouldn’t be.
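Capping at the bounds, as described above, might look like the following sketch. The helper name `cap_outliers` is hypothetical, and the bounds are the ones computed for the example data set.

```python
def cap_outliers(values, lower_bound, upper_bound):
    """Pull values outside [lower_bound, upper_bound] back to the nearest bound."""
    return [min(max(x, lower_bound), upper_bound) for x in values]

data = [47, 50, 52, 53, 54, 56, 57, 60, 72]
capped = cap_outliers(data, lower_bound=39.75, upper_bound=69.75)

print(capped)  # [47, 50, 52, 53, 54, 56, 57, 60, 69.75]
```

Only 72 changes here, because it is the only value outside the bounds; every other value is left untouched.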
Hopefully, this article has demystified the concept of outliers. While they can be confusing at first, with enough practice, anyone can deal with outliers in their data like a pro. Good luck!