Skew and the Normal Distribution
The first step in analyzing a data set is to examine its distribution. Histograms are a type of graph that scientists use to visualize the variability in measurements within a sample. A histogram shows the frequency (Y-axis) of different measurements (X-axis) within the sample. Each bar covers a range of numeric measurement values (called a bin), and the height of the bar indicates the number of data points with a value within the corresponding bin.
From Kemp, A., Harding, C., Cabral, W., Marini, J., & Wallace, J. (2012). Effects of tissue hydration on nanoscale structural morphology and mechanics of individual Type I collagen fibrils in the Brtl mouse model of Osteogenesis Imperfecta. Journal of Structural Biology, 180. doi:10.1016/j.jsb.2012.09.012
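The binning that a histogram performs can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_counts`, the bin width, the starting value, and the measurements are all made up for the example and do not come from the study cited above.

```python
from collections import Counter


def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin [low, high), where
    low = start + k * bin_width for a whole number k."""
    counts = Counter()
    for v in values:
        k = int((v - start) // bin_width)  # index of the bin containing v
        counts[k] += 1
    return {
        (start + k * bin_width, start + (k + 1) * bin_width): counts[k]
        for k in sorted(counts)
    }


# Hypothetical measurements, invented for this sketch
measurements = [61, 63, 64, 66, 67, 67, 68, 69, 70, 71, 72, 74, 78]
bins = histogram_counts(measurements, bin_width=5, start=60)

# Print a sideways text histogram: one '#' per data point in the bin
for (low, high), count in bins.items():
    print(f"{low}-{high}: {'#' * count}")
```

Each printed row plays the role of one bar: the range on the left is the bin, and the number of `#` marks is the frequency.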
Histograms often display a "normal distribution," which appears as a "bell curve" when graphed. In a normal distribution, the data are spread symmetrically around the mean value. When data fit a normal distribution, "parametric" statistics are appropriate for describing the data.
- Parametric statistics assume that the data is normally distributed. Parametric tests are often used for continuous data and are more likely to detect an effect if it exists compared to nonparametric statistics. However, outliers can significantly affect the results of parametric statistics.
- Nonparametric statistics do not make assumptions about the data distribution and are often used for categorical data or continuous data that is not normally distributed. Outliers have less of an effect on the results of nonparametric statistics.
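The difference in sensitivity to outliers can be seen with a small sketch that compares the mean (a parametric summary) with the median (a nonparametric summary). The trial values below are hypothetical and chosen only to make the effect obvious.

```python
import statistics

# Hypothetical trial data; the second list replaces one value with an outlier
trials = [12, 13, 13, 14, 15]
trials_with_outlier = [12, 13, 13, 14, 50]

# How far each summary moves when the outlier is introduced
mean_shift = statistics.mean(trials_with_outlier) - statistics.mean(trials)
median_shift = statistics.median(trials_with_outlier) - statistics.median(trials)

print(f"mean moved by {mean_shift:.1f}")
print(f"median moved by {median_shift:.1f}")
```

In this example a single extreme value moves the mean by several units while the median does not move at all.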
Skewness is a measure of how asymmetrical a data distribution is, and so it describes one way in which data can depart from a normal distribution. If the distribution of data for a variable stretches toward the right or left tail of the frequency distribution, then the distribution is characterized as skewed. As the data become more skewed, the mean is pulled toward the longer tail and loses its ability to provide the best measure of central tendency.
How do I know if my data is skewed?
You’ll calculate skew separately for the trials at each level of your manipulated variable. So, if you had five levels of manipulation, you’ll calculate skew five times, once with the data for each level.
- Use an online SKEW calculator. Here’s a good option. Be sure to include a citation in your paper if you use this tool.
- Use the “SKEW” function in Microsoft Excel. Here are the directions.
- Use the “SKEW” function in Google Sheets. Here are the directions.
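If you would rather compute skew in code, the calculation can be sketched in Python as follows. This sketch assumes that the spreadsheet SKEW functions use the adjusted sample skewness formula, n / ((n - 1)(n - 2)) times the sum of ((x - mean) / s) cubed, where s is the sample standard deviation; check the result against your spreadsheet before relying on it. The function name `sample_skewness` and the trial data are hypothetical.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness (assumed to match the spreadsheet SKEW function)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values to compute skewness")
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1)
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are identical")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


# Hypothetical trials for two levels of a manipulated variable
levels = {
    "level 1": [4.1, 4.3, 4.4, 4.6, 4.8],
    "level 2": [5.0, 5.1, 5.1, 5.3, 7.9],
}

# Skew is calculated once per level, using only that level's trials
skew_by_level = {name: sample_skewness(trials) for name, trials in levels.items()}
for name, skew in skew_by_level.items():
    print(f"{name}: skew = {skew:.2f}")
```

The second level includes one unusually large trial, so its skew comes out positive, which reflects a longer right tail.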
A negative skewness indicates a greater number of values larger than the mean, whereas a positive skewness indicates a greater number of values smaller than the mean.
The absolute value of the skew number indicates just how skewed the data is. If skewness = 0, the data are perfectly symmetrical, as they would be in a perfect normal distribution. However, a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret the skewness number? Hair et al. (2022) suggest this rule of thumb:
- If skewness is between −.5 and +.5, the distribution is approximately symmetrical.
- If skewness is between −1 and −.5 or between +.5 and +1, the distribution is slightly skewed.
- If skewness is between −2 and −1 or between +1 and +2, the distribution is moderately skewed but acceptable.
- Skew values beyond −2 and +2 are considered indicative of substantial nonnormality.
A skew with an absolute value of 2.0 or less is fine for using parametric statistics such as the mean as a measure of central tendency. If the absolute value of the skew is more than 2.0, consider using nonparametric statistics such as the median to represent the data.
Hair, J. F., Hult, G. T. M., Ringle, C. M., & Sarstedt, M. (2022). A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM) (3rd ed.). Thousand Oaks, CA: Sage.
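The rule of thumb can be written as a small lookup, sketched here in Python. The function name `describe_skew` is hypothetical, and the handling of values that fall exactly on a boundary (0.5, 1, or 2) is an assumption, since the rule does not say which category they belong to.

```python
def describe_skew(skewness):
    """Return (description, suggested measure of central tendency)
    for a skewness value, following the rule of thumb above."""
    size = abs(skewness)  # only the magnitude matters for the categories
    if size <= 0.5:
        shape = "approximately symmetrical"
    elif size <= 1:
        shape = "slightly skewed"
    elif size <= 2:
        shape = "moderately skewed but acceptable"
    else:
        shape = "substantially nonnormal"
    # Boundary choice is an assumption: exactly 2.0 still counts as acceptable
    measure = "mean" if size <= 2 else "median"
    return shape, measure


# Hypothetical skew values for three levels of a manipulated variable
for value in (0.36, -1.4, 2.3):
    shape, measure = describe_skew(value)
    print(f"skew {value:+.2f}: {shape}; report the {measure}")
```

The sign of the skew is ignored when choosing a category because the rule depends only on the absolute value; the sign still tells you which tail is longer.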