Talking Data: Some Basic yet Useful Statistics in Data Analysis
Introduction
"All models are wrong, but some are useful."--George Box
For many people, data analysis is familiar yet inaccessible. Newcomers to data think that fancy dashboards look cool. Operations managers think that time-series charts can help them make business decisions. Programmers think that data analysis is nothing more than fetching target fields from a database according to given requirements. These views are all correct, but incomplete. Truly useful data analysis does not just present charts and numbers; it combines insights from the data with business knowledge in a way that adds value to the business. A grasp of some basic statistics helps in discovering those insights.
Unreliable Averages
Many data reports display daily, weekly, or monthly averages, such as the average daily sales for the current month or the average monthly visits for last year. The average is a helpful statistic in some situations, such as the time you get up each morning or how far your shots land from a target. More often, though, you should be skeptical of the average, because the underlying values fluctuate widely around it. The root cause is that many real-world quantities follow skewed, non-normal distributions (loosely called non-linear distributions in this article). Website response times, page-visit counts, and stock price movements are all distributed this way. For such distributions the average is misleading, because a long tail of outliers pulls it far away from the typical value. As shown in the figure below, a normal (Gaussian) distribution is symmetric, so its average sits at the peak in the middle. A gamma distribution is skewed, so its average lies well away from its peak, and the more outliers there are, the further the average drifts from the center.
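The effect can be reproduced with a small simulation. The sketch below is illustrative only: the sample sizes, distribution parameters, and the injected outlier value of 5000 are assumptions chosen for the demonstration, not figures from any real report. It uses only Python's standard library.

```python
import random
import statistics


def mean_and_median(samples):
    """Return the (mean, median) pair for a list of numbers."""
    return statistics.mean(samples), statistics.median(samples)


# Fixed seed so the run is repeatable (the seed value is arbitrary).
rng = random.Random(42)

# Symmetric case: normal distribution centered on 100 (assumed parameters).
normal_samples = [rng.gauss(100, 15) for _ in range(10_000)]

# Skewed case: gamma distribution with shape 2 and scale 50,
# whose theoretical mean is also 100 (assumed parameters).
gamma_samples = [rng.gammavariate(2.0, 50.0) for _ in range(10_000)]

normal_mean, normal_median = mean_and_median(normal_samples)
gamma_mean, gamma_median = mean_and_median(gamma_samples)

# Add 1% extreme values (hypothetical outliers) to the skewed data.
gamma_with_outliers = gamma_samples + [5000.0] * 100
outlier_mean, outlier_median = mean_and_median(gamma_with_outliers)

print(f"normal:           mean={normal_mean:.1f}  median={normal_median:.1f}")
print(f"gamma:            mean={gamma_mean:.1f}  median={gamma_median:.1f}")
print(f"gamma + outliers: mean={outlier_mean:.1f}  median={outlier_median:.1f}")
```

For the normal samples the mean and median should land almost on top of each other. For the gamma samples the mean sits noticeably above the median, and adding a handful of extreme values moves the mean much further while the median barely changes.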

Therefore, for skewed distributions like these, the average is not a reasonable indicator, and the median describes the overall distribution better. One of the many tools for working with such distributions is the box plot. As shown in the figure below, each of the two distributions is reduced to a box and a few lines: the line inside the box is the median, and the edges of the box are the first and third quartiles. This gives a quick, general picture of the distribution without much complicated analysis.
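The numbers behind a box plot are easy to compute by hand. The sketch below is a minimal example under stated assumptions: the `box_plot_summary` helper and the response-time values are hypothetical, and the whiskers follow the common 1.5 × IQR convention, which the text above does not specify. Plotting libraries may use slightly different quartile rules.

```python
import statistics


def box_plot_summary(values):
    """Compute the statistics a box plot draws (hypothetical helper).

    Whiskers use the common 1.5 * IQR rule; this is an assumption,
    other conventions exist.
    """
    # Three cut points that split the data into four equal parts.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,                      # lower edge of the box
        "median": median,              # line inside the box
        "q3": q3,                      # upper edge of the box
        "whisker_low": min(inside),    # lowest point within the fences
        "whisker_high": max(inside),   # highest point within the fences
        "outliers": sorted(v for v in values if v < lower_fence or v > upper_fence),
    }


# Hypothetical response times in milliseconds, with one slow request.
response_times_ms = [120, 130, 125, 140, 135, 150, 145, 128, 132, 2400]

summary = box_plot_summary(response_times_ms)
average = statistics.mean(response_times_ms)

print(f"mean   = {average}")            # 360.5, dragged up by one value
print(f"median = {summary['median']}")  # 133.5, close to a typical request
print(f"box    = {summary['q1']} .. {summary['q3']}")
print(f"outliers = {summary['outliers']}")
```

A single slow request pushes the mean to 360.5 ms, while the median stays at 133.5 ms and the box (128.5 to 143.75) still describes where most requests fall. The slow request shows up separately as an outlier instead of distorting the summary.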






