Data analysts look for descriptors in data to generate insights.
For a Data aggregator, descriptive attributes of data like size, speed, heterogeneity, lineage, provenance, and usefulness are essential to decide the storage infrastructure scale, data life cycle, and data quality. These aggregator-oriented descriptions are black-box perspectives.
For a Data analyst, descriptive statistics, patterns, and relationships are essential to generate actionable insights. These analyst-oriented descriptions are white-box perspectives. The analysts then use inferential methods to test various hypotheses.
Descriptive Statistics
Data analysts usually work with a significant sample of homogenous records to statistically analyze features. The typical descriptive statistics are – measures of location, measures of center, measures of skewness, and measures of spread.
E.g., A 23 member cricket team of three different states has players of the following ages:
Karnataka: [19,19,20,20,20,21,21,21,21,22,22,22,22,22,22,23,23,23,23,24,24,24,25,25]
Kerala: [19,19,20,20,20,21,21,21,22,22,22,22,23,23,23,23,23,24,24,24,24,24,24]
Maharashtra: [19,19,19,19,19,19,20,20,20,20,20,21,21,21,21,22,22,22,23,23,24,24,25]
Numbers represented this way does not help us detect patterns or explain the data. So, it’s typical to see the tabular distribution view:
| AGE | Karnataka | Kerala | Maharashtra |
| 19 | 2 | 2 | 6 |
| 20 | 3 | 3 | 5 |
| 21 | 4 | 3 | 4 |
| 22 | 5 | 4 | 3 |
| 23 | 4 | 5 | 2 |
| 24 | 3 | 6 | 2 |
| 25 | 2 | 0 | 1 |
This distribution view is better. So, we would like to see measures of center for this data. These are usually – MEAN, MEDIAN, and MODE.
- MEAN is the average (Sum Total / # Total)
- MEDIAN is the middle number
- MODE is the highest frequency number
| Measure | Karnataka | Kerala | Maharashtra |
| MEAN | 22 | 22.1 | 21 |
| MEDIAN | 22 | 22 | 21 |
| MODE | 22 | 24 | 19 |
This description is much better. So, we would like to see this graphically to understand the skewness.

The age distribution is symmetrical for Karnataka, skewed to the left for Kerala, and skewed to the right for Maharashtra. The data analyst may infer that Karnataka prefers a good mix of ages, Kerala prefers player experience, and Maharashtra prefers the young.
The data analyst may also be interested in standard deviation, i.e., a measure of spread. The standard deviation symbol is sigma (σ) for a sample and is the MEAN distance from the mean value of all values in the sample. Since a distance can be positive or negative, the distance is squared, and the result is square-rooted.
| Measure | Karnataka | Kerala | Maharashtra |
| Standard Deviation | 1.8 | 1.7 | 1.8 |
In our example, a measure of location (quartiles, percentiles) is also of interest to the data analyst.
| Percentile | Karnataka | Kerala | Maharashtra |
| 25 Percentile | 21 | 21 | 19.5 |
| 50 Percentile | 22 | 22 | 21 |
| 75 Percentile | 23 | 23.5 | 22 |
| 100 Percentile | 25 | 24 | 25 |
The table above shows that the 50 percentile value is the median, and the 100 percentile is the maximum value. This location measure is helpful if the values were scores (like in an exam).
Combining statistics and display to explain the data is the art of descriptive statistics. There are several statistics beyond the ones described in this short blog post that could be useful for data analysis.
Time-series Data Patterns
The time-series data has trends, variations and noise.
- A trend is the general direction (up, down, flat) in data over time.
- Cyclicity variation is the cyclic peaks and troughs in data over time.
- Seasonality variation is the periodic predictability of a peak/trough in data.
- Noise is meaningless information in data.
The diagrams below provide a visual explanation:

“Ice cream sales are trending upward,” claims an excited ice-cream salesman.

“Used Car value is trending downward,” warns the car salesman

“Every business has up and down cycles, but my business is trending upwards,” states a businessman.

“It’s the end of the month, so, Salary and EMI season in user accounts, so the transaction volume will be high,” claims the banker.

“There is some white noise in the data,” declared the data scientist.
Data Relationships
Data analysts seek to understand relationships between different features in a data set using statistical regression analysis.
There could be a causal (cause and effect) relationship or simply a correlation. This relationship analysis helps to build predictors.
A simple measure of linear relationship is the correlation coefficient. The measure is not relevant for non-linear relationships. Correlation coefficient of two variables x and y is is calculates as:
correlation(x, y) = covariance(x, y) / (std-dev(x) * std-dev(y))
It’s a number that in the range [-1,1]. Any number closer to zero implies no correlation, and closer to either extremity means higher linear correlation.
- Negative one (-1) means negatively linearly correlated
- Positive one (1) means positively linearly correlated
- Zero (0) means no correlation
Example: Let’s take this random sample.
| X | Y | Y1 | Y2 | Y3 |
| 1 | 3 | -3 | 83 | -100 |
| 2 | 8 | -8 | 108 | 250 |
| 3 | 15 | -15 | 146 | -50 |
| 4 | 24 | -24 | 98 | 150 |
| 5 | 35 | -35 | 231 | -50 |
| 6 | 48 | -48 | 220 | 155 |
| 7 | 63 | -63 | 170 | -125 |
| 8 | 80 | -80 | 100 | -150 |
| 9 | 99 | -99 | 228 | -12 |
| 10 | 120 | -120 | 234 | 190 |
| X and Y | X and Y1 | X and Y2 | X and Y3 |
| 1 | -1 | 0.6 | 0 |
Visually, we can see that as X increases, Y increases linearly, and Y1 decreases linearly. Hence, the correlation coefficient is positive (1) and negative (-1), respectively. There is no linear relation between X and Y3, and hence, the correlation is 0. The relationship between X and Y2 is somewhere in between with a positive correlation coefficient.

If X is the number of hours bowler practices and Y2 is the number of wickets, then the correlation between the two can be considered positive.
If X is the number of hours bowler practices and Y3 is the audience popularity score, then the correlation between the two can be considered negligible.
If X is the number of years a leader leads a nation, and Y or Y1 is his popularity index, then the correlation between the two can be considered linearly increasing or decreasing, respectively.
Summary
Data analysts analyze data to generate insights. Insights could be about understanding the past or using the past to predict the (near) future. Using statistics and visualization, the data analysts describe the data and find relationships and patterns. These are then used to tell the story or take actions informed by data.