Data Descriptors (Stats, Relations, Patterns)

Data analysts look for descriptors in data to generate insights.

For a Data aggregator, descriptive attributes of data like size, speed, heterogeneity, lineage, provenance, and usefulness are essential to decide the storage infrastructure scale, data life cycle, and data quality. These aggregator-oriented descriptions are black-box perspectives.

For a Data analyst, descriptive statistics, patterns, and relationships are essential to generate actionable insights. These analyst-oriented descriptions are white-box perspectives. The analysts then use inferential methods to test various hypotheses.

Descriptive Statistics

Data analysts usually work with a significant sample of homogeneous records to statistically analyze features. The typical descriptive statistics are measures of location, measures of center, measures of skewness, and measures of spread.

E.g., the 23-member cricket teams of three different states have players of the following ages:

Karnataka: [19,19,20,20,20,21,21,21,21,22,22,22,22,22,23,23,23,23,24,24,24,25,25]

Kerala: [19,19,20,20,20,21,21,21,22,22,22,22,23,23,23,23,23,24,24,24,24,24,24]

Maharashtra: [19,19,19,19,19,19,20,20,20,20,20,21,21,21,21,22,22,22,23,23,24,24,25]

Numbers represented this way do not help us detect patterns or explain the data. So, it’s typical to see the tabular distribution view:

AGE   Karnataka   Kerala   Maharashtra
19        2          2          6
20        3          3          5
21        4          3          4
22        5          4          3
23        4          5          2
24        3          6          2
25        2          0          1
Age Distribution of State Players
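The frequency distribution above can be rebuilt in a few lines of Python with `collections.Counter` (a sketch; the Karnataka list is written as counts taken from the table):

```python
from collections import Counter

# Ages of the 23 Karnataka players, written as counts from the table above
karnataka = [19]*2 + [20]*3 + [21]*4 + [22]*5 + [23]*4 + [24]*3 + [25]*2

# Counter tallies each age into the frequency distribution
age_distribution = Counter(karnataka)

for age in sorted(age_distribution):
    print(age, age_distribution[age])
```

The same one-liner works for the other two states.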

This distribution view is better. Next, we would like to see measures of center for this data. These are usually MEAN, MEDIAN, and MODE.

  • MEAN is the average (sum of all values / number of values)
  • MEDIAN is the middle value when sorted
  • MODE is the most frequent value
Measure   Karnataka   Kerala   Maharashtra
MEAN         22        22.1        21
MEDIAN       22         22         21
MODE         22         24         19
Measures of Center
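These measures can be reproduced with Python’s standard `statistics` module (a minimal sketch; the team lists encode the counts from the distribution table above):

```python
import statistics

# Player ages per state, written as counts from the distribution table
teams = {
    "Karnataka":   [19]*2 + [20]*3 + [21]*4 + [22]*5 + [23]*4 + [24]*3 + [25]*2,
    "Kerala":      [19]*2 + [20]*3 + [21]*3 + [22]*4 + [23]*5 + [24]*6,
    "Maharashtra": [19]*6 + [20]*5 + [21]*4 + [22]*3 + [23]*2 + [24]*2 + [25],
}

for state, ages in teams.items():
    print(state,
          round(statistics.mean(ages), 1),  # MEAN
          statistics.median(ages),          # MEDIAN
          statistics.mode(ages))            # MODE
```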

This description is much better. So, we would like to see this graphically to understand the skewness.

Measuring skewness

The age distribution is symmetrical for Karnataka, skewed to the left for Kerala, and skewed to the right for Maharashtra. The data analyst may infer that Karnataka prefers a good mix of ages, Kerala prefers player experience, and Maharashtra prefers the young.

The data analyst may also be interested in the standard deviation, i.e., a measure of spread. The symbol is s for a sample (σ, sigma, for a population), and it measures the typical distance of values from the mean. Since a distance can be positive or negative, each distance is squared, the squared distances are averaged, and the result is square-rooted.

Measure              Karnataka   Kerala   Maharashtra
Standard Deviation      1.8        1.7        1.8
Measure of Spread
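A sketch of how these values can be reproduced; `statistics.stdev` computes the sample standard deviation (the n-1 formula), which matches the rounded values in the table:

```python
import statistics

# Player ages per state, written as counts from the distribution table
teams = {
    "Karnataka":   [19]*2 + [20]*3 + [21]*4 + [22]*5 + [23]*4 + [24]*3 + [25]*2,
    "Kerala":      [19]*2 + [20]*3 + [21]*3 + [22]*4 + [23]*5 + [24]*6,
    "Maharashtra": [19]*6 + [20]*5 + [21]*4 + [22]*3 + [23]*2 + [24]*2 + [25],
}

for state, ages in teams.items():
    # stdev() uses the sample (n-1) denominator; pstdev() is the population version
    print(state, round(statistics.stdev(ages), 1))
```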

In our example, a measure of location (quartiles, percentiles) is also of interest to the data analyst.

Percentile         Karnataka   Kerala   Maharashtra
25th Percentile        21        21        19.5
50th Percentile        22        22        21
75th Percentile        23       23.5       22
100th Percentile       25        24        25
Measure of Location

The table above shows that the 50th percentile value is the median, and the 100th percentile is the maximum value. This location measure is helpful when the values are scores (as in an exam).
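A sketch of how these quartiles can be reproduced with `statistics.quantiles`; the `method='inclusive'` option matches spreadsheet-style percentiles and the table above:

```python
import statistics

# Player ages per state, written as counts from the distribution table
teams = {
    "Karnataka":   [19]*2 + [20]*3 + [21]*4 + [22]*5 + [23]*4 + [24]*3 + [25]*2,
    "Kerala":      [19]*2 + [20]*3 + [21]*3 + [22]*4 + [23]*5 + [24]*6,
    "Maharashtra": [19]*6 + [20]*5 + [21]*4 + [22]*3 + [23]*2 + [24]*2 + [25],
}

for state, ages in teams.items():
    # n=4 cuts the data into quartiles: 25th, 50th, 75th percentiles
    q25, q50, q75 = statistics.quantiles(ages, n=4, method="inclusive")
    print(state, q25, q50, q75, max(ages))  # max(ages) is the 100th percentile
```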

Combining statistics and display to explain the data is the art of descriptive statistics. There are several statistics beyond the ones described in this short blog post that could be useful for data analysis.

Time-series Data Patterns

Time-series data has trends, variations, and noise.

  1. A trend is the general direction (up, down, flat) of data over time.
  2. Cyclic variation is the repeating peaks and troughs in data over time.
  3. Seasonal variation is the periodic, predictable peak/trough in data.
  4. Noise is meaningless information in data.
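These components can be illustrated with a small synthetic series; all magnitudes here are illustrative assumptions. A 12-month moving average cancels the seasonal cycle and most of the noise, leaving the trend visible:

```python
import math
import random

random.seed(7)  # fixed seed so the "noise" is reproducible

# Synthetic monthly series: upward trend + yearly seasonality + noise
months = range(48)
series = [10 + 0.5 * t                          # trend: up 0.5 per month
          + 3 * math.sin(2 * math.pi * t / 12)  # seasonality: 12-month cycle
          + random.uniform(-1, 1)               # noise
          for t in months]

# A 12-month moving average sums one full seasonal cycle, so the
# seasonal component cancels and the trend remains
trend = [sum(series[t:t + 12]) / 12 for t in range(len(series) - 11)]
```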

The diagrams below provide a visual explanation:

“Ice cream sales are trending upward,” claims an excited ice-cream salesman.

“Used Car value is trending downward,” warns the car salesman.

“Every business has up and down cycles, but my business is trending upwards,” states a businessman.

“It’s the end of the month, so it’s salary and EMI season in user accounts, and the transaction volume will be high,” claims the banker.

“There is some white noise in the data,” declared the data scientist.

Data Relationships

Data analysts seek to understand relationships between different features in a data set using statistical regression analysis.

There could be a causal (cause and effect) relationship or simply a correlation. This relationship analysis helps to build predictors.

A simple measure of linear relationship is the correlation coefficient. The measure is not relevant for non-linear relationships. The correlation coefficient of two variables x and y is calculated as:

correlation(x, y) = covariance(x, y) / (std-dev(x) * std-dev(y))

It’s a number in the range [-1, 1]. A value closer to zero implies weaker linear correlation, and a value closer to either extreme implies stronger linear correlation.

  • Negative one (-1) means perfectly negatively linearly correlated
  • Positive one (1) means perfectly positively linearly correlated
  • Zero (0) means no linear correlation
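The formula above translates directly into Python (a minimal sketch using population standard deviations; the n in the denominators cancels, so sample statistics give the same result):

```python
import math

def correlation(x, y):
    """Pearson correlation: covariance(x, y) / (std-dev(x) * std-dev(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

print(correlation([1, 2, 3], [2, 4, 6]))  # perfectly positive: 1.0
print(correlation([1, 2, 3], [6, 4, 2]))  # perfectly negative: -1.0
```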

Example: Let’s take this random sample.

X      Y     Y1     Y2     Y3
1      3     -3     83    -100
2      8     -8    108     250
3     15    -15    146     -50
4     24    -24     98     150
5     35    -35    231     -50
6     48    -48    220     155
7     63    -63    170    -125
8     80    -80    100    -150
9     99    -99    228     -12
10   120   -120    234     190
Sample Data
              X and Y   X and Y1   X and Y2   X and Y3
Coefficient      1         -1         0.6         0
Correlation coefficient

Visually, we can see that as X increases, Y increases steadily, and Y1 decreases steadily. Hence, the correlation coefficients are (close to) positive one (1) and negative one (-1), respectively. There is no linear relation between X and Y3, and hence, the correlation is close to 0. The relationship between X and Y2 lies somewhere in between, with a moderate positive correlation coefficient.

Scatter plot X against (Y, Y1, Y2, Y3)

If X is the number of hours a bowler practices and Y2 is the number of wickets taken, then the correlation between the two can be considered moderately positive.

If X is the number of hours a bowler practices and Y3 is the audience popularity score, then the correlation between the two can be considered negligible.

If X is the number of years a leader leads a nation, and Y or Y1 is the leader’s popularity index, then the correlation between the two can be considered strongly positive or strongly negative, respectively.

Summary

Data analysts analyze data to generate insights. Insights could be about understanding the past or using the past to predict the (near) future. Using statistics and visualization, data analysts describe the data and find relationships and patterns. These are then used to tell a story or take actions informed by data.

V’s of Data

Volume, Velocity, Variety, Veracity, Value, Variability, Visibility, Visualization, Volatility, Viability

What are the 3C’s of Leadership? “Competence, Commitment, and Character,” said the wise.

What are the 3C’s of Thinking? “Critical, Creative, and Collaborative,” said the wise.

What are the 3C’s of Marketing? “Customer, Competitors, and Company,” said the wise.

What are the 3C’s of Managing Team Performance? “Cultivate, Calibrate, and Celebrate,” said the wise.

What are the 3C’s of Data? “Consistency, Correctness, and Completeness,” said the wise; “Clean, Current, and Compliant,” said the more intelligent; “Clear, Complete, and Connected,” said the smartest.

“Depends,” said the Architect. Technologists describe data properties in the context of use. Gartner coined the 3V’s (Volume, Velocity, and Variety) to create hype around BIG Data. These V’s have grown in volume 🙂

  • 5V’s: Volume, Velocity, Variety, Veracity, and Value
  • 7V’s: Volume, Velocity, Variety, Veracity, Value, Visualization, and Visibility

This ‘V’ model seems like blind men describing an elephant. A humble engineer uses better words to describe data properties.

Volume: Multi-Dimensional, Size

“Volume” is typically understood in three dimensions. Data is multi-dimensional and stored as bytes; a disk volume stores data of all sizes. Data does not have volume! It has dimensions and size.

A person’s record may include age, weight, height, eye color, and other dimensions. The size of the record may be 24 bytes. When a BILLION person records are stored, the size is 24 BILLION bytes.
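The record-size arithmetic can be sketched with Python’s `struct` module; the field layout (`<Bff15s`) is purely an illustrative assumption that happens to total 24 bytes:

```python
import struct

# Hypothetical fixed-size person record (layout is an assumption):
# age as uint8 (1 byte), weight and height as float32 (4 bytes each),
# eye color as a 15-byte string; '<' disables alignment padding.
RECORD_FORMAT = "<Bff15s"
record_size = struct.calcsize(RECORD_FORMAT)  # 24 bytes

# Storing a BILLION such records
billion = 10**9
total_bytes = record_size * billion
print(record_size, total_bytes)
```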

Velocity: Speed, Motion

Engineers understand the term velocity as a vector and not a scalar.

A heart rate monitor may generate data at different speeds, e.g., 82 beats per minute. I can’t say my heart rate is 82 beats per minute to the northwest; hence, heart rate is a speed, not a velocity. I can say that a car is traveling 35 kilometers per hour to the northwest. The velocity of the vehicle is 35 km/h NW.

Data does not have direction; hence it does not have velocity. Data in motion has speed.

Variety: Heterogeneity

The word variety is used to describe differences in an object type, e.g., egg curry varieties, pancake varieties, sofa varieties, TV varieties, image data format varieties (JPG, PNG, BMP), and data structure varieties (structured, unstructured, semi-structured). Data variety is abstract and is a marketecture term.

Heterogeneity is preferred because it explicitly states that:

  1. Data has types (E.g., String, Integer, Float, Boolean)
  2. Composite types are created by composing other data types (E.g., A Person Type)
  3. Composite types could be structured, unstructured, or semi-structured (E.g., A Person Type is semi-structured as the person’s address is a String type)
  4. Collections contain the same or different data types.
  5. Types, Composition, and Collections apply to all data (BIG or not).
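The first four points can be sketched in Python; the `Person` type and its field names are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import List, Union

# A composite "Person" type composed of primitive types
@dataclass
class Person:
    name: str       # String
    age: int        # Integer
    height: float   # Float
    active: bool    # Boolean
    address: str    # free-form String makes the record semi-structured

# Collections may hold the same type...
people: List[Person] = [Person("Asha", 24, 1.62, True, "12 MG Road, Bengaluru")]

# ...or different types (a heterogeneous collection)
mixed: List[Union[int, str, Person]] = [42, "forty-two", people[0]]
```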

Veracity: Lineage, Provenance

Veracity means accuracy, precision, and truthfulness.

Let’s say that a weighing scale reports the weight of a person as 81.5 KG. Is this accurate? Is the weighing scale calibrated? If the same person measures her weight on another weighing scale, the reported weight might be 81.45 KG. The truth may be 81.455 KG.

Data represent facts, and when new facts are available, the truth may change. Data cannot be truthful; it’s just facts. Meaning or truthfulness is derived using a method.

Lineage and provenance metadata about data enables engineers to decorate a fact with other useful facts:

  1. Primary source of the data
  2. Users or systems that contributed to the data
  3. Date and time of data collection
  4. Data creation method
  5. Data collection method
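A sketch of how this metadata might be attached to a fact; the `Provenance` structure and its field names are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

# Lineage/provenance metadata decorating a fact (illustrative field names)
@dataclass
class Provenance:
    primary_source: str     # 1. primary source of the data
    contributors: tuple     # 2. users/systems that contributed
    collected_at: str       # 3. date and time of collection (ISO 8601)
    creation_method: str    # 4. how the data was created
    collection_method: str  # 5. how the data was collected

# The weighing-scale fact from above, decorated with provenance
weight_fact = {
    "value_kg": 81.5,
    "provenance": Provenance(
        primary_source="bathroom-scale-A",
        contributors=("scale-firmware-1.2",),
        collected_at="2022-03-01T07:30:00Z",
        creation_method="sensor measurement",
        collection_method="manual reading",
    ),
}
```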

Value: Useful

If Data is a bunch of facts, how can it be valuable? Understandably, the information generated from data by analyzing the facts is valuable. Data (facts) can either be useful to create valuable information or useless and discarded. We associate a cost to a brick and a value to a house. Data is like bricks used to build valuable information/knowledge.

Summary

I did not go into every V, but you get the drill. If an interviewer asks you about the 5V’s, I request you to give the standard marketecture answer for their sanity. The engineer’s vocabulary is not universal; technical journals publish articles in the sales/marketing vocabulary. As engineers/architects, we have to remember the fundamental descriptive properties of data so that the marketecture vocabulary does not fool us. At the same time, we have to internalize the marketecture vocabulary while staying internally consistent with engineering principles.

It’s not a surprise that Gartner invented the hype cycle.