Data Aggregation (Map, Filter, Reduce)

Data engineers think in batches!

Thinking in batches reminds me of a famous childhood story.

Once upon a time, a long, long time ago, there was a kind and gentle King. He ruled people beyond the horizon, and his subjects loved him.

One day, a tired-looking village man came to the King and said, “Dear King, help us. I am from a village beyond the horizon. It’s been raining for several days. My village chief asked me to fetch help from you before disaster strikes. It took me five days to walk to the Kingdom, and I am tired but glad that I could deliver this message to you.”

“I am glad that you came for help. I will send Suppandi, my loyal Chief of Defence, to assess the damage and then send help,” said the King. “Suppandi, you have your orders. Now, go. Assess the damage, report to me, and help,” ordered the King.

Suppandi left for the village beyond the horizon on his fastest horse. When he reached the village, it was flooded, and Suppandi felt the urge to return to the King quickly to inform him about the floods. So he rode his horse faster and reached the Kingdom in half a day. He went to the King and said, “Dear King, the village is flooded. I went in a day and came back in half a day to give you this information.”

Suppandi was pleased with himself. However, the King wanted more information. “Suppandi, please tell me: do the people in the village have food? Are the children hurt? What more can we do to help?”

“I will find out, Dear King,” said Suppandi. He left again on his fastest horse. This time he reached the village in half a day. He found that the people had no food, and that many children were hurt and homeless. He raced back to the Kingdom. “Dear King, I reached the village in half a day and came back in another half. The villagers don’t have food to eat, and they are hungry. Several children are hurt and need medical attention,” said Suppandi.

This time the King had more questions. “Dear Suppandi, what did the village chief say? What can we do for him?”

“Dear King, I will find out. Let me leave for the village immediately,” said Suppandi.

Chanakya had been listening in on the conversation eagerly. He told Suppandi, “Dear Suppandi, you must be tired. Let me take over. Take some rest.”

Immediately, Chanakya ordered his men to collect food, water, clothes, and medicines, and to gather doctors. He asked for the fastest horses, and along with several men and doctors, he left for the village beyond the horizon. When he arrived, the village was flooded, and people were on the terraces of their homes. He found several houses destroyed, hungry children taking shelter under the trees, and many wounded villagers.

He ordered his men to rescue the villagers stranded by the flood, protect all the children, feed them, and take them to a safe place. He also called on the doctors to attend to the wounded.

The men built a temporary home outside the village to shelter the homeless. They waited a few days for the rain and the flood to subside. When it was bright and sunny, Chanakya, his men, and the villagers cleaned the village, rebuilt the homes, and stocked enough food and grain for six months before saying goodbye.

Chanakya reached the Kingdom and immediately reported to the King. The King was anxious. He said, “Chanakya, you were gone for two weeks with no message from you. I was worried. Did you speak to the village Chief?”

“Dear King, yes, on your behalf, I spoke to the village chief. I found that the village was flooded, so we rescued all the villagers, attended to the wounded, fed them, rebuilt their homes, and left food and grain for six months. The people have lost their belongings in the flood, but all of them are safe, and they have sent their wishes and blessings for your timely help,” said Chanakya.

The King was pleased. “Chanakya, I should have sent you earlier. You are a batch thinker! Thank you,” said the King.

Suppandi was disappointed. He had worked hard to ride to the village and report to the King as instructed, but Chanakya got all the praise. To this day, he still does not understand and is hurt.

Most non-data engineers are like Suppandi; they use programming constructs like “for,” “if,” “while,” and “do” on remote data, one element per trip. Most data engineers are like Chanakya; they use constructs like “map,” “filter,” “reduce,” and “forEach” on the whole collection. Programming with data is largely functional/declarative, while traditional programming is largely imperative.

There is nothing wrong with acting like Suppandi; he is the Chief of Defence, after all. But some cases require Chanakya thinking. In architectural terms, Suppandi-style actions move data to the algorithm, and Chanakya-style actions move the algorithm to the data. The latter works better when there is distance, and therefore a travel cost, between data and algorithms.
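To make the travel cost concrete, here is a minimal sketch that counts round trips. The `makeRemoteStore` helper, its methods, and the villager records are all hypothetical, an in-memory stand-in for a remote data source rather than any real API.

```javascript
// Hypothetical in-memory stand-in for a remote data store.
// Every method call that touches the data counts as one round trip.
function makeRemoteStore(records) {
  let roundTrips = 0;
  return {
    size() { roundTrips++; return records.length; },
    fetchOne(i) { roundTrips++; return records[i]; },       // data travels to the algorithm
    run(algorithm) { roundTrips++; return algorithm(records); }, // algorithm travels to the data
    trips() { return roundTrips; },
  };
}

// Made-up records, for illustration only.
const villagers = [
  { name: "A", hurt: true },
  { name: "B", hurt: false },
  { name: "C", hurt: true },
  { name: "D", hurt: false },
];

// Suppandi style: one trip per question.
const suppandiStore = makeRemoteStore(villagers);
let hurtCountSuppandi = 0;
const villagerCount = suppandiStore.size();
for (let i = 0; i < villagerCount; i++) {
  if (suppandiStore.fetchOne(i).hurt) hurtCountSuppandi++;
}

// Chanakya style: send the whole plan once.
const chanakyaStore = makeRemoteStore(villagers);
const hurtCountChanakya = chanakyaStore.run(rs => rs.filter(r => r.hurt).length);

console.log(hurtCountSuppandi, suppandiStore.trips()); // 2 5
console.log(hurtCountChanakya, chanakyaStore.trips()); // 2 1
```

Both approaches get the same answer; the difference is five trips versus one, and that gap grows with the size of the data.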

This difference in thinking is one reason data engineers reach for SQL while traditional engineers reach for C# or Java. SQL uses declarative commands that are sent to the database, which pipelines a set of actions on the data. Conventional programming languages have caught up with the declarative paradigm by supporting lambda functions (arrow functions) and map/filter/reduce style functions on collections. Because these functions say what to compute rather than how to loop, a compiler, interpreter, or framework is free to spread the work across the cores of a single CPU or across a set of inexpensive machines. Parallelism is abstracted away from the programmer, who in turn helps the runtime find speed-up opportunities by programming declaratively.
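As a rough illustration of the two styles, the sketch below computes the same result with a loop and with filter/map. The rainfall figures and variable names are made up for the example.

```javascript
// Hypothetical sample data, used only for illustration.
const rainfallMm = [12, 0, 48, 75, 3];

// Imperative: visit one element at a time and mutate a result array.
const heavyLoop = [];
for (let i = 0; i < rainfallMm.length; i++) {
  if (rainfallMm[i] > 40) {
    heavyLoop.push(rainfallMm[i] / 10); // convert mm to cm
  }
}

// Declarative: describe the operation on the whole batch.
const heavyBatch = rainfallMm
  .filter(mm => mm > 40)
  .map(mm => mm / 10);

console.log(heavyLoop);  // [4.8, 7.5]
console.log(heavyBatch); // [4.8, 7.5]
```

The loop spells out the order of visits; the declarative version leaves that order unspecified, which is what gives an engine room to reorganize the work.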

Mapping

Instead of iterating over a collection one element at a time, map applies a given function to every element of the collection and returns a new collection. An implementation of map may split the collection into parts and distribute them to different cores or machines (JavaScript's built-in Array.prototype.map runs sequentially on a single thread; distributed engines are where the splitting happens). The original collection is not modified. In general, a mapping can be one-to-one, one-to-many, or many-to-one: it is the process of applying a relation (function) that associates an element in the domain with an element in the range. In computing, map does not change the size of the collection: the output has exactly one entry per input element.

E.g., [1,2,-1,-2] => [1,4,1,4] using the squared relation is a many-to-one mapping (both 1 and -1 map to 1)

var numbers = [1, 2, -1, -2];
var squares = numbers.map(n => n ** 2);
console.log(squares); // [1, 4, 1, 4]

E.g., [1,2,-1,-2] => [2,3,0,-1] using the plus-one relation is a one-to-one mapping

var numbers = [1, 2, -1, -2];
var incremented = numbers.map(n => n + 1);
console.log(incremented); // [2, 3, 0, -1]

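E.g., [1,2,-1,-2] => [[2,1],[3,4],[0,1],[-1,4]] using both the plus-one and squared relations is a one-to-many mapping (each element maps to a pair, so the collection still has four entries)

var numbers = [1, 2, -1, -2];
var pairs = numbers.map(n => [n + 1, n ** 2]);
console.log(pairs); // [[2, 1], [3, 4], [0, 1], [-1, 4]]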

E.g., an SQL example of a one-to-one mapping

SELECT Upper(ContactName)
FROM Customers;

Result (first 4 rows):

MARIA ANDERS
ANA TRUJILLO
ANTONIO MORENO
THOMAS HARDY

Filtering

Instead of iterating over a collection one element at a time, filter returns the subset of elements that match a criterion. An implementation of filter may split the collection into parts and distribute them to different cores or machines. The original collection is not modified. Examples:

var numbers = [1, 2, -1, -2];
var positives = numbers.filter(n => n > 0);
console.log(positives); // [1, 2]

SELECT *
FROM Customers
WHERE Country = 'USA';

Number of Records: 13 (first 4 shown)

CustomerID | CustomerName | ContactName | Address | City | PostalCode | Country
32 | Great Lakes Food Market | Howard Snyder | 2732 Baker Blvd. | Eugene | 97403 | USA
36 | Hungry Coyote Import Store | Yoshi Latimer | City Center Plaza 516 Main St. | Elgin | 97827 | USA
43 | Lazy K Kountry Store | John Steel | 12 Orchestra Terrace | Walla Walla | 99362 | USA
45 | Let’s Stop N Shop | Jaime Yorres | 87 Polk St. Suite 5 | San Francisco | 94117 | USA

Reduce

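Instead of iterating over a collection one element at a time, reduce combines all the elements into a single value. An implementation of reduce may split the collection into parts, reduce each part on a different core or machine, and then combine the partial results; this is safe when the combining operation is associative, as addition is. The original collection is not modified. Examples: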

var numbers = [1, 2, -1, -2];
var total = numbers.reduce((sum, n) => sum + n, 0);
console.log(total); // 0

SELECT count(*)
FROM Customers;

Number of Records: 1

count(*)
91
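The idea of splitting a reduce into parts can be sketched as follows. The `chunk` helper is hypothetical, and the chunks here are processed sequentially; the sketch only shows why the partial results combine to the same answer, not real parallel execution.

```javascript
// Hypothetical helper: split an array into chunks of at most `size` elements.
function chunk(items, size) {
  const parts = [];
  for (let i = 0; i < items.length; i += size) {
    parts.push(items.slice(i, i + size));
  }
  return parts;
}

const values = [1, 2, -1, -2, 5, 7];
const add = (sum, v) => sum + v;

// Each chunk could, in principle, be reduced on a different core or machine.
const partialSums = chunk(values, 2).map(part => part.reduce(add, 0));

// Combine the partial results with the same operation.
const chunkedTotal = partialSums.reduce(add, 0);

console.log(partialSums);  // [3, -3, 12]
console.log(chunkedTotal); // 12
```

The chunked total matches a plain `values.reduce(add, 0)` because addition is associative; an operation without that property (subtraction, for instance) could not be split this way.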

Pipelining

When multiple actions need to be performed on the data, it is the norm to pipeline them: the output of each step becomes the input of the next. Examples:

var numbers = [1, 2, -1, -2];
var result = numbers
  .map(n => n + 1)                 // [2, 3, 0, -1]
  .filter(n => n > 0)              // [2, 3]
  .map(n => n ** 2)                // [4, 9]
  .reduce((sum, n) => sum + n, 0); // 13
console.log(result); // 13

SELECT Country, Upper(Country), count(*)
FROM Customers
WHERE Country LIKE 'A%'
GROUP BY Country;

Number of Records: 2

Country | Upper(Country) | count(*)
Argentina | ARGENTINA | 3
Austria | AUSTRIA | 2
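For comparison, a filter/map/reduce pipeline in the spirit of the SQL query above might look like this in JavaScript. The `customers` array is made-up sample data, not the table queried above, so the counts differ.

```javascript
// Hypothetical sample rows, for illustration only.
const customers = [
  { name: "C1", country: "Argentina" },
  { name: "C2", country: "Brazil" },
  { name: "C3", country: "Austria" },
  { name: "C4", country: "Argentina" },
];

const countsByCountry = customers
  .filter(c => c.country.startsWith("A"))    // WHERE Country LIKE 'A%'
  .map(c => c.country.toUpperCase())         // Upper(Country)
  .reduce((counts, country) => {             // GROUP BY + count(*)
    counts[country] = (counts[country] || 0) + 1;
    return counts;
  }, {});

console.log(countsByCountry); // { ARGENTINA: 2, AUSTRIA: 1 }
```

Here the reduce step plays the role of GROUP BY: instead of a single number, the accumulated value is an object with one counter per group.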

Takeaway

Data Engineers use Chanakya thinking to get work done in batches. Even streaming data is processed in mini-batches (windows). Actions on data are pipelined and expressed declaratively. The underlying compiler/interpreter abstracts away parallel computing (single device, multiple devices) from the programmer.
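Mini-batching a stream can be sketched with a generator that groups events into fixed-size windows. The `windows` function and the readings are hypothetical; real streaming engines usually window by time as well as by count.

```javascript
// Group an iterable stream into fixed-size mini-batches (tumbling windows).
function* windows(stream, size) {
  let batch = [];
  for (const event of stream) {
    batch.push(event);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial window
}

// Made-up readings standing in for a stream of events.
const readings = [3, 5, 2, 8, 1, 9, 4];

// Each window is processed as a small batch with reduce.
const windowSums = [...windows(readings, 3)]
  .map(w => w.reduce((sum, v) => sum + v, 0));

console.log(windowSums); // [10, 18, 4]
```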

Think in Batches for Data.

Published by


mallyanitin

A leader! Attracted to creativity and innovation. Inspired by simplicity.
