Data Batching, Streaming and Processing

The IT industry likes to treat data like water. There are clouds, lakes, dams, tanks, streams, enrichments, and filters.

Data Engineers combine Data Streaming and Processing into a term/concept called Stream Processing. If data in the stream are also Events, it is called Event Stream Processing. If data/events in streams are combined to detect patterns, it is called Complex Event Processing. In general, the term Events refers to all data in the stream (i.e., raw data, processed data, periodic data, and non-periodic data).

The examples below help illustrate these concepts:

Water Example:

Let’s say we have a stream of water flowing through our kitchen tap. This process is called water streaming.

We cannot use this water for cooking without first boiling the water to kill bacteria/viruses in the water. So, boiling the water is water processing.

If the user boils the water in a kettle (in small batches), the processing is called Batch Processing. In this case, the water is not instantly usable (drinkable) from the tap.

If an RO (Reverse Osmosis) filtration system is connected to the plumbing line before the water streams out from the tap, it’s water stream processing with filter processors. The water stream output from the filter processors is a new filtered water stream.

A mineral-content quality processor generates a simple quality-control event on the RO filtered water stream (EVENT_LOW_MAGNESIUM_CONTENT). This process is called Event Stream Processing. The mineral-content quality processor is a parallel processor. It tests several samples in a time window from the RO filtered water stream before generating the quality control event. The re-mineralization processor will react to the mineral quality event to Enrich the water. This reactive process is called Event-Driven Architecture. The re-mineralization will generate a new enriched water stream with proper levels of magnesium to prevent hypomagnesemia.

Suppose the water infection-quality control processor detects E. coli bacteria (EVENT_ECOLI), and the water mineral-quality control processor detects low magnesium content (EVENT_LOW_MAGNESIUM_CONTENT). In that case, a water risk processor will generate a complex event, combining the simple events, to publish that the water is unsuitable for drinking (EVENT_UNDRINKABLE_WATER). The tap can decide to shut its water valve in reaction to this event.

Water Streaming and Processing generating complex events

Data Example:

Let’s say we have a stream of images flowing out from our car’s front camera (sensor). This stream is image data streaming.

We cannot use this data for analysis without identifying objects (person, car, signs, roads) in the image data. So, recognizing these objects in image data is image data processing.

If a user analyses these images offline (in small batches), the processing is called Batch Processing. In the case of such eventual batch processing, the image data is not instantly usable, and any events generated from retrospective batch processing come too late to react to.

If an image object detection processor connects to the image stream, it is called image data stream processing. This process creates new image streams with enriched image meta-data.

If a road-quality processor generates a simple quality control event that detects snow (EVENT_SNOW_ON_ROADS), then we have Event Stream Processing. The road-quality processor is a parallel processor. It tests several image samples in a time window from the image data stream before generating the quality control event.

Suppose the ABS (Anti-lock Braking Sub-System) listens to this quality control event and activates itself. In that case, we have an Event-Driven Architecture reacting to Events processed during the Event Stream Processing.

Suppose the road-quality processor generates a snow-on-the-road event (EVENT_SNOW_ON_ROAD), and a speed-data stream generates vehicle speed data every 5 seconds. In that case, an accident risk processor in the car may detect a complex quality control event to flag the possibility of accidents (EVENT_ACCIDENT_RISK). The vehicle's risk processor performs complex event processing on event streams from the road-quality processor and data streams from the speed stream, i.e., by combining (joining) simple events and data in time windows to detect complex patterns.

Data Streaming and Processing generating complex actionable events
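To make the combining (joining) idea concrete, here is a minimal Python sketch of the accident-risk logic described above. The event names, the sliding window size, and the 60 km/h threshold are illustrative assumptions, not a real automotive API.

    from collections import deque
    from statistics import mean

    SPEED_WINDOW_SECONDS = 30        # assumed time window for the join
    RISK_SPEED_KMPH = 60             # assumed threshold, purely illustrative

    speed_window = deque()           # (timestamp, speed) samples from the speed stream

    def on_speed_sample(timestamp, speed_kmph):
        # Keep a sliding time window of recent speed readings.
        speed_window.append((timestamp, speed_kmph))
        while speed_window and timestamp - speed_window[0][0] > SPEED_WINDOW_SECONDS:
            speed_window.popleft()

    def on_road_quality_event(event):
        # Join a simple event with windowed speed data to emit a complex event.
        if event != "EVENT_SNOW_ON_ROAD" or not speed_window:
            return None
        avg_speed = mean(speed for _, speed in speed_window)
        return "EVENT_ACCIDENT_RISK" if avg_speed > RISK_SPEED_KMPH else None

Real stream-processing engines (e.g., Kafka Streams, Spark Structured Streaming) provide windowing and joins as first-class operators; the sketch above only shows the shape of the logic.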

Takeaway Thoughts

As you can see from the examples above, streaming and processing (Stream Processing) is preferred over batching and processing (Batch Processing) because of its ability to generate actionable events in real time.

Data engineers define the data-flow "topology" of data pipelines using a declarative domain-specific language (DSL). Since there are no cycles in the data flow, the pipeline topology is a DAG (Directed Acyclic Graph). The DAG representation helps data engineers visually comprehend the processors (filter, enrich) connected in the stream. With a DAG, the operations team can also effectively monitor the entire data flow and troubleshoot each pipeline.
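As a toy illustration (not any particular DSL), a pipeline topology can be declared as an adjacency list and checked for cycles before deployment; the processor names here are made up.

    # Hypothetical pipeline: each processor lists its downstream processors.
    pipeline_dag = {
        "camera_source":   ["object_detector"],
        "object_detector": ["road_quality", "metadata_sink"],
        "road_quality":    ["risk_processor"],
        "speed_source":    ["risk_processor"],
        "risk_processor":  ["alert_sink"],
        "metadata_sink":   [],
        "alert_sink":      [],
    }

    def is_acyclic(dag):
        # Depth-first search; a back edge means a processor feeds back into itself.
        visiting, done = set(), set()
        def visit(node):
            if node in done:
                return True
            if node in visiting:
                return False         # cycle found
            visiting.add(node)
            ok = all(visit(child) for child in dag.get(node, []))
            visiting.discard(node)
            done.add(node)
            return ok
        return all(visit(node) for node in dag)

    assert is_acyclic(pipeline_dag)  # a valid topology must be a DAG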

Computing leverages parallel processing at all levels. Even with small data, at the hardware level, clusters of ALUs (Arithmetic Logic Units) process data streams in parallel for speed. These SIMD/MIMD (Single/Multiple Instruction, Multiple Data) architectures are the basis for cluster computing, which combines multiple machines to execute work using map-reduce over distributed data sets. The BIG data tools (e.g., Kafka, Spark) have effectively abstracted cluster computing behind common industry languages like SQL, programmatic abstractions (stream, table, map, filter, aggregate, reduce), and declarative definitions like DAGs.
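The programmatic abstractions listed above look roughly like this in plain Python; big-data engines apply the same map/filter/reduce shape across a cluster of machines. The sample readings are made up.

    from functools import reduce

    readings = [72, 68, 180, 75, 190, 66]                 # assumed heart-rate samples (bpm)

    high = filter(lambda bpm: bpm > 100, readings)        # filter
    labeled = map(lambda bpm: ("HIGH", bpm), high)        # map (enrich)
    count = reduce(lambda acc, _: acc + 1, labeled, 0)    # reduce (aggregate)
    print(count)                                          # -> 2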

We will gradually explore big data infrastructure tools and data processing techniques in future blog posts.

Data stream processing is processing data in motion. Processing data in motion helps generate real-time actionable events.

Career

People have different balanced expectations from themselves at work: money, technology, networking, challenging project, innovative project, travel, title, equities, work-life balance, job satisfaction, exposure, health, happiness, social responsibility, job type, startup, strategy, public image, leadership, and many more. At different points in their career, these parameters would be stacked differently, and rightly so.

In this blog post, I am sharing good career advice that I received for my career and growth.

Strive for excellence, use your strengths

The pursuit of excellence is like the pursuit of happiness. The bar keeps going up. It's not the destination but the journey that is meaningful. I had the privilege to work with leaders who would ask me to focus on my strengths – demonstrated and non-demonstrated (i.e., potential). Acknowledge and improve the areas where you are weak, but not at the cost of your strengths.

Never let anybody dent your confidence. Always create value for your customers and business.

Seek out your dream job/role

It’s deliberately reaching out for discomfort. Doing a job that you are good at is great for the employer and not necessarily suitable for you.

You may be good at engineering but seek discomfort in project management. You may be good at project management but seek discomfort in product management. You may be good at product management but seek discomfort in engineering.

Lateral is the way to grow. Lateral is the way up.

Great leaders encourage you to apply to jobs/roles that give you another milestone to cherish in your life. Mediocre managers leverage your strengths only for the current job/role.

Posted job descriptions are never a good representation of the role demands. Always “talk” to the hiring leader.

Specialize or Diversify? Do it well

If you desire to seek specialization – go after it. Specialization in any subject requires you to spend significant time to acquire the competency/skill and practice. Don’t spread yourself too thin; pick an area and go deep.

If you desire to seek diversification – go after it. Diversification will require that you build networks and teams; you rely on others (in your network or group) but remain accountable. Connect, listen (not hear), and act.

If you desire specialization and diversification – go after it. Some people have successfully navigated both.

Some started with diversification and then specialized. Others began with specialization and then diversified. There is no career recipe; make your recipe.

Performance, Image, Exposure

Jobs/Roles demand performance, and growth requires exposure to new people and projects. Volunteer to work on initiatives that give you more exposure (work/life). Volunteering to do more when you have a demanding job/role requires you to stretch, work smarter (manage time), and delegate.

Stress: Circle of Control, Influence, & Concern

Stress is good for growth.

It's widely believed that as you go up the corporate ladder, your circle of control gets bigger. However, the reality is that your circle of influence grows while your circle of control relatively shrinks.

The circles of concern & influence are the primary drivers of "stress" in jobs/roles. It's also critical to understand that your circle of control could be the primary driver of "stress" for your peers and teams.

If you understand your abilities to control or influence the outcome, you don’t need to eat stress for breakfast.

Treat People Like People

Treat others like you would like them to treat you.

Don’t treat people like “resources.” People are not like a CPU with a fixed capacity.

People have infinite capacity, and capacity increases when they are motivated. If they find inspiration, then you get unbounded capacity.

People get burnt out; they need time to relax, and so do you.

Health, Family, Work

If you lose health, you cannot take care of your family or do an excellent job at work. If you lose work, then you cannot support your family. Family is always there for you – in grief and happiness. You can’t afford to lose unconditional family love. So, the priority order has to be – health, family, and work.

Exercise for 45 minutes every day, even if that is a simple morning brisk walk. This “me” time helps you recharge.

Don’t treat your work colleagues like your family. Treat them like your team. Don’t treat your family like your team.

Final Thoughts

Choose to ignore advice that does not make sense, and use your common sense to apply the advice that does. Please don't make people your role models; instead, cherish the actions or ideas that made them role models.

Triage Prioritization

In the last blog post, we talked about balanced prioritization. It’s excellent for big-picture-type things like product roadmaps, new technology introduction, and architecture debt management. The balanced prioritization technique helps build north stars (guiding stars) and focus; however, something else is needed when the rubber hits the road. That something is triage prioritization, where people have to “urgently” prioritize to optimize the whole.

Agile pundits like to draw similarities, and this table is some food for thought:

Types | Other Names A | Other Names B | Other Names C
Balanced Prioritization | Deliberate Prioritization | BIG 'P' Prioritization | "Strategic" Prioritization
Triage Prioritization | Just-in-time (JIT) Prioritization | Small 'p' Prioritization | "Tactical" Prioritization
Synonyms (Kind-of Similar)

The table below talks about some desired properties of the two:

Expected Properties | Balanced Prioritization | Triage Prioritization
Values | Values multiple viewpoints (customer, leadership, architect, team, market) to prioritize. Consensus driven. | Values newly available learnings/insights to prioritize items. Values collaboration (disagree-and-commit) over consensus.
Perspective | Strategic, long-term | Tactical, short-term
Driven-as | Process-first, driven by people | People-first process, driven by people
Time-taken | High. Analysis-biased. It's an involved process to get feedback, consolidate, review, discuss, and reach consensus. | Short. Action-biased. It's okay to get it wrong but wrong to waste time on analysis. Values failure over analysis.
Assumptions | Largely independent of constraints and uncertainties. Weighted-risk-assessment approach. | Thrives in uncertainty. Uses a mindset of maximizing returns – "effort should not get wasted" and "optimize the whole."
Tools | RICE Score, Prioritization Matrix, Eisenhower Matrix | Communication (say, listen); and a Kanban board to track WIP of priority items.
Properties of prioritization

Okay, so the story 🙂 This story is a personal one:

I have spent most of my career digitalizing healthcare. The most memorable part of this journey was the time I spent digitalizing emergency care. Emergency care is a hospital department where patients can walk in without an appointment and are assured that they will not be turned away. So, the department gets patients who cannot afford care and patients who need urgent attention. The emergency department is capacity-constrained and has to serve a variety of workloads. Most days, it's just a single emergency, and then there are days when there is a public health emergency or people are wheeled in from a major accident nearby. The department cannot be staffed to handle the worst crisis but is reasonably staffed to handle the typical workload. Then the worst happens – a significant spike in workload – say, patients coming in from a train accident. Some are alive, some are barely alive, and some are dead on arrival. You cannot use a consensus-driven approach to sort these people into an urgent-important matrix or derive a RICE score for each patient. Emergency departments use a scoring system (ESI) where one person – the "Triage" Nurse – assigns the score, and the team collaborates in the absence of consensus.

I have seen triage nurses shut down the computers and move to the whiteboard. Computers are great for organizing, but computerized organizing is a slow process in a crisis; they need something faster – whiteboards and instant communication. They don't need to know the names of the people ("patient on bed-2" is a good enough name). If someone needs help, they shout, and the person who can help offers it.

Commonly heard statements in an emergency care room during high workload emergency:

“That patient is not likely to survive. The damage is beyond repair. Move on, and stop wasting time on this patient.”

“We have only one ventilator. Use this on that younger fellow and not the elderly. We can save more life-hours.”

"That patient is screaming and in pain but will survive if we don't attend to her for the next 45 minutes. She has a fracture. Focus on this patient; our effort here and now can save her leg. We will get back to the screaming one. Somebody shut her up."

An automated, computerized workflow can't respond well to such emergencies. This triage prioritization is a people-first process driven by people. People can increase their capacity – both individually and as a team – in emergencies. People are not like CPUs; they can work at 150% capacity. Capacity is measured not by the elapsed time of effort but by the outcomes achieved per unit of elapsed time.

At the end of the day, most people will be saved, and there will be unfortunate losses. But triage prioritization values maximizing life-hours (optimizing the whole), not wasting effort, pivoting quickly when effort does not yield results (new insights to prioritize), and an action bias. The responsibility and authority to prioritize rest with the ER care team and stay behind those closed doors: the best tools are collaboration and communication (shout, scream, listen, act). When the emergency is behind them, they will retrospect to improve their behavior in an emergency and request infrastructure that could help them in the future (e.g., that second ventilator could have helped).

Shifting back to software development:

There are parallels in the software team – Tiger team to fix product issues, WAR Room to GO-LIVE, Support Team for 24×7 health of applications in operations during Black Friday, and many more.

While the examples above are like ER teams, triage prioritization also happens in scrum teams executing a well-defined plan. This triage prioritization is hidden from the work tracking boards. When engineers are pair-programming to get something done, many things can go wrong (blind spots) that must be prioritized and fixed. The team cannot fix everything in the sprint duration and carries over some debt. Big Bad debt gets on the boards (some debt remains in code: TODOs). Big Bad debt is prioritized with other big-ticket items using balanced prioritization.

Summary: The outcome quality is directly proportional to triage prioritization – a people-first process driven by people that works best with delegated responsibility and authority.

Some of my friends have argued with me that the story of "mom prioritizing her child's homework" (last post) is triage prioritization and not balanced prioritization. My friends say, "Mom is the 'Triage Nurse' collaborating with the child, and the time taken to prioritize is short, with learnings from execution fed back into the prioritization process." There is a grayscale (balanced-triage), and my only argument against the statement above is that mom is not working in a capacity-constrained environment of uncertainty. If she had to choose between a burning car and her child, the choice is obvious.

Balanced Prioritization

In an (agile) product development team, everybody has a say about the priority of the backlog features. However, only the product owner decides (vetoes). The product owner has to consider the customer’s perspectives, market trends, leadership vision, architecture enablers, technical debt, team autonomy, and most importantly, her biased opinions. She has to use some processes to prioritize and justify the priorities.

Let's move from the corporate dimension to the family dimension. Here's a story that every parent will relate to:

Child: Mom – I have too many things to do. I have got Math, English, Hindi, and Kannada homework to complete. Also, I have to prepare for my music exam before the following Monday. I want to watch the newly released “Frozen” movie – my friends have already watched it. I have a play-date today evening – I can’t skip it; I have promised not to miss it. This is too much to do.

Mom: Hey! I want to watch the “Frozen” movie with you too! Let’s do that after your music exam following Monday? I will book the tickets today.

Child: Ok. Yay!!

Mom: When is your homework due?

Child: English and Kannada are due today. Math and Hindi are due tomorrow.

Mom: Ok, let’s look at the homework. Oh, English and Hindi look simple; you have to fill in the blanks. Kannada is an essay that you have to write about “all homework, and no play makes me sad.” So, you will need to spend some time thinking through that 🙂 Math seems to be several problems to solve; let’s do this piecemeal.

Child: Can I start with Math? I love to solve math problems.

Mom: I know. It's fun to solve math. Let's just get done with the English first – it's due today and simple.

Child: Ok.

<10 minutes later>

Child: Done, Mom! Can I do Hindi? That is simple too.

Mom: Hindi is not due today. It’s easy to get that done tomorrow. Let’s start with your Kannada essay and do some math today.

<30 minutes later>

Child: I have been writing this essay for 30 minutes. It’s boring.

Mom: Ok, solve some math problems then.

<30 minutes later>

Mom: Having fun? Time to complete the Kannada essay; you have to submit it today.

Child: Ok – Grrr.

<30 minutes later>

Child: Done! Phew. I have one more hour before my friend comes. Today’s homework is done. I will loiter around now.

Mom: No, you should practice for your music exam. Only practice makes it perfect. Why don’t you practice for the next one hour, and then play? After your play-date, you can do some more math; you like to do it anyway. However, 8:00 PM is sleep time.

Child: I am tired. Can I loiter around for 15 minutes and then practice music?

Mom: Ok – I will remind you in 15.

<15 minutes later>

Mom: Ready to practice music?

Child: Grr, I was enjoying doing nothing and loitering around.

Mom: Quickly, now finish your music practice before your friend comes. Or. I will ask her to come tomorrow.

Child: Mommy! How can you do that! Ok – I will practice music.

<45 minutes later>

Friend: Hello! Let’s play.

<2 hours later>

Mom: Playtime is over, kids. Have your dinner, and then complete some math; 8:00 PM is sleep time. Remember.

Child: Ok. The playtime was too short.

<completes dinner, completes some more math problems, sleeps>

Mom: Hey, good morning. No school today. It’s Saturday. Finish your remaining homework, and you can play all day. You can start with Hindi or Math. Your choice.

Child: I will do Hindi first. It’s simple. Then math.

<20 minutes later>

Child: Done. Now, I can play, loiter, and do anything?

Mom: Yes, let me know when you want to do one more hour of music practice. We will do that together.

Child: Ok, Mom. Love you!

<After a successful music exam on Monday>

Mom: Let’s watch Frozen.

The story above is balanced prioritization. Mom balances work-play, urgent-important, big-small effort, short-long term, like-dislike, carrot-stick, confidence level, reach, and impact. While she considers priorities, she also allows her child to make some decisions, delegating authority.

Balanced prioritization is an exercise for execution control. There is no use of prioritization without a demand for execution control.

When mom comes to the corporate world, the n-dimensional common sense transforms into the 2-dimensional corporate language: RICE Score, MoSCoW, Eisenhower's Time Management Matrix, and the Prioritization Ranking Matrix.

Top Prioritization Techniques

Balanced prioritization is part of the backlog grooming process and extends to sprint planning, where the team slices the work items to fit in a sprint boundary.

Balanced prioritization is a continuous process.

In our story, mom did not have to deal with capacity constraints. However, in the corporate world, there are capacity constraints that push for more aggressive continuous real-time prioritization. I will share a (healthcare) story in my next blog.

Data Semantics

The real world is uncertain, inconsistent, and incomplete. When people interact with the real world, data from their inbuilt physical sensors (eyes, ears, nose, tongue, skin) and mental sensors (happiness, guilt, fear, anger, curiosity, ignorance, and many more) get magically converted into insights and actions. Even if these insights are shared & common, the actions may vary across people.

Example: When a child begs for money on the streets, some people choose to give money, others prefer to ignore the child, and some others decide to scold the child. Each of these people brings a personal, biased context to the fact that it's a child begging for money (or food), and their actions result from this context.

The people who give money claim that they feel sorry for the child, and that parting with a little money won't hurt them and will help the child eat. The people who don't give money argue that giving cash would encourage more begging and that a mafia runs it. Some people may genuinely have no money, and others expect governments (or NGOs) to step up.

Switching context to the technology world:

With IoT, Cloud, and BIG Data technologies, everybody wants to collect data, get insights, and convert these insights into actions for business profitability. This computerized data and workflow automation system approximates an uncertain, inconsistent, and incomplete real-world setup. Call this IoT, Asset Performance Management (APM), or a Digital Twin; the data to insights to actions process is biased. Automating a biased process is a hard problem to solve.

It's biased because of the semantics of the facts involved in the process.

semantics [ si-man-tiks ]

“The meaning, or interpretation of the meaning, of a fact or concept”

Semantics is to humans as syntax is to machines. So, a human in the loop is critical to manage semantics. AI is changing the human-in-the-loop landscape but comes with learning bias.

Let’s try some syntactic sugar to decipher semantics.

Data Semantics, Data Application Semantics

Data semantics is all about the meaning, use, and context of Data.

Data Application Semantics is all about the meaning, use, and context of Data and application agreements (contracts, methods).

Sounds simple? Not to me. I had to read that several times!

Let’s dive in with some examples:

Example A: A Data engineer claims: “My Data is structured (quantifiable with schema). So, AI/BI applications can use my Data to generate insights & actions”. Not always true. Mostly not.

Imagine a data structure that captures the medical “heart rate” observation. The structure may look like {“rate”: 180, “units”: ‘bpm’} with a schema that defines the relevant constraints (i.e., the rate is a mandatory field and must be a number >=0 expressed as beats per minute).

An arrhythmia detection algorithm analyzing this data structure might send out an urgent alarm – "HELP! Tachycardia" – and dial 102 to call an ambulance. The ambulance arrives to find that the person was running on a treadmill, causing the high heart rate. This Data is structured but "incomplete" for analysis. The arrhythmia detection algorithm will need more context than the rate and units to raise the alarm. It will need context to "qualify" and "interpret" the "heart-rate" values. Some contextual data elements could be (a sketch of an enriched record follows this list):

  1. Person Activity: {sleeping, active, very active}
  2. Person Type: {fetal, adult}
  3. Person's age: 0-100
  4. Measurement location: {wrist, neck, brachial, groin, behind-knee, foot, abdomen}
  5. Measurement type: {ECG, Oscillometry, Phonocardiography, Photoplethysmography}
  6. Medications-in-use: {Atropine, OTC decongestants, Azithromycin, …}
  7. Location: {ICU, HDU, ER, Home, Ambulance, Other}
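
Below is a minimal sketch of what such an "enriched" heart-rate observation might look like, with a naive qualifying check. The field names and the "very active" rule are assumptions for illustration only, not clinical logic.

    observation = {
        "rate": 180,
        "units": "bpm",
        # qualifying context (hypothetical field names)
        "person_activity": "very active",    # {sleeping, active, very active}
        "person_type": "adult",              # {fetal, adult}
        "person_age": 42,
        "measurement_type": "Photoplethysmography",
    }

    def should_alarm(obs):
        # Toy rule: only alarm on a high rate when the person is not exercising.
        if obs["rate"] > 120 and obs["person_activity"] != "very active":
            return "HELP! Tachycardia"
        return None

    print(should_alarm(observation))         # -> None: treadmill, not arrhythmia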

Let’s look at this example from the “semantics” definition above:

  1. The meaning of “heart rate” is abstract but consistently understood as heart contractions per minute.
  2. The “heart rate” observation is used to build an arrhythmia detection application.
  3. Additional data context required to interpret “heart-rate” is Activity, Person Type, Person Age, Measurement Location, Measurement Type, Medications-in-use, and Location. This qualifying context is use-specific. An application to identify the average heart-rate range in a population by age intervals needs only the Person Age.
  4. The algorithm's agreement (contract = what?) is to "detect arrhythmias and call an ambulance when ER care is needed."
  5. The algorithm’s agreement (method = how?) is not well defined. A competing algorithm may use AI to make a better prediction to avoid false alarms. This is similar to our beggar child analogy, where the method of the people to derive insight differed, resulting in different actions.

Example B: Another familiar analogy to help understand “meaning,” “use,” “context,” and “agreement” is to look at food cooking recipes. Almost all these recipes have the statement “add salt to taste.”

  • The meaning of “Salt” is abstract but consistent. Salt is not sugar! 🙂 It’s salt.
  • Salt is used to make the food tasty.
  • Additional data context required to interpret "Salt" is the salt type {pink salt, black salt, rock salt, sea salt}, the user's salt tolerance level {bland, medium, saltier}, the user's BP, and the user's continent {Americas, Asia, Europe, Africa, Australia}.
  • The agreement (contract = what?) is to “Add Salt.”
  • The agreement (method = how?) is not well defined. Depending on the chef, she may have a salt-type preference and may deviate from the average salt tolerance levels. For good business reasons, she may add less salt than her own tolerance level and serve extra salt on the side so the customer can adjust the food taste to the customer's salt tolerance level.

In computerized systems, physical-digital data modeling can achieve data semantics (meaning, use, context). It's much harder to achieve data application semantics (data semantics + agreements). Data interpretation is subject to the method and its associated bias.

So, to interpret data, there must be a human in the loop. Not all people infer equally. Thus, semantics leads to variation in insights. Variation in insights leads to variation in actions.

Diving into Context – It’s more than Qualifiers

Alright, I want to peel the "context" onion further. Earlier, we said that "context" is used to "qualify" the data. There is another type of context that "modifies" the data.

Let’s go back to our arrhythmia detection algorithm (Example A). We have not captured and sent any information about the patient’s diagnosis to the algorithm. The algorithm does not know whether the high heart rate is due to Supra-ventricular Tachycardia (electric circuit anomaly in the heart), Ventricular Tachycardia (damaged heart muscle and scar tissue), or food allergies. SVT might not require an ER visit, while VT and food allergies require an ER visit. Let’s say our data engineers capture this qualifying information as additional context:

{prior-diagnosis: [], known-allergies:[]}

Great. We have qualifying context. So, what does prior-diagnosis = [] mean? That the patient does not have SVT or VT? No, not true. It means that the doctors have not tested the patient for the condition or have not documented a negative test result in the data system. It doesn't mean that the patient has neither SVT nor VT. So, we are back to square one. Now, let's say that we have a documented prior diagnosis:

{prior-diagnosis: [VT], known-allergies: []}

Ok, even with this Data, we cannot confirm that VT causes a high heart rate. It could be due to undocumented/untested food allergies or yet undiagnosed SVT. This scenario calls for data “modifiers.”

{prior-diagnosis-confirmed: [VT], prior-diagnosis-excluded: [SVT], known-allergies-confirmed: [pollen, dust], known-allergies-excluded: [food-peanuts]}

The structure above has more "semantic" sugar. The prior-diagnosis-excluded: [SVT] field acts as a "NOT" modifier on "diagnosis." This modifier helps safely rule out SVT as a cause.
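A hypothetical sketch of how an algorithm might use such a "NOT" modifier: only causes that were explicitly excluded are ruled out, and everything else stays uncertain. The candidate list and the peanut rule are made up for illustration.

    context = {
        "prior-diagnosis-confirmed": ["VT"],
        "prior-diagnosis-excluded": ["SVT"],
        "known-allergies-confirmed": ["pollen", "dust"],
        "known-allergies-excluded": ["food-peanuts"],
    }

    def possible_causes(ctx):
        # Absence of data is not absence of disease; only explicit exclusions are ruled out.
        candidates = {"VT", "SVT", "food-allergy"}
        excluded = set(ctx.get("prior-diagnosis-excluded", []))
        if "food-peanuts" in ctx.get("known-allergies-excluded", []):
            excluded.add("food-allergy")     # toy rule: peanuts were the only food suspect
        return candidates - excluded

    print(possible_causes(context))          # -> {'VT'}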

Summary

Going from data to insights to actions is challenging due to “data semantics” and “data application semantics.”

Modeling all relationships between real-world objects and capturing context mitigates "data semantics" issues. Context is always use-specific. The context may still have "gaps," and inferring from data with context gaps leads to poor-quality insights.

“Data application semantics” is a more challenging problem to solve.

The context must "qualify" the data and "modify" the qualifiers to improve data semantics. This context "completeness" requires collecting good-quality data at the source. More often than not, a human data analyst has to go back to the data source for more context.

When technology visionaries say “We bring the physical and digital together” in the IT industry, they are trying to solve the data semantics problem.

For those in healthcare, the words "meaning" and "use" will bring to mind the US government's "meaningful use" initiative and the shift to a merit-based incentive payment system. To achieve merit-based incentives, the government must ensure that the data captured has meaning, use, and context. The method (= how) used by the care provider to achieve the outcome is important but secondary. This initiative also serves as a recognition that data application semantics are HARD.

Enough said! Rest.

Data Measurement Scale and Composition

In the parent blog post, we talked about data terms: "Structured, Unstructured, Semi-structured, Sequences, Time-series, Panel, Image, Text, Audio, Discrete, Categorical, Numerical, Nominal, Ordinal, Continuous and Interval"; let's peel this onion.

Some comments that I hear from engineers/architects:

"My Data is structured. So, it's computable." – Not true. Structure does not mean that Data is computable. In general, computable applies to functions; when used in the context of data, it means quantifiable (measurable). Structured data may contain non-quantifiable data types like text strings.

"All my Data is stored in a database. It's structured data because I can run SQL queries on this data." – Not always true. Databases can store BLOB-type columns containing structured, semi-structured, and unstructured data that SQL cannot always query.

"Data lakes store unstructured data, and this data is transformed into structured data in data warehouses." – Not really. Data lakes can contain structured data. Data pipelines extract, transform, and load data into data warehouses. The data warehouse is optimized for multi-dimensional data queries and analysis. The inability to run queries in a data lake does not imply that the data in the lake has no structure.

Ok, it’s not as simple as it appears on the surface. Before we define the terms, let’s look at some examples.

Example A: The data below can be classified as structured because it has a schema. The weight sub-structure is quantifiable. "Weight-value" is a numerical and continuous data type, and "weight-units" is a categorical and nominal data type.

name | weight-value | weight-units
Nitin | 79.85 | KG
Example A: Panel Data

Field | Mandatory | Data Type | Constraints
name | Yes | String | Not Null; Length < 100 chars
weight-value | Yes | Float | > 0
weight-units | Yes | Enum | {KG, LBS}
Example A: Schema & Constraints

Example B: The data below can be classified as semi-structured because it has a structure but no schema or constraints. Some schema elements can be derived, but the consumer is at the mercy of the producer. The value of weight can be found in the "weight-value", "weight", or "weight-val" fields. Given the sample, the consumer can infer that the value is always a numerical and continuous data type (i.e., float). The vendor of the weighing machine may decide to have their name captured optionally. The consumer will also have to transform "Kgs," "KG," and "Kilograms" into a common value before analyzing the data (see the normalization sketch after the sample data below).

Data Instance A:
{ "name": "Nitin", "weight-units": "Kgs", "weight-value": 79.85, "vendor": "Apollo" }

Data Instance B:
{ "name": "Nitin", "weight-units": "KG", "weight": 79.85, "vendor-name": "Fitbit" }

Data Instance C:
{ "name": "Nitin", "weight-units": "Kilograms", "weight-val": 79.85, "measured-at": "14/08/2021" }

Example B: JSON Data
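A small normalization sketch for the semi-structured instances above; the field aliases and unit spellings come from the sample, and everything else is an assumption.

    WEIGHT_FIELDS = ("weight-value", "weight", "weight-val")   # producer aliases seen in the sample
    KG_SPELLINGS = {"kgs", "kg", "kilograms"}

    def normalize(instance):
        # Map the producer's ad-hoc fields onto a fixed {name, weight-value, weight-units} schema.
        value = next(instance[f] for f in WEIGHT_FIELDS if f in instance)
        units = instance.get("weight-units", "")
        units = "KG" if units.lower() in KG_SPELLINGS else units
        return {"name": instance["name"], "weight-value": float(value), "weight-units": units}

    print(normalize({"name": "Nitin", "weight-units": "Kilograms", "weight-val": 79.85}))
    # -> {'name': 'Nitin', 'weight-value': 79.85, 'weight-units': 'KG'}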

Example C: A JPEG file stored on a disk can be classified as structured data. Though the file is stored as binary, there is a well-defined structure (see the table below). This Data is "structured," but the image data (SOS-EOI) is not "quantifiable" and is loosely termed "unstructured." With the advance of AI/ML, several quantifiable features can be extracted from image data, further pushing this popular unstructured data into the semi-structured data space.

JFIF file structure
Segment | Code | Description
SOI | FF D8 | Start of Image
JFIF-APP0 | FF E0 s1 s2 4A 46 49 46 00 ... | see below
JFXX-APP0 | FF E0 s1 s2 4A 46 58 58 00 ... | optional
... additional marker segments
SOS | FF DA | Start of Scan
(compressed image data)
EOI | FF D9 | End of Image
Example C: JPEG Structure (courtesy: Wikipedia)

Example D: The text below can be classified as "Unstructured Sequence" data. The English language does define a schema (a constraining grammar); however, quantifying this type of data for computing is not easy. Machine learning models can extract quantifiable features from text data. In modern SQL, machine learning is integrated into queries to extract information from "unstructured" data.

I must not fear. Fear is the mind-killer. Fear is the little death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past, I will turn the inner eye to see its path. Where the fear has gone, there will be nothing. Only I will remain.

So, the lines are not straight 🙂 Given this dilemma, let’s further define these terms, with more examples:

Quantifiable Data is measurable Data. Computing is easy on measurable data. There are two different types of measurable data – Numerical and Categorical. Numerical Data types are quantitative, and categorical Data types are qualitative.

Numerical data types can be either discrete or continuous. "People-count" cannot be 35.3, so this data type is discrete. "Weight-value" is always approximated, to 79.85 (instead of 79.85125431333211), and hence this data type is continuous.

Categorical data types can be either ordinal or nominal. In "Star-rating," a value of 4 is higher than 3. Hence, the "star rating" data is ordinal, as there is an established order among ratings. The quantitative difference between ratings is not necessarily equal. There is no order among "cat," "dog," and "fish"; hence, "Home Animals" is a nominal data type.

Parent Category | Child Category | Example
Numerical | Discrete | { "people-count": 35 }
Numerical | Continuous | { "weight-value": 79.85 }
Categorical | Ordinal | 5 STAR RATING = {1,2,3,4,5}; {"star-rating": "3"}
Categorical | Nominal | Home Animals = {"cat", "dog", "fish"}; {"animal-type": "cat"}
Quantifiable Data (Quantitative and Qualitative)

Non-Quantifiable Data is Data where the measurable features are implicit. The Data is rich in features, but the features need to be extracted for analysis. AI is changing the game of feature extraction and pattern-recognition for such data. The three well-known examples in this category are Images, Text, and Audio. The latter two (Text and audio) are domains of natural language processing (NLP), while Images are the domain of computer vision (CV).

Quantifiable and Non-Quantifiable data can be composed together into structures. The composition may be given a name. Example: A “person” is a composite data type of quantifiable (i.e., weight) and non-quantifiable (i.e., name) data types.

When data is composed with a schema, it is loosely called “structured” data. Any composition without a schema is loosely called “semi-structured” data.

Structured or semi-structured data with non-quantifiable fields is called unstructured data. In this spirit, Example C is unstructured. Also, in this spirit, the quote about data lakes storing "unstructured" data is true. The data might have a structure with a schema but cannot be queried in place without being loaded into a data cube in the warehouse. The lines blur with modern lakehouses that query data in place at scale.

Data can also be composed into "Collection" data types. Sets, Maps, Lists, and Arrays are examples of collections. "Collections" with item order and homogeneity are called sequences. Movies and text are sequences (arrays of images and words). All data in a sequence usually comes from the same source.

Sequences ordered by time are called time-series sequences, or just time-series for short. Sensor data that is generated periodically with a timestamp is an example of time-series data. Time-series data have properties like trend and seasonality that we will explore in a separate blog post. A camera (sensor) generates a time-series image data feed with a timestamp on every image. This feed is a time-series sequence.

Some visuals will help to clarify this further:

Data by Measurement Class
Data by Structure (composition) Class

JSON and XML are data representations that come with or without schema. It’s incorrect to call all JSON documents semi-structured, as they might originate from a producer that uses well-defined schema and data typing rules.

Data Compositions

Hope this post helps you understand the "data measurement and composition vocabulary." You can be strict or loose about classifying data by structure based on context; however, it's critical to understand the measurement class.

Only measurable data generates insights.

After all that rant, let's try to decipher the "logs" data type.

  1. “Performance Event Logs” generated by an application with fixed fields like {id: number, generated-by: application-name, method: method-name, time: time-since-epoch} is composed of quantifiable fields and constrained schema. So, it belongs to the “Structured Data” class.
  2. “Troubleshooting Logs” generated by an application with fields like {id: number, generated-by: application-name, timestamp: Date-time, log-type: <warn, info, error, debug>, text: BLOB, +application-specific-fields} is composed of quantifiable and non-quantifiable fields, without a constraining schema. Some applications may add additional fields like – API name, session-id, and user-id. Strictly, this is “unstructured” data due to the BLOB but loosely called “semi-structured” data.

Measurement-based data type classification and composition of types into structures do not convey semantics. We will cover semantics in our next blog!

Agile: “Teaming” + “Collaboration”

How do we improve "collaboration" in an agile "team"? Collaboration is a critical ingredient in the pursuit of excellence.

“Agile is for the young who can sprint. What’s the minimum age to apply for a manager? Managers don’t need to sprint.” – Software Engineer

"I need my personal space and can't work in a collaborative space. I need to hide sometimes. Managers are in a cabin; I want to be isolated too." – Software Engineer

“No pair programming, please. I am most productive when I am alone. I listen to my music and code. I will follow the coding guidelines; just let me be with my code. I will work all night and get it done with great quality – promise. Can I WFH?” – Software Engineer

“You must review your code before check-in. Peer review is like brushing your teeth in the morning. It’s hygiene. Do you brush your teeth every day? Like it or not – just like brushing teeth – you have to peer-review your code before check-in” – Coach.

“You don’t go to the gym to only run on a treadmill for cardio. You have to train your back with pullups, chest with dumbbell incline press, shoulders with machine shoulder press, and biceps with dumbbell bicep curls. Whole-body training gives the best results. In the same way, in the agile gym, you have to practice pair-programming, team retrospectives, team backlog grooming, peer-review, and many more. It’s a team sport. We will start you with pair-programming and then gradually introduce other practices” – Coach.

The software engineers' perspective is correct: they have been nurtured by the system to be competitive. They did not work in pairs to clear their engineering exams. Study groups were just boredom killers. They compete with other engineers for jobs. They are where they are because of their "individualism" and not their "collaboration" skills. And the ask now is to unlearn all of that and "collaborate." It's a value conflict.

The coaches' perspective is also correct: great things have been achieved through collaboration. Everything from "hunting for food" in the cave days to "winning a football (soccer) game" requires intense collaboration.

In sports teams – say, cricket teams – some basic instincts kick in to drive collaboration. People quickly self-organize into batters, bowlers, a wicket-keeper, and a captain. Batters collaborate to seek "runs." Everybody gives feedback to the bowlers. The team claps when anybody fields the ball. They hug and scream. Emotions flow. They celebrate each other – it's the team and not the individual. And when this same team comes back to their desks to work, the emotions stop, and they go back to complaining about pair programming (two batters running) and peer review (everybody giving feedback to the bowlers).

It's not about process, practices, and tools. It's about people. In a team context, there is an identity loss for the individual. It's a mistake to celebrate only the team and overlook individualism. "Collaboration" shines when "Individualism" is honored. While there is a cup for the team, there is also a man-of-the-match (or woman-of-the-match). So, it has to be "Teaming," "Collaboration," and "Individualism."

It's not about leveraging digital collaboration tools. It's about allowing human emotions to flow in the work context, using gaming to simulate a sports environment. Example: leaderboards for adopting an agile practice, leaderboards for competency development, leaderboards for customer NPS, and badges for engineer-of-the-team help pump the adrenaline. The gaming process should not be too liberal with rewards and should allow good/bad/ugly emotions to flow. Mix Western-style gaming by rules with Eastern-style gaming by shame. Example: in football (soccer),

  1. Western-style: The yellow card to warn a player is gaming by rules.
  2. Eastern-style: The coach pulling out a non-performing player from the field is gaming by shame.

Agile Manifesto: "Individuals and interactions over processes and tools"

For engineers: Don't treat your team at work like your family. Treat them as your sports team. If you are handicapped for life, the people looking after you are your family, not your team. The team will extend financial/emotional support but cannot replace family. The value systems are different.

For coaches: Use gaming. Use agile. Not just the process. People before process.

Digital Career in Technology

“I am in software, so I am digital,” claims a software engineer. It’s a fallacy.

“What is Digital?” Based upon their experience, the audience may answer as software, computers, workflow, automation, social media, or agile.

"I have converted the paper workflow into a form on the computer. We are now paperless. Everything is digitized and neatly stored in the database libraries." – Software Product Leader. Behind the scenes, a user is complaining – "This computerized documentation is slower than paper."

Digitizing is not Digitalizing

“I have applied for a job in a Digital company. They are into IoT, Cloud, and BIG Data. After I joined, it’s no different than any other software company. It’s just fixing bugs in somebody else’s code! and long working hours.”

Digital is not about software

So, here’s my opinion – Digital is about customer experience and consumerism. Technology and Software (Computers, software, agile, AI/ML, workflow, etc.) play a role, but they are just a means to an end.

“Customer” <<>> “Experience” <<>> “Consumerism”

Story of Mrs. Jane Doe going Digital

Mrs. Jane Doe loves cooking. She believes she makes the best chocolate cookies in the world. She wants to go digital – she wants more people to experience and consume her cookies. A good business is a growing business. A friend tells her about “anything-you-want.com,” where there are a million users registered. She can publish her cookies and her location; the platform has delivery partners that will deliver her cookies anywhere in the world. So, she has only to make yummy cookies and not hire expensive and cranky software engineers!

The results were excellent; there were 100 delivery requests on day 1. Mrs. Jane Doe improved the consumption of her service (cookies). When she went back to the site, 60 users had rated her cookies 5-star, 30 users had rated them 4-star, and 10 users had rated them 2-star with comments like "Too sweet and sugary. Avoid."

Hmmm, more consumption means more feedback (experience ranges from good to bad to ugly). So, in her next iteration, she added customization to her cookies so customers could request reduced sweetness. She observed the next 100 orders, and it looked like she had made an incremental improvement: 65 orders rated 5-star and 35 rated 4-star. However, she had no objective or subjective data to improve her cookies further; there were no comments at all.

She had an idea: she published a discount coupon code with the next orders; the coupon code would be activated only after a feedback comment. After this incremental change, she observed the next 100 orders, and voila! She had comments (at the cost of giving free cookies). The comments ranged from "boring package" and "expected more cookies in the package" to "same-looking and same-tasting cookies without variety," "too hard for my teeth," and "too mushy and melting." She was now armed with feedback from poor experiences and ready to make more changes. She was determined to improve her rating! A higher rating means more orders!! So, it's experience and consumerism. Digital is cool.

Digital is a new way of doing business. Well, it’s the old way that is packaged in a new way, with “technology” as an enabler and accelerator.

So, how does this relate to a career in technology?

Modern technology is architected as a set of services. It’s paramount that the consumption and experience of the service are measured. Measurement and feedback improve the service. Feedback could be defects or improvement opportunities, and addressing them enhances the experience and consumption of the service. Collect data about consumption and experience – logs, click-streams, and user feedback circles. Analyze data to improve the service quality attributes – functionality, reliability, scalability, etc. It’s a digital pursuit to improve a service experience and consumption.

This continuous improvement mindset drives digital. The user/customer is at the center, not technology. Technology is applied to improve the services. Don't just hear the users; listen to their feedback. If the user is a critic, you are lucky. It's an opportunity to improve. Whether you are building a platform or an application, it's a service with a user/customer that uses the service. Move away from software to service.

It’s a digital economy powered by services. Digital is customer experience and consumerism.

While striving for technology expertise/excellence, focus on users/customers. You can then add “digital” to your CV.

Data Representation

Data Representation is a complex subject. People have built their careers in data representations, and some have lost their sleep. While the parent post refers to binary and non-binary data, the subject of data representations is more complex for a single blog post. If you are as old as me and lived through the data representation standardization, you will understand. If you are a millennial, you can reap the benefits of painful standardization of data structures. Semantic Data is still open for standardization.

What is Data?

Data is a collection of datum values. Datum is the singular, and data is the plural. In computer parlance, "data" is also widely (loosely) used as a singular.

A datum is a single piece of information (a single fact, a starting point for measurement): a character, a quantity, or a symbol on which computer operations (add, multiply, divide, reverse, flip) are applied. E.g., the character 'H' is a datum, and the string "Hello World" is data composed of individual character datums.

From now on, we will refer to both 'H' and 'Hello World' as data.

What are Data Types?

Data types are attributes of data that tell the computer how the programmer intends to use the data. E.g., if the data type is a number, the programmer can add, multiply, and divide the data. If the data type is a character, then the programmer can compose the characters into strings. The operations add, multiply, and divide do not apply to characters.

Computers need to store, compute, and transfer different types of data.

The table below illustrates some common basic and composite data types:

Data Types | Examples
Characters and Symbols | 'A', 'a', '$', '', 'छ'
Digits | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
Integers (signed and unsigned) | -24, -191, 533, 322
Boolean (Binary) | [True, False], [1, 0]
Floats (single precision) | -4.5f
Doubles (double precision) | -4.5d
Composite: Complex (Imaginary) Numbers | a + b*i
Composite: Strings | "Heaven on Earth"
Composite: Arrays of Strings | ['Heaven', 'on', 'Earth']
Composite: Maps (key-value) | {'sad': ':(', 'happy': ':)'}
Composite: Decimal (Fraction) | 22 / 7
Composite: Enumerations | [Violet, Indigo, Blue, Green, Yellow, Orange, Red]
Table of Sample Data Types

What are Data Representations?

Logically, computers represent a datum by mapping it to a unique number, and data as a sequence of such numbers. This representation makes computing consistent – everything is a number. For characters, this standardized mapping is called "Unicode."

Example | Number (Unicode code point) | HTML | Comments
'A' | U+0041 | &#65 | 0x41 = 0d65
'a' | U+0061 | &#97 | 0x61 = 0d97
8 | U+0038 | &#56 | 0x38 = 0d56
'ह' | U+0939 | &#2361 | 0x939 = 0d2361
Sample Mapping Table
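You can check these mappings yourself in Python: ord() returns the code point, and chr() goes the other way.

    print(hex(ord("A")))     # 0x41 -> U+0041
    print(ord("a"))          # 97   -> U+0061
    print(hex(ord("ह")))     # 0x939 -> U+0939
    print(chr(0x41))         # 'A'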

The numbers can themselves be represented in the base of 2, 8, 10, or 16. The human-readable number is base-10, whereas base-2 (binary), base-8 (octal), and base-16 (hexadecimal) are the other standard base systems. The Unicode code points (mappings) above are represented in hexadecimal.

Base-10 | Base-2 | Base-8 | Base-16
0d25 (2*10 + 5) | 0b11001 (1*2^4 + 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0) | 031 (3*8^1 + 1*8^0) | 0x19 (1*16^1 + 9*16^0)
Base Conversion Table
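The same conversions in Python, using the built-in literal prefixes and helpers (note that Python spells the octal prefix 0o rather than a bare leading zero):

    n = 25
    print(bin(n), oct(n), hex(n))    # 0b11001 0o31 0x19
    print(0b11001, 0o31, 0x19)       # 25 25 25 -- all the same number
    print(int("19", 16))             # 25, parsing a base-16 string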

Computers use base-2 (binary) to store, compute, and transfer data. Computers use base-2 because the electronic gates that make up the computers use binary inputs. Each storage cell in memory can store "one bit," i.e., either a '0' or a '1'. A group of 8 bits is a byte. The Arithmetic Logic Unit (ALU) uses a combination of AND, OR, NAND, XOR, NOR gates for mathematical operations (add, subtract, multiply, divide) on the binary (base-2) representation of numbers. In modern memory systems (SSDs), each storage cell can store more than one bit of information. These are called MLCs (Multi-level cells). E.g., TLCs store 3 bits of information – or 8 (2^3) stable states. This MLC design helps to build fast, big, and cheap storage.

Historically, there have been many different character sets, e.g., ASCII for English and Windows-1252 (expanded ASCII) used by Windows 95 systems to represent new characters and symbols. However, modern computers use the Unicode character set for (structural) interoperability between computer systems. The current Unicode (v13) character set has 143,859 unique code points and can expand to 1,114,112 unique code points.

While all the characters in a character set can be mapped to numbers, floating-point numbers (floats, doubles) are represented in computers differently. They are represented as a composite of a sign, a mantissa (significand), and an exponent:

± mantissa * 2^exponent

Decimal | Binary | Comment
1.5 | 1.1 | 1*2^0 + 1*2^-1
33.25 | 100001.01 | 1*2^5 + 0*2^4 + 0*2^3 + 0*2^2 + 0*2^1 + 1*2^0 + 0*2^-1 + 1*2^-2
Binary Representation of Decimal Numbers

The example below shows how 33.25 is converted to a float (single precision) representation – 1 sign bit, 8 exponent bits, 23 mantissa bits:

Convert 33.25 to binary | 100001.01
Normalized form [(-1)^s * mantissa * 2^exponent] | (-1)^0 * 1.0000101 * 2^5
Convert the exponent using biased notation; represent the decimal as binary | 5 + 127 = 132 (decimal) = 1000 0100 (binary)
Normalize the mantissa; adjust to 23 bits by padding 0s | 000 0101 0000 0000 0000 0000
Represent the 4 bytes (32 bits) | 0100 0010 0000 0101 0000 0000 0000 0000
Floats (single precision) represented in 4 bytes
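Python's struct module can confirm the hand conversion above; '>f' packs a big-endian single-precision float.

    import struct

    packed = struct.pack(">f", 33.25)            # single precision, big-endian
    print(packed.hex())                          # 42050000
    print(f"{int(packed.hex(), 16):032b}")       # 01000010000001010000000000000000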

Some scientific computing requires double precision to handle the underflow/overflow issues of single precision. Double precision (64 bits) uses 1 sign bit, 11 exponent bits, and 52 mantissa bits. There are also long doubles that store 128 bits of information. The arithmetic operations (add, multiply) in the electronics are simplified by this binary representation.

Despite great computer precision, some software manages decimals as two separate multi-byte integer fields (numerator and denominator, or the digits before and after the decimal point). These are called "Fraction" or "Decimal" data types and are usually used to store "money," where precision loss is unacceptable (i.e., 20.20 USD is 20 dollars and 20 cents, not 20 dollars and 0.199999999999 dollars).
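Python ships exact types for exactly this reason; a quick sketch of the difference:

    from decimal import Decimal
    from fractions import Fraction

    print(0.1 + 0.2)                             # 0.30000000000000004 (binary float rounding)
    print(Decimal("20.10") + Decimal("0.10"))    # 20.20 exactly -- suitable for money
    print(Fraction(22, 7))                       # 22/7 kept as numerator/denominator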

What is Data Encoding?

Encoding is the process of converting data – represented as a sequence of numbers from the character-set mapping – into bits and bytes. The encoding process can be fixed-width or variable-width and is used to store and transfer data. Base64 encoding uses a fixed-width (8-bit) encoding that represents data with 64 ASCII characters (A-Z, a-z, 0-9, special characters). UTF-8 uses a variable-width (1-4 byte) encoding to represent the Unicode character set.

Text | Base64 | UTF-8
earth | ZWFydGg= | 01100101 01100001 01110010 01110100 01101000
éarth | w6lhcnRo | 11000011 10101001 01100001 01110010 01110100 01101000

Base64 encoding resulted in fixed-length representations, and UTF-8 resulted in variable-length representations. UTF-8 optimizes for the ASCII character set and adds additional bytes for other code points. The character 'é' is encoded into two bytes (11000011 10101001). This variable-length encoded sequence can be decoded because there is no conflict during the decoding process.

Base64 is usually used to convert binary data for media-safe transfer. E.g., a modem/printer would interpret binary data differently (sometimes as control commands), so Base64 encoding is used to convert the data into ASCII to be media-safe. The data is still transferred as binary; however, since the bytes are ASCII (a limited subset of binary), the media/printer is not confused. If you observe, Base64 has increased the number of bytes after encoding: "earth" (5 bytes) is encoded as "ZWFydGg=" (8 bytes). The data is decoded back to binary at the receiver's end. The example below shows the process:

1 | earth (40 bits)                                                                  | 01100101 01100001 01110010 01110100 01101000
2 | Pad with 0s so the bit count is a multiple of 6 at a byte boundary (48 bits = 6 bytes) | 01100101 01100001 01110010 01110100 01101000 00000000
3 | Regroup into 6-bit groups                                                         | 011001 010110 000101 110010 011101 000110 100000 000000
4 | Map each 6-bit group to text using the Base64 table (see Wikipedia for the Base64 map) | ZWFydGg=
5 | Convert back to binary (ASCII bytes) to store or transfer                         | 01011010 01010111 01000110 01111001 01100100 01000111 01100111 00111101
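The same steps can be traced by hand in code. The sketch below is only an illustration of the buffering and regrouping logic; in practice java.util.Base64 does this for you:

```java
import java.nio.charset.StandardCharsets;

public class Base64ByHand {
    private static final String TABLE =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    public static void main(String[] args) {
        byte[] input = "earth".getBytes(StandardCharsets.US_ASCII); // 5 bytes = 40 bits

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length; i += 3) {
            // Steps 1-2: buffer up to 3 bytes (24 bits), padding missing bytes with 0s.
            int b0 = input[i] & 0xFF;
            int b1 = (i + 1 < input.length) ? input[i + 1] & 0xFF : 0;
            int b2 = (i + 2 < input.length) ? input[i + 2] & 0xFF : 0;
            int chunk = (b0 << 16) | (b1 << 8) | b2;

            // Steps 3-4: regroup into four 6-bit values and map them via the Base64 table.
            out.append(TABLE.charAt((chunk >> 18) & 0x3F));
            out.append(TABLE.charAt((chunk >> 12) & 0x3F));
            // Groups that consist only of padding bits become '='.
            out.append(i + 1 < input.length ? TABLE.charAt((chunk >> 6) & 0x3F) : '=');
            out.append(i + 2 < input.length ? TABLE.charAt(chunk & 0x3F) : '=');
        }
        System.out.println(out); // ZWFydGg=
    }
}
```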

There are many different types of encodings – UTF-7, UTF-16, UTF-16BE, UTF-32, UCS-2, and many more.

What is Data Endianness?

Endianness is the order of bytes in memory/storage or transfer. There are two primary types of Endianness: big-endian and little-endian. You might be interested in middle-endian (mixed-endian), and you can google that on your own.

As you can see in the diagram below, the computer may represent the data starting with the most significant byte (0x0A) or the least significant byte (0x0D).

Courtesy: Wikipedia

Most modern computers are little-endian when they store multi-byte data. Networks are consistently big-endian. So, data from a little-endian machine’s memory has to be converted to big-endian (network byte order) before it goes out on the network.
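A minimal Java sketch makes the byte-order swap visible, using a 0x0A0B0C0D value like the one in the diagram above and java.nio.ByteBuffer to control the order explicitly:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndiannessDemo {
    public static void main(String[] args) {
        int value = 0x0A0B0C0D;

        byte[] big    = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();

        System.out.println(toHex(big));     // 0A 0B 0C 0D  (network byte order)
        System.out.println(toHex(little));  // 0D 0C 0B 0A  (typical in-memory order on x86)
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b));
        return sb.toString().trim();
    }
}
```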

Summary: There are many data types – basic (chars, integers, floats) and composite (arrays, decimals). Data is mapped to numbers using a universal character set (Unicode). This data is represented as a sequence of Unicode code points and converted into bits/bytes using an encoding process. The encoding process can be fixed-length (e.g., Base64, UTF-32) or variable-length (UTF-8, UTF-16). Computers can be little- or big-endian. Modern x86 (CISC) computers are little-endian; many RISC processors (e.g., ARM) are bi-endian but usually run little-endian. Networks are always big-endian.

Tips/Tricks: Stick to the Unicode character set and the UTF-8 encoding scheme. Use Base64 to transfer data to be media-safe (e.g., the URL-safe Base64 variant for embedding strings in HTTP URLs). Using a modern programming language (e.g., Java) abstracts you from the endianness. If you are an embedded engineer programming in C, you need to write endianness-safe code (e.g., be careful with type casts and memcpy).

Even with all this structure, we cannot convey meaning (semantics). An ‘A’ for the computer is always U+0041. If the programmer wants to convey anything more about that ‘A’ (a different font, style, or meaning), more information has to be encoded for the receiver to interpret. More on that in future blogs.

This one was too long even for me!

Agile Team Composition – Inequality Lens

This is a very opinionated post. More opinionated than some of my previous posts. This post is not about roles in a team (scrum master, product owner, developer, tester) or supporting structures (product management, architecture, DevOps) or in/outsourcing members. This post is also not about the skill homogeneity (homogenous, heterogenous) of an agile team. This post is about the inequality (experience and salaries) of team members within an agile team.

It’s common sense that the team should be composed of skills required to do the job and roles to perform functions. These two are necessary ingredients for a good scrum team.

If you are building a data lake, you need data engineers (competencies/skills). But data engineers’ experience ranges from 1 year to 20 years, and salaries range from 5L to 45L (INR/USD). So, how do we compose teams?

Some unwritten industry rules:

Rule A: The more experienced you are, the expectation moves from being hands-on on a single project to a mentor/coach for multiple projects. A mentor/coach competency is different from engineering a product (hands-on) competency. Adding to the irony, nobody respects a coach who is not hands-on (unlike a sports coach). Salary expectations from experienced individuals also drive this.

Rule B: The more experienced you are, the expectation moves from being a developer/tester to a scrum master, lead engineer, or architect. Many engineers hop jobs to seek out these opportunities. It’s a crime to be a developer/tester for life. The industry critically judges life-long developers/testers (there is nothing wrong with it if your passion is to build). All engineers face the dilemma of salary growth driven by opportunities in contrast to their core skills and passions. That’s life.

Rule C: The less experienced you are, the industry wants to pay you less than your experienced counterparts, irrespective of your skills and credentials. The expectation is that you are a worker bee and not a leader bee, regardless of your leadership credentials. There are exceptions, but the norm is to classify you into the developer/tester class. The manager says: “Work your way up.” It’s like the Harry Potter sorting hat at work, automatically sorting you first by experience and then by credentials.

Agile (with its egalitarian view) challenges this status quo. Treat everybody equally, says agile. How?

In reality, a pay disparity within a team auto-magically drives a command-control structure. Salaries are usually an open secret. This new agile egalitarian structure drives people to respect each other as equals on the surface, but not in spirit.

“Who wins? Capitalist or Socialist? The capitalist, of course,” is the shout-out from the management coach. “That’s the only thing that has worked for humanity.”

With this in-spirit inequality, the agile coach commands: “Self-organize yourselves.” The software engineer with two years of experience is scared to take the (tie-suit) role of product owner, and the (tie-suit) product owner cannot massage her ego enough to do the developer role. This structure is the new corporate caste system.

Critics of agile target this egalitarian view. Committees cannot make decisions. You need an escalation and decision structure with “one” accountable neck to chop.

An example that works: The five founders of a company working with agile principles to self-organize themselves for the company’s success. There is an in-built expectation that the scale of investment (money, time) drives eventual profits.

Counter example: A software development team with an experience pyramid working with agile principles to self-organize itself for the group’s success. People will stick to their roles and view team success through the specific role lens that they own: scrum masters to drive agile values (huh! no, they are just trackers), product owners to bring requirement clarity, architecture owners to bring design clarity, and engineers to build. Agile purists say that self-organizing means pulling and sharing work and has nothing to do with roles. I disagree; there is more to it! Roles define work types. It’s a culture change that is hard to achieve with built-in inequality.

It’s human nature to accept the new corporate caste system and reject the religious ones.

Somewhere the capitalist is laughing: “Want to make more money? Take risks and lead. I will invest, and you will still serve me. Ha ha ha. Money makes more money. So, make more money to make more and more money. Structures exist to control, and they are deliberately unequal. Welcome to my caste system. Do or die.”

Finally, after all that rant, my opinion: A purist agile egalitarian approach is not practical in our current world-view. A team must be composed of people with an experience pyramid and a minimum expectation of mutual respect. In a mature team whose members are driven not by salaries and opportunity but by a shared vision, self-organization is more practical, but it’s not the norm. A shared vision is not the norm; the expectation of a shared vision is. The leader drives the vision, and teams share the responsibility to deliver it. Capitalistic values drive a new world order where built-in inequality is tolerated as an acceptable tradeoff.

Some day we will grow out of this one too; or become a capitalist.