Data about Data

As a Data Engineer, I want to be able to understand the data vocabulary, so that I can communicate about the data more meaningfully and find tools to deal with the data for computing.

Let’s start with this: Binary Data, Non-binary Data, Structured Data, Unstructured Data, Semi-structured Data, Panel Data, Image Data, Text Data, Audio Data, Categorical Data, Discrete Data, Continuous Data, Ordinal Data, Numerical Data, Nominal Data, Interval Data, Sequence Data, Time-series Data, Data Transformation, Data Extraction, Data Load, High Volume Data, High Velocity Data, Streaming Data, Batch Data, Data Variety, Data Veracity, Data Value, Data Trends, Data Seasonality, Data Correlation, Data Noise, Data Indexes, Data Schema, BIG Data, JSON Data, Document Data, Relational Data, Graph Data, Spatial Data, Multi-dimensional Data, Block Data, Clean Data, Dirty Data, Data Augmentation, Data Imputation, Data Model, Object (Blob) Data, Key-value Data, Data Mapping, Data Filtering, Data Aggregation, Data Lake, Data Mart, Data Warehouse, Database, Data Lakehouse, Data Quality, Data Catalog, Data Source, Data Sink, Data Masking, Data Privacy

Now let’s go here: High volume time-series unstructured image data, High velocity semi-structured data with trends and seasonality without correlation, High volume Image data with Pexels Data source masked and stored in Data Lake as the Data Sink.

The vocabulary is daunting for a beginner. These 10 categories (ways of bucketizing) would be a good place to start:

  1. Data Representation for Computing: How is Data Represented in a Computer?
    • Binary Data, Non-binary Data
  2. Data Structure & Semantics: How well is the data structured?
    • Structured Data, Unstructured Data, Semi-structured Data
    • Sequence Data, Time-series Data
    • Panel Data
    • Image Data, Text Data, Audio Data
  3. Data Measurement Scale: How can data be reasoned with and measured?
    • Categorical Data, Nominal Data, Ordinal Data
    • Discrete Data, Interval Data, Numerical Data, Continuous Data
  4. Data Processing: How is the data processed?
    • Streaming Data, Batch Data
    • Data Filtering, Data Mapping, Data Aggregation
    • Clean Data, Dirty Data
    • Data Transformation, Data Extraction, Data Load
    • Data Augmentation, Data Imputation
  5. Data Attributes: How can data be broadly characterized?
    • Velocity, Volume, Veracity, Value, Variety
  6. Data Patterns: What are the patterns found in data?
    • Time-series Data Patterns: Trends, Seasonality, Correlation, Noise
  7. Data Relations: What are the relationships within data?
    • Relational Data, Graph Data, Document Data (Key-value Data, JSON Data)
    • Multi-dimensional Data, Spatial Data
  8. Data Storage Types:
    • Block Data, Object (Blob) Data
  9. Data Management Systems:
    • Filesystem, Database, Data Lake, Data Mart, Data Warehouse, Data Lakehouse
    • Data Indexes
  10. Data Governance, Security, Privacy:
    • Data Catalog, Data Quality, Data Schema, Data Model
    • Data Masking, Data Privacy
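
To make a few of these terms concrete, here is a minimal Python sketch that touches Dirty Data, Data Filtering, Data Imputation, Data Mapping, Ordinal Data, and Data Aggregation in one small batch. The sensor records, the field names, the -999.0 sentinel, the 25-degree cut-off, and the choice of mean imputation are all assumptions made up for illustration; real pipelines would pick these per dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical batch of semi-structured records. It is "dirty":
# one value is missing (None) and one is a faulty sentinel reading.
readings = [
    {"sensor": "a", "temp_c": 20.0},
    {"sensor": "a", "temp_c": None},
    {"sensor": "b", "temp_c": 30.0},
    {"sensor": "b", "temp_c": 34.0},
    {"sensor": "c", "temp_c": -999.0},
]

def filter_faulty(records, sentinel=-999.0):
    """Data Filtering: drop records carrying the (assumed) faulty sentinel."""
    return [r for r in records if r["temp_c"] != sentinel]

def impute_missing(records):
    """Data Imputation: fill missing values with the mean of the known ones."""
    known = [r["temp_c"] for r in records if r["temp_c"] is not None]
    fill = mean(known)
    return [{**r, "temp_c": fill if r["temp_c"] is None else r["temp_c"]}
            for r in records]

def add_band(record):
    """Data Mapping: derive an ordinal label from a continuous value."""
    band = "cool" if record["temp_c"] < 25 else "warm"
    return {**record, "band": band}

def mean_by_sensor(records):
    """Data Aggregation: average temperature per sensor."""
    groups = defaultdict(list)
    for r in records:
        groups[r["sensor"]].append(r["temp_c"])
    return {sensor: mean(values) for sensor, values in groups.items()}

# Filter, then impute, then map: the order matters, because imputing
# before filtering would pull the sentinel into the mean.
clean = [add_band(r) for r in impute_missing(filter_faulty(readings))]
summary = mean_by_sensor(clean)
print(summary)
```

The same steps run over an unbounded feed, one record at a time, would be the Streaming Data flavor; here everything is processed as a single batch.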

More blog posts will dive deep into each category and the challenges involved. Let’s peel this onion.

Tunnel Vision

When I drive through a tunnel with my kid, there are two points of excitement – entry point and exit point. When entering a new tunnel, it’s always a “Yay!!” The exit feeling depends on how long we were inside. It’s either the expression – “Finally some light” or “Oh, no, we are out.”

You get my point. We love tunnels. We like to see the light at the end of the tunnel.

We not only love tunnels, but we love to tunnel. Tunneling helps us focus, and there is a focus threshold after which we need to see some light.

Sprinting in agile is Tunneling. After a two-week sprint, we might have the expression, “Oh, no, we are out too soon.”, and after a four-week sprint, we might have the expression, “Finally, some light.”

A two-week sprint seems to be the global average for an adequate sprint length. While that is an indicator, each team must choose its own sprint duration.

Not all people are alike. Some people like to be first finishers – Tunneling helps them to get from A to Z fastest. However, we remember our journeys for unplanned explorations. Can we visit that lake nearby? Can we bathe in that waterfall? Can we take a different road?

In software projects, the expectation from engineers/architects is that they build a path from point A to Z as fast as possible (tunnel), but the way should be open to exploration by the user. E.g., Build a mobile user interface to update my health parameters, but let me explore new services in the user interface to improve my health.

If you are agile, you can introduce innovation experiments to explore users’ unwritten needs, making your own journey not feel like a long tunnel.

Takeaway: We need black-box tunnels, exciting tunnels, and open roadways, and in agile software parlance, that loosely equates to sprints, spikes, and MVPs.

Go Slow to Go Fast

“I am adopting Agile; finally, I can tell my customers that since we are going to be agile, the delivery will be late. The agile coach told me that to go fast, you need to slow down. If the customer has questions, I will ask the agile coach to help convince the customer” – Project manager

Sometimes, projects need to be rescued. They have messed up the quality, the schedule, or both. They had some “accurate estimations” done 12 months back for the entire project. Now they have to prove that their estimations were not wrong, irrespective of the scope creep, requirements ambiguities, technology risks that popped up, and COVID impacting day-to-day life.

Such projects can be rescued without agile principles. The team can take stock of current realities, re-plan, and continue this process until successful. There are very successful waterfall projects. There are failed waterfall projects that were rescued with waterfall. A non-agile method to rescue the project will still ask you to stop, re-plan, and continue, i.e., go slow (stop-think) to go fast.

An agile method to rescue the project will also ask you to stop, re-plan, and continue in iterations, i.e., deliver a small “good” quality feature increment that does not break the product, then continue. While some team members have a sprint tunnel vision, others will look beyond a sprint to ensure that they have “ready” features to add to the product when they meet their Definition of Done (DoD) and their stories are accepted. After some sprints, the team knows more about its velocity and can predict a schedule (with a guaranteed quantum of quality). The backlog would also be groomed (prioritized, broken down, detailed) in parallel by the product owners while the team is sprinting on the stories that were picked up. Agile enables teams to continue on features that are “ready” and not “halt.” If the reason for distress is “quality,” then the stories could be “debt” that must be paid off before we can build some more.

Agile is not an excuse to stop. Even when you are agile, you can accumulate debt that results in an inferior quality product, and prioritization/judgment must ensure that the debt is not out of control. There is always good debt (to gain speed) and bad debt (that hampers quality), and some intersection of the two. Even in agile, there is an in-built “stop” with iterations and a “STOP” requested by the team/customer.

Takeaway: Agile or non-agile, a project in distress requires the team to stop, think, and continue.

A best practice that I have used to prevent projects from getting into a bad state is to build an architectural runway (intentional architecture) outside the sprinting zone. “Readiness” to sprint is critical for the success of the team.

Data-driven, Metric-driven

Some say they are the same – metrics are computed on data. So, I must be data driven if I am already metric driven.

I claim that they are different. Agile combines them together in a nice way.

You would be metrics-driven to achieve a goal. Goal coaches insist that goals should be SMART (Specific, Measurable, Attainable, Realistic/Relevant, and Time-bound). E.g., “I want to be a millionaire in one year” is a SMART goal that can be measured along the way (lead metrics) and once a year has elapsed (lag metrics). The lead metrics tell you whether you are on the right path to achieve the goal, while the lag metrics tell you whether, and how well, you have achieved the goal.

You would be data-driven when you have to deal with ambiguities. Data coaches insist that good quality data must be captured continuously to seek insights that can be converted to knowledge and actions. E.g., “I want to drive academic excellence for my child this year.” This statement does not tick all the boxes of SMART. It does not say that my child should score A+ in mathematics. There is inherent ambiguity in the statement and the choice of words (“excellence”). In such situations, you collect data from tests, teachers’ feedback, and your own observations. Based on the data, you get insights: the child is great at arts, excellent at mathematics, and needs improvement in language. You then focus on sustaining the strengths (arts, mathematics) and on the development needs to improve academic excellence. Being data-driven is all about seeking actionable insights. You may decide not to take any action to improve language skills and let the child excel in her strengths, but that’s still a data-driven decision.

Agile helps with the cone of uncertainty with the levels of agility. E.g., in a software development context, the team may look at lead metrics like flow velocity and team happiness to determine whether it is on the right track to reach the outcome measured by the lag metrics. However, the team will also look at data (new features, technical debt, customer feedback) to decide whether a change of course is required. So, by being iterative, you get the benefit of both. By taking small chunks of work (stories with 8 story points), you can be metric-driven (SMART) and measure say/do as a lag metric. By grooming the product backlog with insights from data, you will also be data-driven.
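
As a rough illustration of the metric-driven half, the sketch below computes a say/do ratio per sprint and an average velocity that can feed a schedule forecast. The sprint history, the backlog size, and the function names are invented for the example; teams define and weigh these metrics in their own way.

```python
import math

# Hypothetical sprint history: (story points committed, story points completed).
sprints = [(40, 30), (40, 36), (36, 36)]

def say_do_ratio(committed, completed):
    """Lag metric: share of the committed points that were actually delivered."""
    return completed / committed

def average_velocity(history):
    """Average completed points per sprint across the history."""
    return sum(done for _, done in history) / len(history)

def sprints_to_finish(backlog_points, history):
    """Forecast: sprints needed to burn the backlog at the average velocity."""
    return math.ceil(backlog_points / average_velocity(history))

ratios = [say_do_ratio(said, done) for said, done in sprints]
velocity = average_velocity(sprints)
forecast = sprints_to_finish(170, sprints)  # 170 is an assumed backlog size

print(ratios, velocity, forecast)
```

The data-driven half does not reduce to a formula like this: it is the backlog grooming that decides whether those 170 points are still the right points to build.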

Technical Career and Competencies

Some people plan their careers. Others don’t and let it happen. Which one is better?

A great career coach will talk to you and advise you to either plan or to flow. A good career coach will always ask you to plan. A mediocre career coach will ask you to flow.

Expertise is a necessary attribute for a great career, but not sufficient. A performance track record is another necessary attribute for a great career, but not sufficient. Presidential communication and influencing skills are yet another necessary attribute for a great career, but not sufficient.

It’s easy to list down the ingredients of a great recipe (methods for a great career), but different cooks with the same recipe have different results. So, there is also some luck and practice involved.

For this blog post, let’s focus on EXPERTISE. Early technical careers are measured by the depth of technical competencies. Technical competencies define late career choices as well. My advice has always been to develop 1-2 competencies in early career to build depth. Some broad technical competencies are:

Web (Cloud Scale) Software
Enterprise Software
Device (Mobile) Software
Device (Embedded) Software
Security and Privacy
Artificial Intelligence (BI/ML)
Data Engineering
DevOps
Test Automation
Robotic Process Automation
Operations Research
Agile

The list is not comprehensive, but it is a classification of the types of software problems that software engineers & architects solve today for the market. Engineers have to build a different mindset and skills for each class of problems.

E.g., As a Web Software Engineer, you would need to have the mindset of building for scale, and skills to debug programs that may fail at scale. As a Device (Embedded) Software Engineer, you would need to have the mindset of building for scarcity (memory, CPU), and skills to debug programs with concurrency-related failures. As an Enterprise Software Engineer, you would need to have the mindset of integration, and skills to debug programs with integration/messaging problems with other systems. As a Data Engineer, you would need to have the mindset of processing data in batches, and skills to find anomalies and patterns in data. As an Agile/DevOps Engineer, you would need to have the mindset of continuous improvement, and skills to automate workflows.

Best bet – early in the career, if you know your natural mindset, you can choose a competency that fits you. Later, choose a competency that challenges you.

Don’t choose a competency just because it claims to maximize your cash flow. Instead, work to strengthen your strengths first, and later work on your development opportunities.

Summary: Once you plan, you need to let it happen, and measure your happiness. If you are not satisfied with the flow, you need to plan a change. Rinse and repeat until you are settled and happy. If you are content, plan to change.

Myth: Architects only create Diagrams

Like it or not, diagrams (visuals) are a great communication tool. If one of the responsibilities of architects is communication, there is no better way than visual communication. Contextual communication requires the same information to be represented differently for different audiences.

Architects (titled or not) have a responsibility to analyze data from various sources:

a) Requirements coming from customer or product manager.
b) Complaints coming from customer or product manager.
c) Constraints coming from customer or product manager.
d) Constraints and opportunities from operational leaders.
e) Technology advances in industry.
f) Patterns and practices in architecture.
g) Feedback from development teams.
h) Feedback from independent consultants (peers, stakeholders).
i) Inputs from security teams.
j) Inputs from operational teams.

All this data needs to be analyzed to produce conceptual and detailed sketches (as required) for construction. Today, most of these sketches are conceptual. Teams are very skilled at developing the detailed sketches right in code.

The architect is like a data scientist working on all this data to determine the function that ‘fits’. This function is represented as a diagram – a view, or a perspective. The diagram could even be a simple table.

Stating that ‘architects only create diagrams’, however, is a poor critique of the effort. Creating conceptual clarity is important for the architect. But the architect’s job does not end at diagrams. They have to communicate, plan, and code. Unless the diagram is realized, it’s useless.

Just like code and configuration should be treated like ‘code’; documentation and diagrams also need to be treated like ‘code’. This means – reviewed, maintained, tested, critiqued, destroyed, re-factored, …

Only treating ‘code’ like ‘code’ is bad ‘coding’. Code, configuration and concepts need to be treated like ‘code’.

Dismissing concepts (usually the work of an architect) is immature.

The best representation of ‘architecture’ is ‘code’. The first drafts of ‘code’ are ‘concepts’. ‘Concepts’ are represented as ‘diagrams’ for communication. And communication is a good thing.

Enough said.

Myth: Form Follows Function

This argument gets philosophical.

In classic building architecture, the form of a stadium is meant to perform the function of hosting large audiences. The form of a house is meant to perform the function of housing families. A stadium form is not used to home a family! So, it follows that form follows function. This argument then leads architects and designers to start from function to create forms.

a) Can an architect create the form of a stadium to home a family of four? Why not?
b) What about Bill Gates’s house? Is that form follows function, or form follows money spent?

Let’s take another example. How about a car? The function of a car is to take a few passengers from point A to point B safely.

However, cars are also used for

a1. Kidnapping people
a2. Moving goods (not passengers)
a3. Racing

Seems like new functions were discovered once the form (‘car’) appeared. The form was adopted for other ‘functions’ that it was not originally intended for.

Let’s take another example. How about email? The function of email is to communicate a message from one sender to one or more recipients.

However, emails are also used for

a) Advertisement
b) Spreading malware
c) Sharing photos
d) Sharing legal documents

Seems like new functions were discovered once the form (‘email’) appeared. The form was adopted for other ‘functions’ that it was not originally intended for. There may be specific platforms for ‘sharing photos’, e.g. Instagram; these specific forms could also be used for ‘selling drugs’ and ‘pornography’, which were not the original function.

Let’s take another example. How about Blockchain? The function of blockchain is to maintain a distributed ledger of transactions.

However, blockchain can also be used for

a. Creating and maintaining a patient’s historical record
b. Internet of Things
c. Voting

Seems like new functions were discovered once the form (‘blockchain’) appeared. The form was adopted for other ‘functions’ that it was not originally intended for. There may be specific platforms for maintaining a patient’s record or maintaining a registry of connected things.

The examples above indicate that “Unintended Functions Follow Form that Follows Original Function”. Kind of a recursive loop!

Having a functionalist (modernist) mindset in design & architecture is good; however, it creates blind spots in designers that make them choose from past (known) stereotypical solutions and patterns. This results in re-use of known patterns; good for manufacturing & speed, bad for creativity.

OK! Just trying to say that it’s more complicated than form following function. These days, it’s really form following profit and function following form. Flexibility and an open mind are required to keep up in this changing world.

Myth: Architecture field does not have specializations.

A quick search for software architect titles would dispel this myth.

  • Application Architect
  • Java Architect
  • Cloud Architect
  • Database Architect
  • Data Architect
  • Analytics Architect
  • Security Architect
  • SOA Architect
  • Solution Architect
  • Product Architect

Ten (#10) is a good number to make my point. Now let’s look at super-specialization.

Cloud Architect
        AWS Architect
        Azure Architect
        Cloud Foundry Architect
        Google Cloud Architect
Database Architect
        MS SQL Architect
        Postgres Architect
        Azure SQL Architect
        Cassandra Architect

Generalizing this, it could be reasoned that, the ask in the marketplace is for:

b1. Big “A” Architect, known for his/her knowledge of patterns and practices of general architecture.
b2. Specialized Architect, known for his/her knowledge of patterns and practices in a specialized field (e.g. cloud, security, database)
b3. Super Specialized Architect, known for his/her knowledge of patterns and practices in a super specialized field (e.g. Azure, AWS, Postgres)

Hmmm…now now, if you have different super specialties that are required to build a product, would you

c1. Build a team of super specialized architects?
c2. Hire an architect, and have collaboration with super specialized engineers, consult with super specialized architects on a need basis?

My personal preference is the second option (c2).

Let’s take a healthcare analogy. In healthcare, you can find generalists, specialists and super specialists – general physicians, cardiologists, pediatric cardiologists, neurologists, dermatologists, endocrinologists, etc.

If you have a headache, where do you go? General Physician or neurologist?

d1. The chance that your condition is classified as a common condition is high, if you visit the general physician.
d2. The chance that your condition will be over-tested is high, if you visit a specialist.

It’s generalist and specialist bias.

The same is true of software architecture. If you go to a general software architect with an authentication problem, the solution proposal will look simple, e.g. Use LDAP. If you go to a security architect with an authentication problem, the solution proposal will be comprehensive, e.g. Use LDAP, Enable SAML & OAuth2 for single sign-on, Test for the OWASP Top 10 Web Application Security Risks, Develop a threat model, …

The specialist is also costlier than the generalist; and the super specialist is costlier than the specialist.

You can always get a second opinion from a specialist. However, it’s best that the system encourages generalists to front-end specialists. A tiered approach is better than walking directly to a specialist.

Just like a family physician (PCP – Primary Care Physician) is the first line of consult, a general architect (Big “A” Architect) is a good start to lead the architecture of a software product/solution.

But – hey – specializations exist in software architecture.

Myth: An architecture can be sold by its quality attributes & trade-offs.

An architect has the responsibility to sell her architecture. This could happen in reviews with peers, reviews with stakeholders, reviews with developers, or reviews with operational managers.

Certain things can be learnt by the architect from the patterns and practices of the sales discipline. There are two key practices in sales.

  1. Value-based selling. In this art form, the salesperson describes and reinforces the value of the product that she is selling; some key quality attribute, e.g. brand, reliability, best-seller, discounted only for the month, etc.
  2. Solution-based selling. In this art form, the salesperson understands the customer’s pain points and creates a basket of products/services that address the pain points. The outcome is a personal solution to the pain points.

I have seen that good architects can be perceived as great if they can adapt to the various stakeholders (peers, developers, users, managers, etc.), i.e. adopt solution selling.

The general practice in architecture is to collect requirements from different stakeholders, do some industry research, apply some brain power and experience, and come out with an architecture that could meet the needs of various stakeholders. The architect is proud of the architecture that she has developed, and would like the stakeholders to accept the proposal; and seeks to sell the value proposition of the architecture produced…i.e. the guarantees of best-in-class and latest technologies, the quality of data stored, etc.

However, for most stakeholders, it’s really about

  1. “What’s in it for me?” and
  2. “How is my need met?”

For a peer architect, usually coming in from a very specific perspective, it’s about making sure a certain quality attribute is done well, e.g. dependent APIs.

For a developer, it’s usually about the technology used to build the product and the ease and comfort of developing it.

For a user, it’s usually about meeting the primary functional and non-functional goals.

For an operational manager, it’s usually about having an architecture defined that is generally accepted, knowing the potential structure she needs to set up to execute the plan, and knowing what risks need to be tracked.

So, when an architecture is described using the logical view, deployment view, security perspective, etc., the stakeholders of the architecture (users, developers, managers, peers) are not specifically addressed.

In my experience, an architecture view is very personal to a stakeholder, and must be created to meet the needs of the stakeholder (i.e. solution selling). It’s important to understand the pain point of the stakeholder, and sell the architecture in context of the pain points.

Bottom line: an architecture description should be minimal enough to sell the architecture to various stakeholders. A comprehensive data view is great for the architect to get the BIG picture, but not a great selling proposition for users, developers, managers, or peers.

Specific and customised selling helps to improve and iterate on architecture. A good architect is a great architect when she can sell the architecture successfully. Capitalism!

Myth: Architect belongs to the architecture team

Yes, and no. Necessary, but not sufficient.

The Architecture Office (AO), in particular, is very strict & diligent about its patterns and practices. It is empowered & responsible for governance (in a good way), and for rationalizing (harmonizing) efforts in the organization. The architect feels a belonging to this team to derive rules & priorities, and to contribute to and influence the big picture.

Agile (scrum) teams, in particular, have high cohesion. They disown anybody who doesn’t work with them in their pods. In the spirit of an empowered team, they want the team to make decisions, and hence the architect must belong to them. This is the best way to influence work that happens on the ground.

X-Functional Leadership teams are typically composed of program, product, operational, business, architectural, manufacturing and engineering points of contact. Teaming with such a x-functional group is very critical to the success of a x-functional initiative.

Each team is a mission focused team. Each team is empowered. Each team is connected to other teams (team network grid) for information flow & communication.

The architect (and other titled roles) find themselves to be part of several teams. I have had architects tell me that they see the “scrum role” as solid line, and “AO” or “X-Functional LT” as dotted line. Some others tell me that “AO” is solid line, and “scrum role” or “X-Functional LT” is dotted line. In essence, they are prioritizing one team, either in where they spend time or in how they trade off a quality attribute. It’s not just navigating a matrix, but a network of matrices, trading off authority & influence. This can become complex.

If Architecture is the art of managing complexity, and complexity is due to structure, then the ‘architect’ has to play in the space between spaces. (S)he has to be in different teams. Given the nature of the complexity, sometimes (s)he has to play devil’s advocate with the scrum team, and sometimes with the “AO”. (S)he has to own the architectural decisions that (s)he is responsible for, through transparent communication. (S)he has to make all such teams believe/perceive that (s)he is part of that team.

Being part of multiple teams is a reality. Different leaders pulling in different & sometimes conflicting directions is also a reality. Being part of multiple teams is an opportunity to influence & carry information. The last thing that should happen is to lose your mind trying to do everything. It’s best to be plugged into each team and make them believe/perceive that you are working for them, while focusing on one thing at a time. Multiplexing is like an adult choice.