Streaming vs. Messaging

“We already have pub/sub messaging infrastructure in our platform. Why are you asking for a streaming infrastructure? Use our pub/sub messaging infrastructure.” – Platform Product Manager

Streaming and Messaging Systems are different. The use-cases are different.

Both streaming and messaging systems use the pub-sub pattern, with producers posting messages and consumers subscribing. Subscribed consumers may either poll or get notified: consumers in streaming systems generally poll the brokers, while brokers in messaging systems push messages to consumers. Engineers use streaming systems to build data processing pipelines and messaging systems to build reactive services. Both systems support the standard delivery semantics (at-least-once, exactly-once, at-most-once). Brokers in streaming systems are simpler than those in messaging systems, which build routing and filtering intelligence into the broker. Streaming systems are faster than messaging systems precisely because they lack that routing and filtering intelligence 🙂

Let’s look at the top three critical differences in detail:

#1: Data Structures

In streaming, the data structure is a stream, and in messaging, the data structure is a queue.

“Queue” is a FIFO (First In, First Out) data structure. Once a consumer consumes an element, it is removed from the queue, reducing the queue size. A consumer cannot fetch the “third” element from the queue; queues don’t support random access. E.g., a queue of people waiting to board a bus.
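The queue behavior described above can be sketched in a few lines of Python (a minimal illustration using the standard library, not any particular broker’s API):

```python
from collections import deque

# A minimal queue sketch: consuming removes the element, and there is no
# random access to the "third" element without consuming the ones before it.
queue = deque(["alice", "bob", "carol"])  # people waiting to board a bus

first = queue.popleft()   # the consumer takes the head element
assert first == "alice"
assert len(queue) == 2    # the queue shrinks after consumption
```

Note how consumption is destructive: once `popleft()` returns an element, no other consumer can read it.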

“Stream” is a data structure that is partitioned for distributed computing. If a consumer reads an element from a stream, the stream size does not reduce. The consumer can continue to read from the last read offset within a stream. Streams support random access; the consumer may choose to seek to any reading offset. The brokers managing streams keep the state of each consumer’s reading offset (like a bookmark while reading a book) and allow consumers to read from the beginning, the last read offset, a specific offset, or the latest. E.g., a video stream of movies where each consumer resumes at a different offset.
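The contrast with the queue can be sketched as an append-only log with per-consumer bookmarks (a simplified model; real brokers track offsets per partition and consumer group):

```python
# A minimal stream sketch: an append-only log that never shrinks on reads;
# the "broker" tracks each consumer's offset (bookmark) separately.
stream = ["frame-0", "frame-1", "frame-2", "frame-3"]
offsets = {"consumer-a": 0, "consumer-b": 2}  # hypothetical consumer names

def read_next(consumer):
    # Return the element at the consumer's offset and advance its bookmark.
    element = stream[offsets[consumer]]
    offsets[consumer] += 1
    return element

assert read_next("consumer-a") == "frame-0"
assert read_next("consumer-b") == "frame-2"  # each consumer has its own place
assert len(stream) == 4                      # the stream does not shrink
offsets["consumer-a"] = 3                    # random access: seek anywhere
assert read_next("consumer-a") == "frame-3"
```

Two consumers read the same stream concurrently at different positions, which is impossible with the destructive queue above.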

In streaming systems, consumers refer to streams as Topics, and multiple consumers can subscribe to a topic simultaneously. In messaging systems, the administrator configures each queue to deliver messages to either one consumer or many consumers; the latter pattern is called a Topic and is used for notifications. A Topic in a streaming system is always a stream, while in a messaging system it’s always a queue.

Both stream and queue data structures order the elements in a sequence, and the elements are immutable. These elements may or may not be homogenous.

Queues can grow and shrink with publishers publishing and consumers consuming, respectively. Streams can grow with publishers publishing messages and do not shrink with consumers consuming. However, streams can be compacted by eliminating duplicates (on keys).
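Key-based compaction can be sketched as “keep only the latest value per key” (a simplified model of the idea; real log compaction, e.g. in Kafka, runs incrementally in the background and preserves record positions differently):

```python
# Compaction sketch: a log of (key, value) records is reduced so that only
# the most recent value per key survives.
log = [("user1", "v1"), ("user2", "v1"), ("user1", "v2"), ("user1", "v3")]

def compact(records):
    latest = {}                  # a later value for a key overwrites earlier ones
    for key, value in records:
        latest[key] = value
    return list(latest.items())

assert compact(log) == [("user1", "v3"), ("user2", "v1")]
```

The stream shrinks not because consumers read it, but because older values for the same key become redundant.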

#2: Distributed (Cluster) Computing Topology

Since each element in a queue is consumed by exactly one consumer in the load-balancing pattern, the fetch must come from the central (master) node. The consumers may run on multiple nodes for distributed computing. The administrator configures the master broker node to store and forward data to other broker nodes for resiliency; however, this is a single-master, active-passive distributed computing paradigm.

In the notification (topic) pattern, multiple consumers on a queue can consume filtered content to process data in parallel. The administrator configures the master node to store and forward data to other broker nodes that serve consumers. The publishers publish to a single master/leader node, but consumers can consume from multiple nodes. This is the CQRS (Command Query Responsibility Segregation) pattern of distributed computing.

The streaming pattern is similar to the notification pattern w.r.t. distributed computing. Unlike messaging, partition keys break streams into shards/partitions, and the lead broker replicates these partitions to other brokers in the cluster. The leader election process selects a broker as a leader/master for a given shard/partition, and shard/partition replications serve multiple consumers in the CQRS pattern. The consumers read streams from the last offset, random offset, beginning, or latest.

If the leader fails, either a passive slave can take over, or the cluster elects a new leader from existing slaves.
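The partition-key mechanics above can be sketched as deterministic hashing (an illustrative scheme only; the hash function and partition count here are assumptions, not any specific broker’s partitioner):

```python
import hashlib

NUM_PARTITIONS = 4  # assumed partition count for this sketch

def partition_for(key: str) -> int:
    # Hash the partition key deterministically, so records with the same key
    # always land in the same partition and therefore stay ordered.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % NUM_PARTITIONS

p = partition_for("vehicle-42")
assert 0 <= p < NUM_PARTITIONS
assert partition_for("vehicle-42") == p  # deterministic routing
```

Because the mapping is deterministic, each partition can have its own leader broker and replicas, and consumers can scale out per partition in the CQRS style described above.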

#3: Routing and Content Filtering

In messaging systems, the brokers implement the concept of exchanges, where the broker can route the messages to different endpoints based on rules. The consumers can also filter content delivered to them at the broker level.

In streaming systems, the brokers do not implement routing or content filtering. Consumers may filter content, but utility libraries in the consumer filter out the content after the broker delivers the content to the consumer.
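The consumer-side filtering just described can be sketched as follows (a minimal illustration; the message shapes and field names are hypothetical):

```python
# Consumer-side filtering sketch: the streaming broker delivers every message
# in the subscribed stream; a utility function in the consumer then drops
# what it doesn't need, after delivery.
delivered = [
    {"type": "gps", "speed": 40},
    {"type": "diagnostic", "code": 7},
    {"type": "gps", "speed": 55},
]

gps_only = [m for m in delivered if m["type"] == "gps"]
assert len(gps_only) == 2  # all 3 messages crossed the wire; 1 was discarded
```

In a messaging system, by contrast, the broker itself could be configured to deliver only the `gps` messages, so the diagnostic message would never reach this consumer.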

Tabular Differences View

| Category | Streaming | Messaging |
| --- | --- | --- |
| Supports Publish and Subscribe Paradigm | Yes | Yes |
| Polling vs. Notification | Polling by consumers | Notification by brokers to consumers |
| Use Case | Data processing pipelines | Reactive (micro)services |
| Delivery Semantics Supported | at-most-once, at-least-once, exactly-once | at-most-once, at-least-once, exactly-once |
| Intelligent Broker | No | Yes |
| Data Structure | Stream | Queue |
| Patterns | CQRS | Content-based routing/filtering, worker (LB) distribution, notification, CQRS |
| Data Immutability | Yes | Yes |
| Data Retention | Yes. Not deleted after delivery. | No. Deleted after delivery. |
| Data Compaction | Yes. Key de-duplication. | N/A |
| Data Homogeneity | Heterogeneous by default. Supports schema checks on data outside the broker. | Heterogeneous by default. |
| Speed | Faster than messaging | Slower than streaming |
| Distributed Computing Topology | Broker cluster with a single master per stream partition; consumers consume from multiple brokers, with data replicated across brokers | Broker cluster with a single master per topic/queue; active-passive broker configuration for the load-balancing pattern; data replicated across brokers for multi-consumer distribution |
| State/Memory | Brokers remember each consumer’s bookmark (state) in the stream | Consumers always consume from time of subscription (latest only) |
| Hub-and-Spoke Architecture | Yes | Yes |
| Vendors/Services (Examples) | Kafka, Azure Event Hub, AWS Kinesis | RabbitMQ, Azure Event Grid, AWS SQS/SNS |
| Domain Model | A stream of GPS positions of a moving car | A queue to buy train tickets |
Table of Differences between Streaming/Messaging Systems

Visual Differences View

Summary

Use the right tool for the job. Use messaging systems for event-driven services and streaming systems for distributed data processing.

Data Batching, Streaming and Processing

The IT industry likes to treat data like water. There are clouds, lakes, dams, tanks, streams, enrichments, and filters.

Data Engineers combine Data Streaming and Processing into a term/concept called Stream Processing. If data in the stream are also Events, it is called Event Stream Processing. If data/events in streams are combined to detect patterns, it is called Complex Event Processing. In general, the term Events refers to all data in the stream (i.e., raw data, processed data, periodic data, and non-periodic data).

The examples below help illustrate these concepts:

Water Example:

Let’s say we have a stream of water flowing through our kitchen tap. This process is called water streaming.

We cannot use this water for cooking without first boiling the water to kill bacteria/viruses in the water. So, boiling the water is water processing.

If the user boils the water in a kettle (in small batches), the processing is called Batch Processing. In this case, the water is not instantly usable (drinkable) from the tap.

If an RO (Reverse Osmosis) filtration system is connected to the plumbing line before the water streams out from the tap, it’s water stream processing with filter processors. The water stream output from the filter processors is a new filtered water stream.

A mineral-content quality processor generates a simple quality-control event on the RO filtered water stream (EVENT_LOW_MAGNESIUM_CONTENT). This process is called Event Stream Processing. The mineral-content quality processor is a parallel processor. It tests several samples in a time window from the RO filtered water stream before generating the quality control event. The re-mineralization processor will react to the mineral quality event to Enrich the water. This reactive process is called Event-Driven Architecture. The re-mineralization will generate a new enriched water stream with proper levels of magnesium to prevent hypomagnesemia.

Suppose the water infection-quality control processor detects E. coli bacteria (EVENT_ECOLI), and the water mineral-quality control processor detects low magnesium content (EVENT_LOW_MAGNESIUM_CONTENT). In that case, a water risk processor will generate a complex event combining these simple events, publishing that the water is unsuitable for drinking (EVENT_UNDRINKABLE_WATER). Reacting to this event, the tap can decide to shut the water valve.

Water Streaming and Processing generating complex events

Data Example:

Let’s say we have a stream of images flowing out from our car’s front camera (sensor). This stream is image data streaming.

We cannot use this data for analysis without identifying objects (person, car, signs, roads) in the image data. So, recognizing these objects in image data is image data processing.

If a user analyses these images offline (in small batches), the processing is called Batch Processing. In this case, the image data is not instantly usable, and any events generated from such retrospective batch processing come too late to react to.

If an image object detection processor connects to the image stream, it is called image data stream processing. This process creates new image streams with enriched image meta-data.

If a road-quality processor detects snow and generates a simple quality control event (EVENT_SNOW_ON_ROAD), then we have Event Stream Processing. The road-quality processor is a parallel processor: it tests several image samples in a time window from the image data stream before generating the quality control event.

Suppose the ABS (Anti-lock Braking Sub-System) listens to this quality control event and turns on the ABS. In that case, we have an Event-Driven Architecture reacting to Events processed during the Event Stream Processing.

Suppose the road-quality processor generates a snow-on-the-road event (EVENT_SNOW_ON_ROAD), and a speed-data stream generates vehicle speed data every 5 seconds. In that case, an accident risk processor in the car may detect a complex quality control event to flag the possibility of accidents (EVENT_ACCIDENT_RISK). The vehicle’s risk processor performs complex event processing on event streams from the road-quality processor and data streams from the speed stream, i.e., by combining (joining) simple events and data in time windows to detect complex patterns.
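The window join described above can be sketched in a few lines (the event names come from the example; the window size and speed threshold are assumed values for illustration):

```python
# Complex-event-processing sketch: join a simple road-quality event with
# speed readings that fall in the same time window, and raise a complex
# event when both conditions hold.
events = [("EVENT_SNOW_ON_ROAD", 10)]          # (event name, timestamp)
speed_stream = [(60, 8), (72, 12), (65, 14)]   # (speed km/h, timestamp)

def accident_risk(events, speeds, window=10, speed_limit=70):
    for name, t in events:
        if name != "EVENT_SNOW_ON_ROAD":
            continue
        # Join: collect speed samples within the window around the snow event.
        in_window = [s for s, ts in speeds if abs(ts - t) <= window]
        if any(s > speed_limit for s in in_window):
            return "EVENT_ACCIDENT_RISK"    # complex event from simple inputs
    return None

assert accident_risk(events, speed_stream) == "EVENT_ACCIDENT_RISK"
```

Neither input alone triggers the complex event; it is the combination of snow and excessive speed within the same time window that does.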

Data Streaming and Processing generating complex actionable events

Takeaway Thoughts

As you can see from the examples above, streaming and processing (stream processing) is preferable to batching and processing (batch processing) because of its capability to generate actionable events in real time.

Data engineers define data-flow “topology” for data pipelines using some declarative language (DSL). Since there are no cycles in the data flow, the pipeline topology is a DAG (Directed Acyclic Graph). The DAG representation helps data engineers visually comprehend the processors (filter, enrich) connected in the stream. With a DAG, the operations team can also effectively monitor the entire data flow for troubleshooting each pipeline.
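A pipeline topology of this kind can be sketched as a chain of processors wired in directed, acyclic order (a toy illustration in plain Python, not any particular DSL; the record shapes and event name are hypothetical):

```python
# Tiny DAG-style pipeline sketch: source -> filter -> enrich -> sink.
def source():
    # A stand-in for a data stream of speed readings.
    yield from [{"speed": 40}, {"speed": 95}, {"speed": 60}]

def filter_speeding(records):
    # Filter processor: drop records that need no action.
    return (r for r in records if r["speed"] > 70)

def enrich(records):
    # Enrich processor: attach an event label to surviving records.
    return ({**r, "event": "EVENT_SPEEDING"} for r in records)

# Wiring the stages together defines the (acyclic) data flow.
result = list(enrich(filter_speeding(source())))
assert result == [{"speed": 95, "event": "EVENT_SPEEDING"}]
```

Each stage is independently testable and monitorable, which is exactly what makes the DAG view useful to operations teams.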

Computing leverages parallel processing at all levels. Even with small data, clusters of ALUs (Arithmetic Logic Units) at the hardware level process data streams in parallel for speed. These SIMD/MIMD (Single/Multiple Instruction, Multiple Data) architectures are the basis for cluster computing, which combines multiple machines to execute work using map-reduce over distributed data sets. Big data tools (e.g., Kafka, Spark) have effectively abstracted cluster computing behind common industry languages like SQL, programmatic abstractions (stream, table, map, filter, aggregate, reduce), and declarative definitions like DAGs.

We will gradually explore big data infrastructure tools and data processing techniques in future blog posts.

Data stream processing is processing data in motion. Processing data in motion helps generate real-time actionable events.