Unlock Efficient Data Management: 2 Strategies to Boost Business Success

You may also enjoy:
Change Data Capture (CDC) With Embedded Debezium and Spring Boot

Change data capture (CDC) and event-driven systems have recently become a recurring theme in my conversations and online reading, which prompted me to address the confusion that surrounds them.
CDC and event-driven communication are two distinct concepts that share some similarities, hence the confusion. It is worth being careful here, because mistaking one for the other can lead to architectural problems that are very hard to undo.

Unraveling the Concepts

Change data capture (CDC) typically entails a mechanism for tracking every change made to a system's data. The need for such a mechanism is not hard to imagine: auditing sensitive information, replicating data across multiple database instances or data centers, or moving changes from transactional databases into data lakes or OLAP stores. Transaction management in ACID-compliant databases relies on a similar idea internally, in the form of transaction logs. A CDC system keeps a record of every change made to an entity, along with the metadata of that change (changed by, change time, etc.).
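To make the "record of every change plus metadata" idea concrete, here is a minimal sketch in Python. The `ChangeRecord` and `ChangeLog` names and their fields are assumptions made for illustration; they are not taken from any particular CDC product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, List


@dataclass(frozen=True)
class ChangeRecord:
    """One captured change to one field of one entity (hypothetical shape)."""
    entity: str            # e.g. a table or entity name
    entity_id: str
    changed_field: str
    old_value: Any
    new_value: Any
    changed_by: str        # metadata: who made the change
    changed_at: datetime   # metadata: when it was made


class ChangeLog:
    """Append-only, in-memory stand-in for a CDC store."""

    def __init__(self) -> None:
        self._records: List[ChangeRecord] = []

    def append(self, record: ChangeRecord) -> None:
        self._records.append(record)

    def history(self, entity: str, entity_id: str) -> List[ChangeRecord]:
        # Return every recorded change for one entity, in insertion order.
        return [
            r for r in self._records
            if r.entity == entity and r.entity_id == entity_id
        ]


log = ChangeLog()
log.append(ChangeRecord(
    entity="order",
    entity_id="12345",
    changed_field="status",
    old_value="created",
    new_value="in progress",
    changed_by="alice",
    changed_at=datetime.now(timezone.utc),
))
```

A real CDC store would be durable and ordered across writers; the list here only illustrates the shape of the data.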

I have previously discussed events on this blog, describing them as announcements of something that has occurred within the system domain, accompanied by relevant data about that occurrence. At first glance, this might seem identical to CDC — something changes in a system, and this needs to be communicated to other systems — which is precisely what CDC is about.

However, there is a crucial distinction to be made here. Events are defined at a much higher level of abstraction than data changes because they represent meaningful changes to the domain. Data representing an entity can change without having any “business” impact on the overall entity that the data represents. There can be several sub-states of an order that an order management system might maintain internally but which do not matter to the outside world. 

An order moving to these states would not generate events, but changes would be logged in the CDC system. On the other hand, there are states that the rest of the world cares about (created, dispatched, etc.) and the order management system explicitly exposes to the outside world. Changes to or from these states would generate events.
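The order example can be sketched as follows. This is an illustration under assumptions: the state names, the `PUBLIC_STATES` set, and the `OrderService` class are hypothetical, and for brevity the sketch emits an event only when an order enters a public state.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical split: only these states matter to the outside world.
PUBLIC_STATES = {"created", "dispatched", "cancelled"}


class OrderService:
    def __init__(self) -> None:
        self.status: Dict[str, str] = {}
        # CDC side: (order_id, old_status, new_status) for every change.
        self.change_log: List[Tuple[str, Optional[str], str]] = []
        # Event side: (order_id, event_type) for public transitions only.
        self.events: List[Tuple[str, str]] = []

    def set_status(self, order_id: str, new_status: str) -> None:
        old_status = self.status.get(order_id)
        self.status[order_id] = new_status
        # Every data change is captured, whether or not anyone outside cares.
        self.change_log.append((order_id, old_status, new_status))
        # Only entering a publicly meaningful state produces a domain event.
        if new_status in PUBLIC_STATES:
            self.events.append((order_id, f"order_{new_status}"))


svc = OrderService()
for state in ["created", "payment_verified", "picking", "dispatched"]:
    svc.set_status("12345", state)
```

After these four updates the change log holds four entries, while only two events (`order_created` and `order_dispatched`) were emitted.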

The difference can be explicitly stated in terms of system boundaries. When we design microservices or perform any system decomposition, we are trying to identify and isolate bounded-contexts or business domains from each other. This is the foundation of all domain-driven design.
CDC is concerned with capturing data changes within a system's bounded context, usually in terms of the physical model. The system records changes to its own data. Even if we have a separate service or system that stores these changes (some sort of platformized audit store), the separation is an implementation detail. There is a continuity of domain modeling between the actual data and changes to it, hence both belong logically inside the same boundary.  
Events, on the other hand, are domain model level broadcasts emitted by one bounded context to be consumed by other bounded contexts. These represent semantically significant events in a language that the external systems can understand and respond to. That they are published over the same messaging medium, use similar frameworks, or get persisted somewhere are all implementation details.
Delving into CQRS
Have you considered designing a system based on the CQRS (Command Query Responsibility Segregation) architectural paradigm? For the uninitiated, CQRS is an approach that separates the data model and technologies employed for writes (Command) from those used for reads (Query).
This design is often employed when there is a significant disparity between the write patterns and read patterns to be supported. I have provided a concise illustration of such a system in my case study on the intricacies of asynchronous programming. Updates to the command model are propagated to the read model, typically (but not necessarily) asynchronously. Can we leverage CDC for this purpose? Or should the command module emit events that the query module reads to build its data model?

I would argue that since the command-query model separation is an internal design aspect of the system, both models reside within the same bounded context, and using CDC logs is reasonable. Both producer and consumer operate at the same level of abstraction (both are data stores, although one may be MySQL and the other Elasticsearch), so using DB-level change logs is not a bad idea. This, of course, is just an opinion. Using events here would not be bad either, especially if different teams manage the models. The command module should always emit events anyway, if only to imbue the overall architecture with evolutionary characteristics.
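As a rough sketch of the CDC option, the query side can rebuild its model by replaying field-level change records from the command side. The record shape (`id`, `field`, `old`, `new`) and the `apply_changes` function are assumptions made for this example, and a plain dictionary stands in for the read store.

```python
from typing import Any, Dict, Iterable

ReadModel = Dict[str, Dict[str, Any]]


def apply_changes(read_model: ReadModel, changes: Iterable[Dict[str, Any]]) -> ReadModel:
    """Replay change records, in order, onto a document-style read model."""
    for change in changes:
        # Create the document on first sight, then overwrite the changed field.
        doc = read_model.setdefault(change["id"], {"id": change["id"]})
        doc[change["field"]] = change["new"]
    return read_model


changes = [
    {"id": "12345", "field": "status", "old": None, "new": "created"},
    {"id": "12345", "field": "total", "old": None, "new": 40},
    {"id": "12345", "field": "status", "old": "created", "new": "dispatched"},
]
read_model = apply_changes({}, changes)
```

Both sides speak in fields and values here, which is why this coupling stays inside one bounded context.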
Constructing CDC and Event-Driven Systems
In modern distributed setups, change data is typically disseminated over a messaging medium like Kafka and can then be consumed by other systems that want to store it. A popular and efficient way of building CDC systems is to tail the internal log files of databases (MySQL and other relational databases maintain such logs for transaction management and replication, and some other data stores expose a comparable change feed) using something like Filebeat, and then publish the logs over Kafka. The consuming side typically uses Logstash-style plugins to ingest the data into systems that persist this change log. Consumers may also be Spark- or Flink-style streaming applications that consume and transform this data into a form suitable for other use cases.

In certain scenarios, tapping into database change logs is not a viable option, as not all systems provide this functionality. In such cases, we need to add code at the application layer to generate change logs. Ensuring that the data update and the log emission either both happen or both do not is an extremely challenging problem (essentially an atomic update problem: how to guarantee that the database update and the Kafka emission are either both committed or both rolled back). Data consistency is paramount in a CDC system.
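One well-known way to approach this atomicity problem, which this post does not itself prescribe, is the transactional outbox pattern: write the change record into an outbox table in the same database transaction as the data update, and let a separate relay publish outbox rows afterwards. The sketch below uses Python's built-in `sqlite3` module; the table names, `update_status`, and `drain_outbox` are hypothetical, and the `publish` callback stands in for a real Kafka producer.

```python
import json
import sqlite3
from typing import Any, Callable, Dict


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def update_status(conn: sqlite3.Connection, order_id: str, new_status: str,
                  fail: bool = False) -> None:
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the data update and the outbox row are written together or not at all.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO orders (id, status) VALUES (?, ?)",
            (order_id, new_status),
        )
        if fail:
            raise RuntimeError("simulated crash before commit")
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"order_id": order_id, "status": new_status}),),
        )


def drain_outbox(conn: sqlite3.Connection,
                 publish: Callable[[Dict[str, Any]], None]) -> int:
    """Relay step: publish pending rows in order, deleting each once published."""
    rows = conn.execute("SELECT seq, payload FROM outbox ORDER BY seq").fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))  # a Kafka producer call in a real system
        with conn:
            conn.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
    return len(rows)


conn = open_db()
update_status(conn, "12345", "cancelled")
published = []
drain_outbox(conn, published.append)
```

Because the relay deletes a row only after publishing it, a crash between the two steps leads to a repeat delivery rather than a lost change, so consumers should tolerate duplicates.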

When building an event-driven system, we would need to incorporate event generation logic at the application layer, similar to CDC for databases without log files. This is the only point where we can translate the database language into the domain language. As with CDC, preventing event loss in the publisher is crucial to the design.
However, some proponents suggest using CDC streams as a system’s event stream, which I strongly disagree with for the reasons given above. This approach tightly couples other systems to our system’s physical data model and forces our public entities to mirror the database model. This severely restricts the expressiveness of our domain model. Consider an order cancellation scenario. The CDC system would record something like
ChangeLog {"order number": "12345", "changed field": "status", "old value": "in progress", "new value": "cancelled"}
If I were to express this in my domain language, describing what can or cannot happen to orders, I would ideally convey something like
OrderEvent {"order number": "12345", "event type": "order cancellation"}
This abstraction would not be possible if we physically tie the transmission language to CDC language.
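The translation from database language to domain language can live in a small application-layer function. The sketch below reuses the field names from the two records above; the `STATUS_EVENTS` mapping and the `to_order_event` function are assumptions for illustration, and the "dispatched" entry in particular is a made-up example.

```python
from typing import Dict, Optional

# Hypothetical mapping from persisted status values to domain event types.
STATUS_EVENTS = {
    "cancelled": "order cancellation",
    "dispatched": "order dispatch",
}


def to_order_event(change: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Translate a change record into a domain event, or None if it is internal."""
    if change.get("changed field") != "status":
        return None
    event_type = STATUS_EVENTS.get(change.get("new value", ""))
    if event_type is None:
        # Internal sub-state: recorded by CDC, but not announced to the world.
        return None
    # The event exposes only the domain-level facts, not old and new column values.
    return {"order number": change["order number"], "event type": event_type}


change = {
    "order number": "12345",
    "changed field": "status",
    "old value": "in progress",
    "new value": "cancelled",
}
event = to_order_event(change)
```

Keeping this mapping inside the publishing service means the database schema can change without altering what consumers see.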

In Conclusion

One crucial aspect to remember when building software is that seemingly similar concepts using similar tools can have distinct differences. Especially when working with logical and physical models, we should be cautious to separate implementation details from the essence of what is being implemented. 
Carefully examine the publisher and consumer of the record being published — if they are both defined at the “data store” level, we are likely dealing with CDC. If they revolve around business constructs (bounded contexts) like order, courier, invoice, etc., we are likely in the realm of events.

Nicholas Parker
