Change Data Capture (CDC) & Why It Matters


What is change data capture? Does it have any importance or bearing on the work that you do? We’ll answer the first question shortly, but the answer to the second question is most certainly “yes.”

What Is Change Data Capture?

Change data capture (CDC) solves data integration problems by monitoring and capturing insert, update, and delete activity on table data.

The first step in this process is to identify that a change has been made to a data source. To do so, a CDC procedure will inspect the relevant table to see if there is a delta between the old values and new values, based on what’s stored in the source systems, then update the pertinent record accordingly.

There are several good methods for doing this, such as high watermarks, diffs, and database triggers. However, one of the most effective approaches to tracking data changes is log-based CDC.

While interpreting the changes in the transaction log can be difficult, the biggest benefit of log-based change data capture is its asynchronous nature: changes are captured independently of the source application that performs them. With log-based CDC, new database transactions – including inserts, updates, and deletes – are read from the source databases' transaction or redo logs. The changes are captured without making application-level changes and without scanning operational tables, both of which would add workload and reduce source systems' performance.
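For example, in MySQL the transaction log is the binary log (binlog), and a log-based CDC reader typically requires the server to record row-level events. Below is a minimal illustrative sketch of the relevant server settings; the values are examples only, and the exact requirements depend on the CDC tool you use.

# my.cnf -- illustrative settings for exposing MySQL's transaction log to a log-based CDC reader
[mysqld]
server-id        = 223344     # any unique, non-zero server id
log_bin          = mysql-bin  # enable the binary log
binlog_format    = ROW        # row-level events carry the full changed data
binlog_row_image = FULL       # include all columns of the affected row in each event

Other databases expose their logs differently (for example, SQL Server change tables, Oracle redo logs, or PostgreSQL logical replication slots), but the principle is the same: the CDC reader consumes changes the database has already recorded.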

By allowing you to capture and extract only changed data, change data capture eliminates or reduces the need for batch windows, replacing bulk load updates with continuous streaming or incremental loading of data. Change data capture helps you improve efficiency by reducing redundant data replication and delivery, reducing data warehousing costs and facilitating real-time data integration across your data stores.

Why Change Data Capture Matters

There are many advantages of using transaction logs for CDC:

  • This solution is transparent to databases and has minimal impact on the database.
  • Changes are published to the target system in near real time, which means the business can make accurate decisions based on the most current data.
  • There is no impact on transactions at the source application and no additional SQL load on the system.
  • Because CDC transfers only incremental changes, it reduces the cost of transferring data.
  • This approach maintains the order in which the original transactions were committed, which is important when the target application depends on the order of transactions in the source system. These ordering guarantees are often required by target applications.
  • The target system can take its time processing the messages.

Together, these advantages enable the building of streaming data pipelines that share application data across a business. Businesses are fed insights that are up to date and accurate, based on the latest data from many systems, and the decisions made from these insights help them remain competitive in their respective markets.

How Etlworks can help

Etlworks is a high-performance data integration platform with next-generation change data capture technology.

Etlworks supports native log-based change data capture for PostgreSQL, SQL Server, MySQL, Oracle, DB2, and MongoDB databases. Our intuitive visual interface makes it easy to set up, monitor, and manage your data pipelines, eliminating the need for scripting and ensuring quick time-to-value. Unlike other tools that support CDC, Etlworks has only two moving parts – your database and Etlworks. You will be up and running with CDC in a matter of minutes!

Learn More about Log-Based CDC with Etlworks for Real-Time Data Replication

Contact Us! 

Real-time change replication with Kafka and Debezium


If you are looking for an end-to-end CDC solution that does not require third-party services such as Kafka Connect and Debezium, I recommend Etlworks' built-in CDC connectors. Read more. Use this tutorial if you are ingesting data into Kafka using standalone Debezium and wish to sink CDC events into any destination using the advanced capabilities of Etlworks connectors and streaming flows.

In this article, I demonstrate how to implement [near] real-time CDC-based change replication for the most popular databases using the following technologies:

  • Native CDC for each source database
  • Apache Kafka
  • Debezium
  • Etlworks Kafka connector with built-in support for Debezium

Overview

Change Data Capture (CDC), as its name suggests, is a design pattern that captures individual data changes instead of dealing with the entire dataset. Rather than dumping your entire database, with CDC you capture just the changes made to the master database and apply them to the BI databases to keep the two in sync.

Debezium is a CDC tool that can stream changes from Microsoft SQL Server, MySQL, MongoDB, Oracle, DB2, and PostgreSQL into Kafka, using Kafka Connect.
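Each captured change is published to Kafka as a structured event. As a rough illustration only (real Debezium events also carry a schema section and richer source metadata), the value of an update event for a customer row might look similar to this:

{
  "payload": {
    "before": { "id": 1001, "email": "old@example.com" },
    "after":  { "id": 1001, "email": "new@example.com" },
    "source": { "connector": "sqlserver", "db": "database_name", "table": "customer" },
    "op": "u",
    "ts_ms": 1675873200000
  }
}

The op field identifies the operation: c for insert, u for update, d for delete, and r for a snapshot read.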

Kafka Connect is a tool for streaming data between Apache Kafka and external systems. It is used to define connectors that move large collections of data into and out of Kafka.

How it works

The Etlworks Integrator streams the CDC events stored in Kafka, automatically transforms the events into DML SQL statements (INSERT / UPDATE / DELETE), and executes the statements in the target database in the order they were created. It also supports bulk load into columnar databases. The Etlworks streaming flow handles collisions and errors, making the solution highly reliable. It is more advanced and supports more destinations than the Kafka Connect sink connectors.

Etlworks - CDC with Kafka and Debezium

Sources and destinations

The following databases are supported as a source:

  • Microsoft SQL Server 
  • MongoDB
  • MySQL
  • Oracle
  • IBM DB2
  • PostgreSQL

Supported destinations:

Prerequisites

The solution requires installing and configuring Apache Kafka, Zookeeper, Kafka Connect, and Debezium.

This official tutorial from Debezium explains how to install and configure all required components. 

For customers on Enterprise plans, the Etlworks Integrator installs all required components. We provide assistance with installing the components for self-hosted customers.

Please contact Etlworks support if you want to enable a real-time change replication for your account.

Quick start

We recommend using Docker to install, configure, and run Debezium. Docker must be installed and running on the host OS.
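A quick sanity check before proceeding (the exact commands may vary depending on how Docker was installed):

docker info                # verifies that the Docker daemon is running and reachable
docker-compose version     # verifies that docker-compose is available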

docker-compose file

Here is an example of a docker-compose file that can be used to quickly install and start the latest versions of Kafka, Zookeeper, Kafka Connect with Debezium, Kafka UI, and Debezium UI on a development box.

version: '2'
services:
  # ZooKeeper instance used by the Kafka broker for cluster coordination
  zookeeper:
    image: quay.io/debezium/zookeeper:2.1.2.Final
    ports:
     - 2181:2181
     - 2888:2888
     - 3888:3888
  # Single-node Kafka broker that stores the CDC events
  kafka:
    image: quay.io/debezium/kafka:2.1.2.Final
    ports:
     - 9092:9092
    links:
     - zookeeper
    environment:
     - ZOOKEEPER_CONNECT=zookeeper:2181
     - KAFKA_AUTO_CREATE_TOPICS_ENABLE=true
     - ADVERTISED_HOST_NAME=etlworks
  # Kafka Connect with the Debezium connectors pre-installed (REST API on port 8083)
  connect:
    image: quay.io/debezium/connect:nightly
    ports:
     - 8083:8083
    links:
     - kafka
    environment:
     - BOOTSTRAP_SERVERS=kafka:9092
     - SCHEMA_HISTORY_INTERNAL_KAFKA_BOOTSTRAP_SERVERS=kafka:9092
     - GROUP_ID=1
     - KAFKA_AUTO_CREATE_TOPICS_ENABLE=true
     - CONFIG_STORAGE_TOPIC=connect_configs
     - OFFSET_STORAGE_TOPIC=connect_offsets
     - STATUS_STORAGE_TOPIC=connect_statuses
  # Debezium UI for managing connectors (exposed on http://localhost:8090)
  debezium-ui:
    image: quay.io/debezium/debezium-ui:2.2
    ports:
     - 8090:8080
    links:
     - connect
    environment:
     - KAFKA_CONNECT_URIS=http://connect:8083
  # Kafka UI for browsing topics and messages (exposed on http://localhost:8888)
  kafka-ui:
    image: provectuslabs/kafka-ui
    links:
     - kafka
    ports:
     - 8888:8080
    restart: always
    environment:
      - KAFKA_CLUSTERS_0_NAME=local
      - KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS=kafka:9092

How to start

Step 1. Create docker-compose.yaml using the example above in any folder.

Step 2. cd into that folder.

Step 3. Run docker-compose up.
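Put together, the commands might look like the following; the folder name is arbitrary.

mkdir debezium && cd debezium
# save the docker-compose.yaml from the example above into this folder, then:
docker-compose up
# add -d to run the containers in the background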

Docker Desktop

If you are running Docker on Windows or Mac, you are most likely using Docker Desktop, where you can see the containers provisioned after executing docker-compose up for the docker-compose.yaml above.

Configure Debezium to capture CDC events and publish them to the Kafka topic(s)

REST API

Assuming that Debezium is up and running, we will configure a connector to the source database using the Kafka Connect REST API. In this tutorial, we will use Microsoft SQL Server. Configuring connectors for other databases is equally simple.

Official tutorials  

The Debezium connectors are created using the Kafka Connect REST API, so make sure either curl or Postman is installed on your development box. In this tutorial, we will use Postman.

Step 1. Verify that Kafka Connect is installed and running.

The default port for Kafka Connect API is 8083. Assuming that it runs on localhost, the URL for the API endpoint which returns configured connectors is:

http://localhost:8083/connectors 
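If you prefer curl to Postman, the same check looks like this; it returns a JSON array of configured connector names, which is empty on a fresh installation.

curl -H "Accept: application/json" http://localhost:8083/connectors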

Step 2. Create a new connector for Microsoft SQL Server.

ENDPOINT URL: http://localhost:8083/connectors 

METHOD: POST

PAYLOAD (example):

{
  "name": "sqlserver-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "tasks.max": "1",
    "database.hostname": "localhost",
    "database.port": "1433",
    "database.user": "sa",
    "database.password": "password",
    "database.dbname": "database_name",
    "database.server.name": "database_server_name",
    "table.whitelist": "comma separated list of fully_qualified_table_names",
    "database.history.kafka.bootstrap.servers": "localhost:9092",
    "database.history.kafka.topic": "dbhistory.database_server_name.database_name",

    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "true",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "true",
    "include.schema.changes": "true"
  }
}
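With curl, the same request can be sent by saving the payload to a file (the file name below is just an example):

curl -i -X POST \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  --data @sqlserver-connector.json \
  http://localhost:8083/connectors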

Debezium UI

An alternative to the REST API for creating Debezium connectors is the Debezium UI. Here is an official tutorial for the Debezium UI from the creators of Debezium.

Note that it is installed when using docker-compose.yaml above. 

To start Debezium UI open the following URL in the browser: http://localhost:8090/.

The naming convention for Kafka topics

Debezium stores CDC events in a separate topic for each table. If the connector was configured using the following parameters:

"database.dbname": "database_name",
"database.server.name": "database_server_name",

The CDC events for the table customer will be stored in a Kafka topic database_server_name.database_name.customer.

Create a Flow to stream CDC events created by Debezium into any destination

Step 1. Create and configure source connection for message queue as explained here.

Step 2. When configuring a source connection, make sure Integration with CDC provider is set to Debezium.

Step 3. Create a streaming flow for any of the following destinations:

Step 4. Configure TO (destination).

The TO in the streaming transformation can include any of the [tokens] below:

  • [table] – the source table name.
  • [db] – the source database name.
  • [schema] – the source schema.
  • * – the source topic name.

The flow replaces each [token] with its actual value at runtime.

This example demonstrates how to configure a transformation to load data into the destination table in the public schema and with the same name as a source table: public.[table].

Read how to set the TO using JavaScript.

Step 5. Optionally, if you wish to MERGE (UPSERT) the data into the destination table, set Action to CDC MERGE when configuring the streaming flow. If the action is set to CDC MERGE, the flow will INSERT records that do not exist in the destination table, UPDATE existing records, and DELETE records that were deleted in the source table.

Schedule the Flow

We recommend using a continuous run Schedule type. The idea is that the streaming Flow runs until there are no more messages to extract or until an error occurs. When the Flow stops for either reason, it restarts automatically after the configured number of seconds.
