Data Replication Methods


Data replication takes data from your source databases — Oracle, MySQL, Microsoft SQL Server, PostgreSQL, MongoDB, etc. — and copies it into your destination data warehouse. After you have identified the data you want to bring in, you need to determine the best way to replicate the data so it meets your business needs.

Choosing the right method

The method you choose impacts the end state of your data. Fortunately, there are data replication methods built to integrate with today’s data warehouses and suit a variety of use cases. At Etlworks, we believe in providing users with as much flexibility as possible. Let’s discuss each of the five methods of data replication and outline the option that may be best for you.

High Watermark (HWM)

The concept of a watermark comes from the aftermath of a flood: you look at the water stains on a wall to figure out how high the water rose. That is essentially what we want to do here: figure out which item was updated last and continue from there. The high watermark is therefore a tool that simplifies querying for updated objects, which is a very common use case when synchronizing data.
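
For example, a high watermark query might look like the following sketch, where the table and column names are illustrative and :last_hwm is a bind variable holding the watermark persisted by the previous run:

-- Fetch only the rows changed since the last run
SELECT *
FROM customer
WHERE updated_at > :last_hwm
ORDER BY updated_at;

-- After loading, persist MAX(updated_at) as the new high watermark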

Pros:

  • fast
  • works for all data sources, including all databases, files, and APIs

Cons:

  • does not support deletes
  • requires a dedicated high watermark field in each table

Change Data Capture (CDC)

CDC is an approach to data integration based on identifying, capturing, and delivering the changes made to the source database and recorded in the database ‘redo log’, also called the ‘transaction log’. CDC, or log replication, is the fastest and most reliable way to replicate. It involves querying your database’s internal change log every few seconds and copying the changes into the data warehouse. CDC is the best method for databases that are updated continually, and it fully supports deletes.

Pros:

  • fast
  • no polling from database tables – uses database redo log instead
  • supports deletes
  • enables near real-time replication

Cons:

  • currently supports only Postgres, MySQL, SQL Server, and Oracle
  • some older versions of the databases above do not support CDC
  • requires extra setup in the source database (see the example after this list)
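
As an illustration of that last point, here is the extra setup CDC typically requires on Microsoft SQL Server; this is a minimal sketch, and the database and table names are illustrative:

-- Enable CDC at the database level (requires sysadmin)
USE my_database;
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for each table that should be captured
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'customer',
    @role_name     = NULL;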

Database Triggers

Trigger-based change replication can be implemented in many ways, but the basic idea is that each source table participating in change replication has triggers for INSERT, UPDATE, and optionally DELETE. The triggers update a shadow table (or tables). A shadow table may store the entire row, to track every single column change, or only the primary key together with the operation type (insert, update, or delete), as in the sketch below.
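
Here is one possible implementation for PostgreSQL, assuming a customer table with an integer primary key id; the shadow table stores only the key and the operation type, and all names are illustrative:

-- Shadow table: records which row changed and how
CREATE TABLE customer_shadow (
    id         INT       NOT NULL,
    operation  CHAR(1)   NOT NULL,  -- 'I', 'U', or 'D'
    changed_at TIMESTAMP NOT NULL DEFAULT now()
);

-- Trigger function: writes one shadow row per change
CREATE OR REPLACE FUNCTION capture_customer_change() RETURNS trigger AS $$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        INSERT INTO customer_shadow (id, operation) VALUES (OLD.id, 'D');
        RETURN OLD;
    ELSE
        INSERT INTO customer_shadow (id, operation)
        VALUES (NEW.id, CASE TG_OP WHEN 'INSERT' THEN 'I' ELSE 'U' END);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_change
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW EXECUTE FUNCTION capture_customer_change();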

Pros:

  • works for any source database that supports triggers
  • no requirements for a specific database version or for an extra field in each table

Cons:

  • requires adding triggers to all database tables
  • triggers can negatively impact performance

Real-time CDC with Kafka

Apache Kafka is a popular technology to share data between systems and build applications that respond to data events. Etlworks completes Apache Kafka solutions by delivering high-performance real-time data integration.

Etlworks parses the CDC events emitted to the Kafka topic, automatically transforms them into DML SQL statements (INSERT/UPDATE/DELETE), and executes the statements in the target database in the order they were created. It also handles collisions and errors, ensuring that the solution is 100% reliable.

Pros:

  • fast
  • no polling from database tables
  • supports deletes
  • supports real-time replication

Cons:

  • complex setup (requires Kafka, Zookeeper, Kafka Connect, and Debezium)
  • supports only Postgres, MySQL, SQL Server, Oracle, and MongoDB
  • some older versions of the databases above do not support CDC
  • requires extra setup in the source database

Full refresh

Sometimes the simplest approach is the best. The full refresh method is best for small tables, static data, or one-time imports. Because it copies the entire dataset on every run, it is typically slower than the other methods.
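
Conceptually, a full refresh is nothing more than emptying the destination table and reloading it from the source; a minimal SQL sketch, with illustrative table names:

-- Remove everything previously loaded
TRUNCATE TABLE warehouse.customer;

-- Reload the entire source table
INSERT INTO warehouse.customer
SELECT * FROM source.customer;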

Pros:

  • the simplest to set up
  • can be quite fast for relatively small datasets (<100K records)
  • works for all data sources

Cons:

  • not recommended for large datasets

Want to learn more about our replication options and what’s best for your data? Talk to us!

Real-time change replication with Kafka and Debezium


If you are looking for an end-to-end CDC solution that does not require third-party services such as Kafka Connect and Debezium, I recommend Etlworks’ built-in CDC connectors. Read more. Use this tutorial if you are ingesting data into Kafka using standalone Debezium and wish to sink CDC events into any destination using the advanced capabilities of Etlworks connectors and streaming flows.

In this article, I demonstrate how to implement [near] real-time Change Data Capture (CDC)-based change replication for the most popular databases using the following technologies:

  • Native CDC for each source database
  • Apache Kafka
  • Debezium
  • Etlworks Kafka connector with built-in support for Debezium

Overview

Change Data Capture (CDC), as its name suggests, is a design pattern that captures individual data changes instead of processing the entire dataset. Rather than dumping your entire database, with CDC you capture just the changes made to the master database and apply them to the BI databases to keep the two in sync.

Debezium is a CDC tool that can stream changes from Microsoft SQL Server, MySQL, MongoDB, Oracle, DB2, and PostgreSQL into Kafka, using Kafka Connect.

Kafka Connect is a tool for streaming data between Apache Kafka and external systems. It is used to define connectors that move large collections of data into and out of Kafka.

How it works

The Etlworks Integrator streams CDC events stored in Kafka, automatically transforms them into DML SQL statements (INSERT / UPDATE / DELETE), and executes the statements in the target database in the order they were created. It also supports bulk load into columnar databases. The Etlworks streaming flow handles collisions and errors, ensuring the solution is 100% reliable. It is more advanced and supports more destinations than the Kafka Connect sink connectors.


Sources and destinations

The following databases are supported as a source:

  • Microsoft SQL Server 
  • MongoDB
  • MySQL
  • Oracle
  • IBM DB2
  • PostgreSQL

Supported destinations:

Prerequisites

The solution requires installing and configuring Apache Kafka, Zookeeper, Kafka Connect, and Debezium.

This official tutorial from Debezium explains how to install and configure all required components. 

For customers on Enterprise plans, Etlworks installs all required components. We provide self-hosted customers with installation assistance.

Please contact Etlworks support if you want to enable real-time change replication for your account.

Quick start

We recommend using Docker to install, configure, and run Debezium. Docker must be installed and running on the host OS.

docker-compose file

Here is an example of a docker-compose file that can be used to quickly install and start the latest versions of Kafka, Zookeeper, Kafka Connect with Debezium, Kafka UI, and Debezium UI on a development box.

version: '2'
services:
  zookeeper:
    image: quay.io/debezium/zookeeper:2.1.2.Final
    ports:
     - 2181:2181
     - 2888:2888
     - 3888:3888
  kafka:
    image: quay.io/debezium/kafka:2.1.2.Final
    ports:
     - 9092:9092
    links:
     - zookeeper
    environment:
     - ZOOKEEPER_CONNECT=zookeeper:2181
     - KAFKA_AUTO_CREATE_TOPICS_ENABLE=true
     - ADVERTISED_HOST_NAME=etlworks
  connect:
    image: quay.io/debezium/connect:nightly
    ports:
     - 8083:8083
    links:
     - kafka
    environment:
     - BOOTSTRAP_SERVERS=kafka:9092
     - SCHEMA_HISTORY_INTERNAL_KAFKA_BOOTSTRAP_SERVERS=kafka:9092
     - GROUP_ID=1
     - KAFKA_AUTO_CREATE_TOPICS_ENABLE=true
     - CONFIG_STORAGE_TOPIC=connect_configs
     - OFFSET_STORAGE_TOPIC=connect_offsets
     - STATUS_STORAGE_TOPIC=connect_statuses
  debezium-ui:
    image: quay.io/debezium/debezium-ui:2.2
    ports:
     - 8090:8080
    links:
     - connect
    environment:
     - KAFKA_CONNECT_URIS=http://connect:8083
  kafka-ui:
    image: provectuslabs/kafka-ui
    links:
     - kafka
    ports:
     - 8888:8080
    restart: always
    environment:
      - KAFKA_CLUSTERS_0_NAME=local
      - KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS=kafka:9092

How to start

Step 1. Create docker-compose.yaml in any folder, using the example above.

Step 2. cd into that folder.

Step 3. Run docker-compose up.

Docker Desktop

If you are running Docker on Windows or Mac, you are most likely using Docker Desktop, which displays the containers provisioned after executing docker-compose up for the docker-compose.yaml above.


Configure Debezium to capture CDC events and publish them to the Kafka topic(s)

REST API

Assuming that Debezium is up and running, we will configure a connector to the source database using the Kafka Connect REST API. In this tutorial, we will be using Microsoft SQL Server. Configuring connectors to other databases is equally simple.


The Debezium connectors are created using the Kafka Connect REST API, so make sure either curl or Postman is installed on your development box. In this tutorial, we will be using Postman.

Step 1. Verify that Kafka Connect is installed and running.

The default port for the Kafka Connect API is 8083. Assuming that it runs on localhost, the URL of the API endpoint that returns the configured connectors is:

http://localhost:8083/connectors 
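
If you prefer curl over Postman, the same check can be made from the command line; assuming the defaults above, the call is:

curl -s http://localhost:8083/connectors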


Step 2. Create a new connector for Microsoft SQL Server.

ENDPOINT URL: http://localhost:8083/connectors 

METHOD: POST

PAYLOAD (example):

{
  "name": "sqlserver-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "tasks.max": "1",
    "database.hostname": "localhost",
    "database.port": "1433",
    "database.user": "sa",
    "database.password": "password",
    "database.dbname": "database_name",
    "database.server.name": "database_server_name",
    "table.whitelist": "comma separated list of fully_qualified_table_names",
    "database.history.kafka.bootstrap.servers": "localhost:9092",
    "database.history.kafka.topic": "dbhistory.database_server_name.database_name",

    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "true",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "true",
    "include.schema.changes": "true"
  }
}
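
If you are using curl instead of Postman, the same request can be sent as follows, assuming the payload above is saved as sqlserver-connector.json:

curl -i -X POST \
  -H "Content-Type: application/json" \
  --data @sqlserver-connector.json \
  http://localhost:8083/connectors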


Debezium UI

The alternative to REST API for creating Debezium connectors is a Debezium UI. Here is an official tutorial for Debezium UI from the creators of Debezium. 

Note that it is already installed when using the docker-compose.yaml above.

To start Debezium UI, open the following URL in the browser: http://localhost:8090/.

The naming convention for Kafka topics

Debezium stores CDC events in a separate topic for each table. If the connector was configured using the following parameters:

"database.dbname": "database_name",
"database.server.name": "database_server_name",

The CDC events for the table customer will be stored in a Kafka topic named database_server_name.database_name.customer.
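
Each message in that topic is a Debezium change-event envelope. Here is a simplified sketch of an update event, with illustrative field values and the schema and source metadata omitted:

{
  "payload": {
    "before": { "id": 42, "name": "Acme" },
    "after":  { "id": 42, "name": "Acme Corp" },
    "op": "u",
    "ts_ms": 1676301234567
  }
}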

Create a Flow to stream CDC events created by Debezium into any destination

Step 1. Create and configure source connection for message queue as explained here.

Step 2. When configuring a source connection, make sure Integration with CDC provider is set to Debezium.


Step 3. Create a streaming flow for any of the following destinations:

Step 4. Configure TO (destination).

The TO in the streaming transformation can include any of the [tokens] below:

  • [table] – the source table name.
  • [db] – the source database name.
  • [schema] – the source schema.
  • * – the source topic name.

The flow substitutes each [token] with its corresponding value.

This example demonstrates how to configure a transformation that loads data into a destination table in the public schema with the same name as the source table: public.[table].


Read how to set the TO using JavaScript.

Step 5. Optionally, if you wish to MERGE (UPSERT) the data into the destination table, set Action to CDC MERGE when configuring the streaming flow. With CDC MERGE, the flow will INSERT records that do not exist in the destination table, UPDATE existing records, and DELETE records that were deleted in the source table.
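
Conceptually, CDC MERGE applies each change event with the equivalent of the following DML; this is a sketch only, with illustrative table, key, and column names:

-- Insert-or-update events become a MERGE keyed on the primary key
MERGE INTO public.customer AS dest
USING (SELECT 42 AS id, 'Acme Corp' AS name) AS src
ON dest.id = src.id
WHEN MATCHED THEN UPDATE SET name = src.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (src.id, src.name);

-- Delete events become deletes by the same key
DELETE FROM public.customer WHERE id = 42;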

Schedule the Flow

We recommend using the continuous run Schedule type. The idea is that the streaming Flow runs indefinitely, stopping only when there are no messages to extract or when an error occurs. When the Flow stops for either of these reasons, it restarts automatically after the configured number of seconds.

Etlworks CEO: Efficient data structures and real-time data streaming


Search Business Analytics recently interviewed Etlworks CEO Maksym Sherbinin about the importance of creating efficient data structures and real-time data streaming.

Here are his insights:

Q: To what extent does defaulting to JSON as a data storage mechanism contribute to decreased analytics performance and larger file sizes?

A: There are two factors that contribute to JSON (as well as CSV and XML) not being an optimal format for storing and processing large datasets:

  • The lack of schema enforcement, which results in the same data being injected and stored multiple times in the data warehouse, has a direct impact on performance, quality of service, and cost.
  • JSON-encoded datasets cannot be stored in a parallel manner, which decreases their usefulness for systems built on top of Hadoop and other frameworks used for distributed processing.

Q: What are the existing alternatives for developers to create more efficient data structures for different kinds of analytics workloads in terms of data/file formats?

A: There are two very distinct alternatives:

  • Column-based databases optimized for data warehousing and massively parallel processing, specifically Snowflake and Amazon Redshift.
  • Formats optimized for use in Hadoop clusters and data streaming: Optimized Row Columnar (ORC), Avro, and Parquet.

Q: What is the current state of tools that can automate some of this process (i.e. are there tools that can look at how an app works with data and recommend a better file format or data structure, and perhaps compile or transform it into these more efficient formats like traditional programming language compilers)?

A: When it comes to encoding data in any of the Hadoop-optimized formats, such as Avro and Parquet, there are great open-source libraries available for practically any programming language known to humankind. Converting data stored somewhere else (for example, in a database) to Avro or Parquet datasets and storing them in a distributed file system can be performed using home-grown tools or (preferably) a good ETL tool that can connect to any source, map, transform, and store data in the Hadoop cluster. Below are some of the any-to-any ETL tools that support various sources, Avro and Parquet as destination formats, and the Hadoop Distributed File System (HDFS) as a destination:

  • Etlworks
  • Talend
  • Informatica

The other trend is real-time data streaming using distributed platforms such as Apache Kafka. The Alooma enterprise pipeline is a good example of a service that can stream large datasets, encoded as Avro or Parquet, to analytical platforms.

Additionally, there is a growing list of tools that can be used to inject data into column-based data warehouses built for the cloud, such as Snowflake and Amazon Redshift. Depending on your requirements, you might consider the following tools:

  • Matillion ETL
  • Etlworks Integrator
  • AWS Glue

Finally, selecting the best format for the job is not something any tool should do for you automatically, but having multiple options is a must. The ability to experiment and prototype with different formats, storage options, and injection techniques (batch load, streaming, SQL, etc.) is why a good ETL tool goes a long way. It saves you time and money and provides peace of mind.


Q: What are your thoughts on the best practices for finding the balance between identifying the most efficient data structures for a given analytics use case, and quickly experimenting with different analytics use cases?

A: This article describes how Uber went from a general-purpose database, to JSON-encoded datasets, to a Hadoop-based big data platform. Granted, not all businesses require something that extreme, so common sense is your best friend. A typical relational database such as MySQL or PostgreSQL is perfectly capable of storing and processing relatively large analytical datasets (millions of records). Models with hundreds of millions or billions of records require specialized solutions: either column-based data warehouses or Hadoop-like clusters with a parallel file system.