Real-World Change Data Capture Example

Change data capture (CDC) tracks row-level changes in a source database and propagates them to a destination instantly or near-instantly. In practice, CDC is most often used to replicate data between databases in real time, which makes it a good fit for non-intrusive, low-impact real-time data integration.

I will illustrate how the Etlworks change data capture solution works for an eLearning company that needs to load data from 1,600+ MySQL databases into a Snowflake data warehouse. Thanks to Etlworks, the company can now make everyday operational data available to non-technical business users. The data warehouse in Snowflake is updated every 5 minutes, so users gain quick, current business insights.

The company's requirements

  • Set up a data pipeline to load incremental updates from 1,600+ MySQL databases into the Snowflake data warehouse.
  • The data pipeline must support INSERTs, UPDATEs, and DELETEs.
  • The data pipeline must be able to automatically pick up new databases and adjust the destination schema if new columns are added to the source tables.
  • The expected volume of data: hundreds of gigabytes and billions of records on initial load, tens of millions of updates every day.
  • The expected number of tables across all MySQL databases: 55,000.
  • The number of tables in Snowflake: 35.
  • The data pipeline must be extremely resilient to extract and load errors.
  • The data pipeline is expected to work in a fully automated mode.

Etlworks Solution

Step 1: Setting up MySQL read replica instance in RDS

The Etlworks team recommended streaming CDC events from a MySQL read replica. Our customer set up a MySQL instance in Amazon RDS and configured native MySQL replication from the production instance to the replica. Creating a read replica is an optional step, but it significantly lessens the replication load on the production MySQL instance.

The following permissions need to be configured for the MySQL user in the read replica MySQL instance.

  • SELECT: enables the connector to select rows from tables in databases.
  • RELOAD: enables the connector to use the FLUSH statement to clear or reload internal caches, flush tables, or acquire locks.
  • SHOW DATABASES: enables the connector to see database names by issuing the SHOW DATABASES statement.
  • REPLICATION SLAVE: enables the connector to connect to and read the MySQL server binlog.
  • REPLICATION CLIENT: enables the connector to use the following statements: SHOW MASTER STATUS, SHOW SLAVE STATUS, and SHOW BINARY LOGS.
  • LOCK TABLES: on Amazon RDS or Amazon Aurora instances that do not allow a global read lock, table-level locks are used to create a consistent snapshot.
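
These permissions can be granted with a single GRANT statement. Below is a minimal sketch in Python using the mysql-connector-python driver; the host, credentials, and the etlworks_cdc user name are hypothetical, and the user is assumed to already exist:

import mysql.connector  # pip install mysql-connector-python

# Connect to the read replica as an administrative user (hypothetical credentials).
conn = mysql.connector.connect(host="replica.example.com", user="admin", password="***")
cur = conn.cursor()

# Grant the privileges required by the CDC connector to the replication user.
cur.execute(
    "GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, "
    "REPLICATION CLIENT, LOCK TABLES ON *.* TO 'etlworks_cdc'@'%'"
)

cur.close()
conn.close()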

Binlog replication for the MySQL read replica instance needs to be enabled by setting the following Amazon RDS parameters:

binlog_format: ROW
log_bin_use_v1_row_events: 1
net_read_timeout: 3600
net_write_timeout: 3600
wait_timeout: 86400
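
Once the parameters are applied, the settings can be verified directly on the replica. Here is a quick sanity check, sketched with the mysql-connector-python driver and hypothetical credentials:

import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host="replica.example.com", user="etlworks_cdc", password="***")
cur = conn.cursor()

# binlog_format must be ROW so the CDC connector receives full row images.
cur.execute("SHOW VARIABLES WHERE Variable_name IN ('binlog_format', 'log_bin', 'wait_timeout')")
for name, value in cur.fetchall():
    print(name, "=", value)

cur.close()
conn.close()
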
Step 2: Setting up flows to extract data from MySQL using CDC

In Etlworks, there are two options for extracting data using CDC and loading it into Snowflake:

  1. Create one flow to extract data from the database and create CSV files in the local storage, and another flow to load the CSV files into Snowflake.
  2. Create one flow to extract data from the database and ingest it into a message queue, such as Kafka, and another flow to load data from the queue into Snowflake.

For this project, we decided to use option 1. Option 2 requires a complicated setup with a separate message queue such as Kafka or Azure Event Hubs.

A typical CDC flow can extract data from multiple tables in multiple databases, but having a single flow pulling data from 55,000+ tables would be a major bottleneck, as it would be limited to a single blocking queue with limited capacity. It would also create a single point of failure.

A better approach would be to create multiple parallel extract flows, each pulling data from all 35 tables in a single database. However, considering that data must be extracted from 1,600 MySQL databases, this would be impractical and very resource-demanding: we would have to run 1,600+ parallel extract flows.

We decided to create 35 extract flows, where each flow pulls data from a single table across all 1,600 databases.

There are other possible topologies to consider as well, for example:

  • group tables and databases alphabetically;
  • create separate extract flows for the large and high-traffic tables;
  • for the rest of the tables, split flows into chunks, each extracting data from a significant number of tables across multiple databases (for example, 1,000);

We recommend selecting the topology that works best for you, keeping in mind the performance and maintenance overhead.

With this design choice in mind, we created a table in Snowflake that holds the list of databases to extract data from. This table is used to automatically populate the list of included tables in the format database1.table_abc,database2.table_abc,...database1500.table_abc and the list of included databases in the format database1,database2,...database1500.

The basic idea is that we can use a single MySQL CDC connection where the Included Databases and Included Tables properties are set as {tokens}, populated at run time by JavaScript.
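
To illustrate the idea, here is the equivalent token-building logic as a minimal Python sketch (in Etlworks the tokens are populated by JavaScript at run time; the database and table names below are hypothetical):

# Build the Included Databases / Included Tables token values from the list of
# databases stored in the Snowflake control table (values shown here are hypothetical).
databases = ["database1", "database2", "database3"]  # normally read from the control table
table_name = "table_abc"                             # the single table this extract flow handles

included_databases = ",".join(databases)
included_tables = ",".join(f"{db}.{table_name}" for db in databases)

print(included_databases)  # database1,database2,database3
print(included_tables)     # database1.table_abc,database2.table_abc,database3.table_abc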

Step 3: Setting up flows to load data in Snowflake

Once again, there are multiple options to consider.

Option 1. Initially, we set up a flow that loads the staged CSV files into Snowflake using the CDC MERGE action, which applies INSERTs/UPDATEs/DELETEs in the order in which the CDC events originated in the source database.

For each CSV file the flow does the following:

  1. Creates a temporary table in Snowflake.
  2. Executes COPY INTO command to load the file “as is” into the temp table.
  3. Uses Snowflake MERGE command to merge data in the temp table with the data in the actual table.
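
The three steps above correspond roughly to the following Snowflake SQL, shown as a sketch in Python using the snowflake-connector-python driver (the table, column, and stage names are hypothetical; the real flow generates these statements automatically):

import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(account="myaccount", user="etl", password="***",
                                   warehouse="ETL_WH", database="DW", schema="PUBLIC")
cur = conn.cursor()

# 1. Create a temporary table that mirrors the CSV layout, including the CDC operation column.
cur.execute("CREATE TEMPORARY TABLE orders_tmp "
            "(id NUMBER, status STRING, updated_at TIMESTAMP_NTZ, cdc_op STRING)")

# 2. Load the staged CSV file "as is" into the temporary table.
cur.execute("COPY INTO orders_tmp FROM @cdc_stage/orders.csv FILE_FORMAT = (TYPE = CSV)")

# 3. Merge the temporary table into the actual table, honoring the CDC operation flag.
cur.execute("""
    MERGE INTO orders t USING orders_tmp s ON t.id = s.id
    WHEN MATCHED AND s.cdc_op = 'd' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.status = s.status, t.updated_at = s.updated_at
    WHEN NOT MATCHED AND s.cdc_op <> 'd' THEN INSERT (id, status, updated_at)
        VALUES (s.id, s.status, s.updated_at)
""")

cur.close()
conn.close()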

This approach guarantees the smallest latency, but it is more resource-consuming and requires more Snowflake credits (it can be more expensive from the Snowflake standpoint).

Option 2. Skip DELETE events and apply only INSERTs and UPDATEs. This can be a bit faster than option 1.

Option 3. Always INSERT data into staging tables in Snowflake, then periodically execute a SQL script that populates the actual tables by de-duplicating the staging tables and removing ‘d’ (delete) records.

This option is very popular when customers are OK with a longer delay between data being updated in the source and becoming available in the data warehouse for consumption by BI tools.

After extensive testing, our customer decided to use option 3. The main reason was that the [almost] real-time approach (option 1) uses Snowflake cloud services and consumes extra Snowflake credits.
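
With option 3, the periodic script that refreshes an actual table from its staging table can look roughly like the sketch below (Python with the snowflake-connector-python driver; the table and column names are hypothetical, and the staging table is assumed to carry the CDC operation flag and an event timestamp):

import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(account="myaccount", user="etl", password="***",
                                   warehouse="ETL_WH", database="DW", schema="PUBLIC")
cur = conn.cursor()

# Rebuild the actual table from the staging table: keep only the latest CDC event
# per key and drop keys whose latest event is a delete ('d').
cur.execute("""
    INSERT OVERWRITE INTO orders (id, status, updated_at)
    SELECT id, status, updated_at
    FROM orders_staging
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY cdc_event_ts DESC) = 1
        AND cdc_op <> 'd'
""")

cur.close()
conn.close()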

Step 4. Scheduling extract and load flows

The Extract Flows

In Etlworks, it is possible to schedule a flow to run continuously until there is nothing left to do, then stop for a configurable number of seconds and restart. We recommended this schedule type for the 35 extract flows. The extract flows run until they are automatically stopped to let the system pick up new databases. The customer set the delay between restarts to 2 hours.

The load flow

The load flow loads files into Snowflake in batches. We recommended running it every few minutes so it can clear the queue as often as possible. The customer set it to run every 5 minutes.

On average, the pipelines load tens of millions of records into Snowflake daily, but on some days the number jumps to hundreds of millions. As configured, the pipelines can easily handle extracting and loading billions of records a day.

Conclusion

The Etlworks platform enables Snowflake users to leverage low-impact, real-time change data capture from all major databases: MySQL, Postgres, SQL Server, Oracle, DB2, and MongoDB.

Moving data continuously, as new database transactions occur, makes it possible for Snowflake users to maintain real-time data pipelines that feed Snowflake’s fast and flexible storage and analytics solutions.

If you would like to see CDC to Snowflake in action, please schedule a brief demo.

What is an API integration?

An API integration is the connection between two or more applications, via their APIs, that lets those systems exchange data. API integrations power processes throughout many high-performing businesses, keeping data in sync, enhancing productivity, and driving revenue.

What is an API?

An API is a set of definitions and protocols for building and integrating application software. API stands for application programming interface.

An API is like a building block of programming that helps programmers avoid writing code from scratch each time it is needed. APIs are organized sets of commands, functions, and protocols that programmers use to develop software.

Consider building something with a box of Lego bricks. Instead of creating or carving a new block each time you need one, you simply choose from an assortment of blocks that are prepared and ready to be plugged into your project. Each block is designed to connect with other blocks in order to speed up the building process. In essence, that’s how an API works. APIs streamline and boost efficiency everywhere they are used.

The most commonly used API style for web services: REST

REST stands for Representational State Transfer. Where a regular API is essentially a set of functions that allows communication between applications, a REST API is an architectural style for connected applications on the web.

In fact, today 70% of public APIs depend on REST services. REST APIs offer more flexibility and a gentler learning curve, and they work directly over HTTP URLs rather than relying on XML-based protocols such as SOAP.

How do REST APIs work?

In their simplest form, REST APIs for web services usually involve the following parties:

  • Your web-based, API-enabled application
  • Remote server
  • Specific data request
  • Returned data/function

While there are many different flavors of software and many different flavors of server, REST APIs act as a standardized wrapper that helps your API-enabled applications communicate with online servers to request information.
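
For example, a minimal request to a (hypothetical) REST endpoint looks like this in Python, using the requests library:

import requests  # pip install requests

# Ask a hypothetical remote server for a specific piece of data.
response = requests.get(
    "https://api.example.com/v1/customers/42",             # the resource we want
    headers={"Authorization": "Bearer <your-api-token>"},   # most APIs require a credential
    timeout=10,
)

response.raise_for_status()  # fail loudly on 4xx/5xx responses
customer = response.json()   # the returned data, typically JSON
print(customer)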

Now, how do API integrations work?

The term API integration can be defined as the process of enabling two or more APIs to share data and communicate with each other without human intervention. It involves the use of APIs to enable communication between two web tools or applications, and it allows organizations to automate their systems, share data seamlessly, and integrate their current applications.

API integration has become pivotal in the modern world due to the explosion of cloud-based products and apps. Organizations need a connected system where data is relayed flawlessly between various software tools without manual effort. API integration has proved to be the much-needed solution, as it allows the sharing of process and enterprise data among applications in a given ecosystem. It improves the flexibility of information and service delivery and makes it easy to embed content from different sites and apps. An API acts as the interface that permits the integration of two applications.

Building API integration into your application or workflow allows you to make the most of your data by employing it where it will be most useful. Whether this means using it for analytics or as content within your software or applications, integration keeps you connected to your business information.

Getting started with APIs

Data integration using API technology doesn’t have to be daunting. Etlworks Integrator uses API-led connectivity to create an application network of apps, data, and devices, both on-premises and in the cloud. It provides developers with a complete toolkit to enhance scalability, simplify mapping and security, and manage, prepare, and profile all of your company’s data. Whether you want to build event-driven data integration flows or incorporate Integrator into your own automation workflow, you can use the developer features below to achieve your goals.

See how Etlworks can help you make the most of APIs with a free demo of the Data Integration Platform.

What is Data Integration?

In 2020, it is safe to say that one of a company’s most significant assets is its data.

Data integration involves collecting data from different sources, combining this data into a unified set that can easily be used for a predetermined business function, and enabling you to query and manipulate all of your data from a single interface and derive analytics and statistics. While the sources and types of data continue to grow, it becomes increasingly important to be able to perform quality analysis on that data.
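
As a toy illustration of the "combine into a unified set" step, the sketch below merges two hypothetical CSV sources with pandas and derives a simple statistic from the unified data (the file and column names are made up):

import pandas as pd  # pip install pandas

# Two data sources that describe the same customers in different systems.
crm = pd.read_csv("crm_customers.csv")         # e.g. customer_id, name, segment
billing = pd.read_csv("billing_invoices.csv")  # e.g. customer_id, invoice_total

# Combine them into a unified set that can be queried from a single place.
unified = crm.merge(billing, on="customer_id", how="left")

# Derive a simple statistic from the integrated data.
print(unified.groupby("segment")["invoice_total"].sum())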

Benefits of Data integration

Data integration is becoming more and more common, as numerous apps and companies race to meet consumer demand to have all of their data collected in one place and in a useful format.

From small businesses that want to spend less employee time creating reports, to large corporations taking a deep dive into an array of user statistics, data integration is indispensable for businesses that want to compete in today’s economy.

Here are the primary benefits of data integration offered to businesses.

Integrate data from multiple sources

Businesses use tons of applications, systems, and data warehouses within their organizations. Nevertheless, these data sources are often disparate and siloed. A data silo is a repository of data that is isolated: generally, this means the information is under the control of one business unit or department and is not available across the organization. When systems containing valuable data are integrated across an organization, the data gathered by one app or department can benefit the company as a whole, not just a team or individual.

Inter-system cooperation ensures that your company is free from information silos that benefit the few, and allows data gathered across the organization to be viewed and used by anyone who needs access to it.

For example, integrating data from multiple online stores can give you a more complete understanding of customer behavior and payment processing preferences.

Data integrity and data quality

Data integrity is an important element of data integration: it is the assurance of the consistency and quality of data throughout its entire lifecycle. Nevertheless, much of the data businesses collect is still of poor quality. Eliminating errors in data sets used for business intelligence and decision making is one of the primary advantages of data integration.

When data is gathered manually, every tool, database, and user account must be properly accounted for and set up before the data is collected, meaning that any source overlooked or added at the last minute results in an incomplete data set. It also means that reporting protocols must be re-worked whenever a new data source is added.

When data systems are integrated properly, however, errors from overlooked sources do not occur and reports can be run and accessed in real time.

Save time and boost efficiency

When a company takes measures to integrate its data properly, it cuts down significantly on the time it takes to prepare and analyze that data. The automation of unified views cuts out the need for manually gathering data, and employees no longer need to build connections from scratch whenever they need to run a report or build an application.

Additionally, using the right tools, rather than hand-coding the integration, returns even more time (and resources overall) to the dev team.

When systems are properly integrated, collecting data and converting it into its final, usable format becomes a quick, easy task instead of a treasure hunt across your company’s data assets.

All the time saved on these tasks can be put to other, better uses, with more hours earmarked for analysis and execution to make an organization more productive and competitive.

Increase competitiveness

This is the most straightforward of all the benefits of data integration.

Having a data integration strategy in place helps you plan the actions needed to improve the accessibility of data both externally and internally, which in turn influences many vital parts of your business. Yes, the overall goal is to generate more profit, but an essential part of that is delighting your customers, so offering better service than your competition cannot be overlooked.

Tools for Data Integration

Our data is not going to integrate itself, and there are numerous tools on the market that can help: open-source data integration tools, cloud-based data integration tools, and on-premises data integration tools, to name a few categories.

Again, the question is how to choose among the various tools available in the market. The features you should look for in a data integration tool are:

A lot of connectors. There are many systems and applications in the world; the more pre-built connectors your Data Integration tool has, the more time your team will save.

Portability. It is important, as companies increasingly move to hybrid cloud models, to be able to build your data integrations once and run them anywhere.

Ease of use. Data integration tools should be easy to learn and easy to use with a GUI interface to make visualizing your data pipelines simpler.

A transparent price model. Your data integration tool provider should not ding you for increasing the number of connectors or data volumes.

Cloud compatibility. Your data integration tool should work natively in a single cloud, multi-cloud, or hybrid cloud environment.

Business intelligence, analytics, and competitive edges are all at stake when it comes to data integration. That is why it is critical for your company to have full access to every data set from every source.

Simple data integration with Etlworks

Etlworks is an all-in-one, any-to-any data integration platform for all your ETL projects, regardless of the complexity, data location, format and volume. There is no code to write — with just a few clicks, Etlworks will extract your data from wherever it lives and get it ready to be analyzed, understood, and acted upon.

Etlworks offers a free 14-day trial. Give it a try today!

Version Control and Why You Need It


Data integration is becoming a hot topic.

Business intelligence applications can make use of a comprehensive set of information provided through data integration to derive important business insights from a company’s historic and current data. By providing executives and managers with an in-depth understanding of the company’s current operations, as well as the opportunities and risks it faces in the marketplace, data integration can have a direct bottom-line impact.

In the past, the job of integrating systems was a complex, often costly, frequently cumbersome process involving on-premises integration tools. These days a new breed of applications, SaaS (Software-as-a-Service), is taking over as the integration platform of choice for new integration projects. There is a lot to consider when developers and IT management approach a data integration project.

Features like ease of use, scalability and flexibility, real-time integration, security, and compliance are essential. This blog will talk about the importance of one feature that often gets overlooked – version control.

Why is version control important?

Because a data integration tool is software: it performs a data integration process that moves data from the source to the destination. And, as with any big software project that involves technical concepts, requires collaboration between different team members, and needs frequent changes, you need a version control system (VCS).

Version control is essential to track, organize, and control changes to source code and to avoid confusion, especially for large, fast-changing projects like data integration.

Etlworks is a modern, cloud-first, any-to-any data integration platform with built-in version control.

Etlworks Integrator automatically tracks changes for all artifacts: connections, formats, listeners, flows, macros, and schedules. There is nothing to configure except the retention policy.

Using Etlworks’ built-in version control you can:

  • View the history of changes – who-changed-what-and-when.
  • Compare any two versions.
  • Revert to any previous version.
  • Add comments to the commit when saving the artifacts.

Etlworks version control 

To access the version control UI, click the [@] button at the bottom left corner of each screen.


Then, click the [@] button at the bottom left corner of each screen to view the history of changes.


You can compare any two versions by selecting them and clicking [View Diff between Selected].

The changes are displayed in a two-panel or unified view. From this view, you can navigate back and forward in the history of changes and also revert to any version.

To view a specific version and compare it with the previous one, click the [eye] button.

To revert to any previous version:

Step 1. Click the [Revert] button.

Step 2. Confirm that you want to revert to the previous version by clicking the Revert button.

Step 3. Click [Save] or [Save with message] to complete the revert.

When saving any change to any artifact, you have the option to add a message (comments) to the commit. To add a message to the commit, click the [Save with message] button.

The commit message is displayed in the Change History popup.

In Conclusion

After reading this blog, hopefully you have a better understanding of version control. Version control is very helpful for organizing and backing up the artifacts you are working with. It is also helpful when multiple people work on the same artifact. Version control is a great way to keep your artifacts organized and backed up in case the worst happens.

Begin your data journey with a 14-day free trial!

EDI Data Integration & Why It’s Important


Today more than 85% of all electronic business transactions take place utilizing Electronic Data Interchange (EDI). This creates major competitive advantages for businesses and their partners. The EDI process has proven to be the most efficient and secure process to exchange business documents globally. EDI is a necessary component to ensure that your organization is achieving operational excellence.

What is EDI?

Electronic Data Interchange is the computer-to-computer exchange of business documents in a standard electronic format between business partners. By moving from a paper-based exchange of business documents to an electronic one, businesses enjoy major benefits such as reduced costs, increased processing speed, reduced errors, and improved relationships with business partners.

EDI provides a safe, reliable, secure, and established method of exchanging documents for all industries. The retail industry uses EDI documents such as the Purchase Order and Invoice. The healthcare industry relies almost entirely on the HIPAA healthcare claim and corresponding payment EDI documents.

Why is EDI Necessary?

The adoption of electronic data interchange is critical for companies of all sizes to maintain their competitiveness in the marketplace. Over three decades of global usage has proven that EDI improves operational efficiency across your entire organization.

As an automation technology, EDI delivers core business benefits:

  • Saves time and money: automates a process previously manually executed with paper documents.
  • Improves efficiency and accuracy: data-entry errors are eliminated.
  • Improves traceability and reporting: electronic documents can be integrated with a range of IT systems to support data collection, visibility, and analysis.
  • Improves relationships with your customers: enables efficient transaction execution and prompt, reliable product and service delivery.

EDI is important to both large and small businesses. For large organizations, EDI enables standards to be instituted across trading partners to consistently achieve benefits. For smaller organizations, adherence to EDI offers greater integration with larger firms that have big budgets and strong influence.

Technologies like XML, JSON, and API integration complement, rather than replace, EDI. Companies must be ready to handle an ever-increasing number of document formats and transmission options.

Why choose Etlworks for EDI Integration?

Etlworks allows your team to easily automate the handling, processing, and integration of your electronic data interchange information. You’ll improve your organization’s communication, customer service, and cash flow by automating your processes.

With Etlworks, you can automate data flows that generate EDI messages from internal data or process incoming EDI messages and integrate them with your internal applications and databases, improving process control.

The EDI connector in Etlworks Integrator contains components that convert messages between EDI and XML and vice versa.

Etlworks supports numerous EDI formats — HL7, EDI X12, EDIFACT, FHIR, JSON, Flat File, CSV, Delimited, XML, AVRO and more.

EDI is only the tip of what we have the capability to handle. If you’d like to learn more about how we can streamline your EDI software integration, request a personalized demo from an Etlworks expert.