Data Integration Project: Steps to Success

Data integration involves bringing together information from disparate sources in order to generate meaningful insight. Reconciling data generated from the software, equipment, and personnel across all of the functional areas of your business can provide you with the valuable information you need to make the right decisions.

However, combining these different sources of data can be a complex, and honestly, headache-inducing challenge.

It can be quite difficult to know the exact scope of a data integration project, particularly when data is siloed across many different functional areas of your business.

Choosing the right approach to integration can save you money and help ensure integration project success. Choosing unwisely can increase the cost of doing integration or prevent you from meeting business-related objectives. In order to maximize success and minimize re-work, a business evaluation should start and guide each integration effort.

Before initiating any data integration project, it is necessary to take the right steps in preparation. While every company has its own unique needs, there are some general tips you can follow in order to achieve smooth and successful data integration.

Find the Best Data Integration Provider

There are many vendors out there with various data integration solutions that are both efficient and resourceful. Therefore, choosing the most suitable data integration solution for your business should be your number one priority.

Finding the right vendor who can overcome all the data integration challenges while implementing the right data management strategy with timely delivery and speed is the most important piece of the puzzle.

Establish a Data Governance Process

To unlock your data’s full value, you should establish and implement a set data governance process in your organization. This process needs to prioritize and include managing risks, data quality, business processes, and data management as a whole.

Having a set data governance policy in place will help you improve your operational processes. Also, it will help you ensure that your data is present in the right format, with the right quality and maximum availability for your stakeholders.

Implement Data Security

Businesses in touch with the latest data integration trends also need to find a way to safely and securely connect on-premise data using different cloud applications and systems.

Taking action on this subject should be a priority, considering the large volume of data that keeps growing.

Data integration projects can be complex, slow, and frustrating. However, it does not have to be the reality. 

With Etlworks robust and intuitive solution, data integration is quite simple and straightforward and the data integration project is easy and fast.

What is reverse ETL? And why is it valuable?

Reverse ETL is a new key component of the modern data stack that enables “operational analytics.”

ETL/ELT

Before defining “reverse ETL”, let’s briefly talk about plain old ETL. Extract, transform, and load (ETL) is a data integration methodology that extracts raw data from sources, transforms the data on a secondary processing server, and then loads the data into a target database.

ETL is nothing new. The concept was actually popularized in the 1970s.

More recently, with the rise of cloud data warehouses, extract, load, transform (ELT) is beginning to replace ETL. Unlike the ETL method, ELT does not require data transformation before the loading process. ELT loads raw data directly into a cloud data warehouse. Data transformations are executed inside the data warehouse via SQL pushdowns, Python scripts, and other code.

ETL and ELT both transfer data from third-party systems, such as business applications (Hubspot, Zendesk, Salesforce) and/or databases (Oracle, MySQL), into target data warehouses. But with reverse ETL, the data warehouse is the source, rather than the target. The target is a third-party system.

What is reverse ETL?

Reverse ETL is the exact inverse process of ETL covered above. Simply put, it’s the process of copying data from the data warehouse to Saas products used by organizations.

Why would I move data out of the warehouse?  Well, companies today increasingly engage in operational analytics – an approach consisting in making data accessible to “operational” teams, for operational use cases (sales, marketing, ..). We distinguish it from the more classical approach of using data stored in the warehouse only for reporting and business intelligence. Instead of using data to influence long-term strategy, operational analytics informs strategy for the day-to-day operations of the business. To put it simply, it’s putting the company’s data to work so everyone in your organization can make smarter decisions.

Reverse ETL is the tool responding to the new practice of operational analytics. It can be seen as a bridge between your data warehouse and cloud applications. Reverse ETL tools move the data out of your warehouse to the SaaS products your team loves and uses. For example, the Sales team wants the list of webinar attendees to import as leads into Salesforce, the Support team wants to see on Zendesk the data about accounts with premium support, Finance team wants a CSV of rolled up transaction data to use in Excel or Google Sheets.

The data needed is already available in the data warehouse and with reverse ETL, all you really need is to extract data from the warehouse and sync it to external tools, making it the simplest solution.

Etlworks and reverse ETL

Creating a Reverse ETL pipeline from scratch for such data is a complex process since businesses will have to utilize a high amount of resources to develop it and then ensure that it can keep up with the increased data volume and Schema variations.

Etlworks is a cloud-native data integration platform that allows you to perform ELT or reverse ETL with the help of your cloud data platform.

Etlworks helps you directly transfer data from a source of your choice, such as Snowflake, Amazon Redshift, etc., to any SaaS applications, CRMs such as Salesforce, HubSpot, etc., Support tools such as Zendesk, Jira, etc., in a fully automated and secure manner without having to write the code repeatedly. It will make your life easier and make data migration hassle-free.

Want to take Etlworks for a ride? Sign Up for a 14-day free trial and see the difference!  Check out the pricing model to get a better understanding of which plan suits you the most.

Data integration platform vs. in-house solution

In-house data solutions – otherwise known as the small thing that someone built ages ago and now it’s somehow grown into a critical part of your data infrastructure and you’d rather not touch it because who knows how it all works? Also, some companies are afraid of risks of migration, so they are pursuing the “if it ain’t broke, don’t fix it” strategy which is not always the most efficient or cost-effective. 

It’s a very common scenario. But those in-house solutions can quickly become harder and harder to manage and extend to support increasingly complicated requirements.

At some point you face the decision: do you keep improving an in-house solution or do you reassess your needs and look around for a possible off-the-shelf solution?

If you are reading this article, you are in a position when you are:

  • Tired of dealing with the complicated inputs and outputs and don’t want to write a custom code anymore.
  • Need a way to connect to all your data, regardless of its format and location.
  • Need a simple way to transform your data from one format to another.
  • Need to track changes in your transactional database and push them to your data warehouse.
  • Need to connect to the external and internal APIs with different authentication schemas, requests, and responses.
  • Want to create new APIs with just a few mouse clicks, without writing any code.
  • Prefer not to write any code at all.
  • Want not to worry about backups, performance, and job monitoring.
  • Want someone to manage the data integration tool for you.

By selecting Etlworks as your data integration platform, you will be able to implement complex data integration flows with fewer steps and faster.

The key advantages of the Etlworks:

  • It can read data from all your sources and load it into all your destinations, including most databases, file storage systems and more than 150 SaaS applications.
  • It automatically parses even the most complicated JSON and XML documents (as well as other formats) and can connect to all your APIs and databases.
  • It is built for Cloud but works equally well when installed on-premise.
  • You can visualize and explore all your data, regardless of the format and location, before creating any data integration flows.
  • You probably won’t need to write any code at all, but even if you will, it will be just a few lines in the familiar programming language: SQL, JavaScript, and Python.
  • You can use SQL to extract and flatten data from nested documents.
  • Etlworks is a full-fledged enterprise service bus (ESB) so you can create data integration APIs with just a few mouse clicks.
  • Etlworks can integrate data behind the corporate firewall when working together with data integration agent.
  • We provide world-class support for all customers.
  • Our service is so affordable that you won’t have to get a board approval in order to use it.
  • No sales call necessary! Sign up and start using it right away.

In the new post-COVID reality, the pressure to do more with less is higher than ever before. By leveraging modern managed integration solutions, your company get a chance not only to save money but also to gain a competitive advantage. 

Etlworks is  solving data integration challenges since 2016. We are working with companies large and small around the world, in industries such as finance, healthcare, entertainment and consulting, to help them build and manage better data pipelines and deliver better data outcomes.

What is a Data Pipeline?

Etlworks is the leading provider of cloud-based managed data pipelines.

To gain business insights for competitive advantage every business these days is seeking ways to integrate data from multiple sources. Data and data analytics are critical to business operations, so it’s important to engineer and deploy strong and maintainable data pipelines.

data pipeline is a set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis. Most of the time, though, a data pipeline is also to perform some sort of processing or transformation on the data to enhance it.

Data pipelines often deliver mission-critical data and for important business decisions, so ensuring their accuracy and performance is required whether you implement them through data integration and ETL platforms, data-prep technologies, or real time data streaming architectures.

How is a data pipeline different from ETL?

You may commonly hear the terms ETL and data pipeline used interchangeably. ETL stands for Extract, Transform, and Load. The major dissimilarity of ETL is that it focuses entirely on one system to extract, transform, and load data to a particular data warehouse. Alternatively, ETL is just one of the components that fall under the data pipeline.

ETL pipelines move the data in batches to a specified system with regulated intervals. Comparatively, data pipelines have broader applicability to transform and process data through streaming or real-time.

Data pipelines do not necessarily have to load data to a database or data warehouse. It might be loaded to any number of targets, such as an AWS bucket or a data lake, or it might even trigger a webhook on another system to kick off a specific business process.

Data pipeline solutions

The nature and functional response of data pipeline would be different from cloud tools to migrate data to outright use it for a real-time solution. The following list shows the most popular types of pipelines available. Note that these systems are not mutually exclusive. For example, you might have a data pipeline that is optimized for both cloud and real-time.

Cloud-Based

The cost-benefit ratio of using cloud-based tools to integrate data is quite high. These tools are hosted in the cloud, allowing you to save money on infrastructure and expert resources because you can rely on the infrastructure and expertise of the vendor hosting your pipeline.

Batch

Batch processing allows you to easily transport a large amount of data at interval without having to necessitate real-time visibility. The process makes it easier for analysts who combine a multitude of marketing data to form a decisive result or pattern.

Real-Time or Streaming

Real-time or streaming processing is useful when an organization processes data from a streaming source, such as the data from financial markets or internet of things (IoT) devices and sensors. Real-time processing captures data as it comes off the source systems in real time, performs rudimentary data transformations (filters, samples, aggregates, calculates averages, determines min/max values) before firing off data to the downstream process.

Open Source

Open Source tools are ideal for small business owners who want lower cost and over-reliance on commercial vendors. However, the usefulness of such tools requires expertise to use the functionality because the underlying technology is publicly available and meant to be modified or extended by users.

Data pipeline use cases

Data Migration

Data pipelines are used to perform data migration tasks. These might involve moving data from databases, e.g. MongoDB, Oracle, Microsoft SQL Server, PostgreSQL, and MySQL into the cloud. Cloud databases are scalable and flexible and enable for easier creation of other data pipelines that use real-time streaming.

Data Warehousing and Analysis

Probably the most common destination for a data pipeline is a dashboard or suite of analytical tools. Raw data that is structured via ETL can be loaded into databases for analysis and visualization. Data scientists can then create graphs, tables and other visualizations from the data. This data can then be used to inform strategies and guide the purpose of future data projects.

AI and Machine Learning Algorithms

ETL and ELT pipelines can move data into machine learning and AI models. Machine learning algorithms can learn from the data, performing advanced parsing and wrangling techniques. These ML models can then be deployed into various software. Machine learning algorithms fed by data pipelines can be used in marketing, finance, science, telecoms, etc.

IoT Integration

Data pipelines are frequently used in IoT systems that use networks of sensors for data collection. Data inducted from various sources across a network can be transformed into data available for ready analysis. For example, an ETL pipeline may perform numerous calculations on huge quantities of delivery tracking information, vehicle locations, delay expectations, etc, to form a rough ETA estimate.

Getting started with a data pipeline

Setting up a reliable data pipeline doesn’t have to be complex and time-consuming. Etlworks can help you solve your biggest data collection, extraction, transformation, and transportation challenges. Sign up for Etlwoks for free and get the most from your data pipeline, faster than ever before.

Real-World Change Data Capture Example

Change data capture (CDC) is a necessity. With CDC changes are synced instantly or near-instantly. In practice, CDC is often used to replicate data between databases in real-time. Essentially, CDC it is a perfect solution for non-intrusive, low-impact real-time data integration.

I will illustrate how the Etlworks change data capture solution works for eLearning company that needs to load data from 1600+ MySQL databases into Snowflake data warehouse. Thanks to Etlworks, company now can make everyday operations data available to non-technical business users. Data warehouse in Snowflake is updated every 5 minutes so users can gain quick, current business insights.

The company requirements

  • Setup data pipeline to load incremental updates from 1600+ MySQL databases into the Snowflake data warehouse.
  • The data pipeline must support INSERTs, UPDATEs, and DELETEs.
  • The data pipeline must be able to automatically pickup new databases and adjust the destination schema if new columns are added to the source tables.
  • The expected volume of data: hundreds of gigabytes, billions of records on initial load, tens of millions updates every day. 
  • The expected number of tables across all MySQL databases: 55000.
  • The number of tables in Snowflake: 35
  • The data pipeline must be extremely resilient to the extract and load errors.
  • The data pipeline is expected to work in a fully automated mode.

Etlworks Solution

Step 1: Setting up MySQL read replica instance in RDS

Etlworks team recommended to stream CDC events from the MySQL read replica.  Our customer setup the MySQL instance in Amazon RDS and configured native MySQL replication from the production instance to the replica. Creating a read replica is an optional step, but it is significantly lessen a load of replication on your MySQL production instance.

The following permissions need to be configured for the MySQL user in the read replica MySQL instance.

PermissionDescription
SELECTIt enables the connector to select rows from tables in databases.
RELOADIt enables the connector the use of the FLUSH statement to clear or reload internal caches, flush tables, or acquire locks.
SHOW DATABASESIt enables the connector to see database names by issuing the SHOW DATABASE statement.
REPLICATION SLAVEIt enables the connector to connect to and read the MySQL server binlog.
REPLICATION CLIENTIt enables the connector the use of following statements: SHOW MASTER STATUS, SHOW SLAVE STATUS, and SHOW BINARY LOGS
LOCK_TABLES Amazon RDS or Amazon Aurora that do not allow a global read lock, table-level locks are used to create a consistent snapshot

The binlog replication for MySQL read replica instance needs to be enabled by setting the following Amazon RDC parameters:

binlog_format: ROW
log_bin_use_v1_row_events: 1
net_read_timeout: 3600
net_write_timeout: 3600
wait_timeout: 86400
Step 2: Setting up flows to extract data from MySQL using CDC

In Etlworks there are two options for extracting data using CDC and loading into the Snowflake:

  1. Create a separate flow to extract data from the database and create CSV files in the local storage and another flow to load CSV files into the Snowflake.
  2. Create a separate flow to extract data from the database and ingest into the messaging queue, such as Kafka.  Create another flow to load data from the queue into the Snowflake.

For this project, we decided to use option 1. Option 2 requires a complicated setup with a separate message queue such as Kafka or Azure Event Hubs.

Typical CDC flow can extract data from multiple tables in multiple databases but having a single flow pulling data from 55000+ tables would be a major bottleneck as it would be limited to a single blocking queue with a limited capacity. It would also create a single point of failure.

The better approach would be to create multiple parallel extract flows, each pulling data from all 35 Snowflake tables in a single database. However, considering that the flow must extract data from 1600 MySQL databases it would be impractical and very resource demanding – we would have to run 1600+ parallel extract flows. 

We decided to create 35 extract flows where each flow is pulling data from a single table in 1600 databases

There are other possible topologies to consider as well, for example:

  • group tables and databases alphabetically;
  • create separate extract flows for the large and high-traffic tables;
  • for the rest of the tables split flows in chunks, each extracting data from a significant number of tables across multiple databases (for example 1000) ;

We recommend selecting the topology that works the best for you, keeping in mind the performance and the maintenance overhead. 

Considering the design choice, we created a table in the Snowflake which has a list of databases to extract data from. This table is used to automatically populate the list of the included tables in a format: database1.table_abc,database2.table_abc,...database1500.table_abc and included databases in a format database1,database2,...database1500.

The basic idea is that we can use a single MySQL CDC connection where the Included Databases and Included Tables are set as {tokens}, populated at run-time by JavaScript.

Step 3: Setting up flows to load data in Snowflake

Once again, there are multiple options to consider.

Option 1. Initially we setup a flow that loads the staged CSV files into Snowflake by using the CDC MERGE action which applies INSERTs/UPDATEs/DELETEs in order in which CDC events were originated in the source database.

For each CSV file the flow does the following:

  1. Creates a temporary table in Snowflake.
  2. Executes COPY INTO command to load the file “as is” into the temp table.
  3. Uses Snowflake MERGE command to merge data in the temp table with the data in the actual table.

This approach guarantees the smallest latency but is more resource consuming and requires more Snowflake credits (can be more expensive from the Snowflake standpoint). 

Option 2. Skip Delete events and only apply INSERTs and UPDATEs. It can be a bit faster compared to option 1. 

Option 3. Always INSERT data into the staging tables in Snowflake, then periodically execute a SQL script to populate the actual tables by de-duplicating the staging tables and removing ‘d’ (delete) records. 

This option is very popular when customers are OK with longer delays between data being updated in the source and finally available in the data warehouse for consumption by the BI tools. 

After extensive testing, our customer decided to use option 3. The main reason was the fact that [almost] real-time approach (option 1) uses Snowflake cloud services and consumes extra Snowflake credits.

Step 4. Scheduling extract and load flows

The Extract Flows

In Etlworks, it is possible to schedule a flow to run continuously until there is nothing to do, then stop for a configurable number of seconds and restart. We recommended this schedule type for 35 extract flows.  The extract flows are running until they are automatically stopped to let the system add new databases. The customer set the delay between restarts to 2 hours. 

The load flow

The load flow is loading files into Snowflake in batches. We recommended running it every few minutes so it could clear the queue as often as possible. The customer set it to run every 5 minutes. 

On average the pipelines load tens of millions of records into Snowflake daily, but there are days when the number of records jumps to hundreds of millions. As configured the pipelines can easily handle extracting and loading billions of records a day.

Conclusion

The Etlworks platform enables Snowflake users to leverage the low-impact, real-time change data capture from all major databases: MySQL, Postgres, SQL Server, Oracle, DB2, MongoDB.

Moving data continuously, as new database transactions occur makes it possible for Snowflake users to maintain the real-time data pipelines that feed Snowflake’s fast and flexible storage and analytics solutions.

If you would like a brief demo of CDC to Snowflake, please schedule a demo.

%d bloggers like this: