Reverse ETL is a new key component of the modern data stack that enables “operational analytics.”
Before defining “reverse ETL”, let’s briefly talk about plain old ETL. Extract, transform, and load (ETL) is a data integration methodology that extracts raw data from sources, transforms the data on a secondary processing server, and then loads the data into a target database.
ETL is nothing new. The concept was actually popularized in the 1970s.
More recently, with the rise of cloud data warehouses, extract, load, transform (ELT) is beginning to replace ETL. Unlike the ETL method, ELT does not require data transformation before the loading process. ELT loads raw data directly into a cloud data warehouse. Data transformations are executed inside the data warehouse via SQL pushdowns, Python scripts, and other code.
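The ELT pattern described above can be sketched in a few lines. This is a minimal illustration, not Etlworks code: SQLite stands in for a cloud data warehouse, and the table and column names are invented for the example. The point is that raw data is loaded "as is" first, and the transformation happens afterwards, inside the warehouse, as SQL.

```python
# Minimal ELT sketch: raw rows are loaded as-is, then transformed
# inside the "warehouse" with SQL. SQLite stands in for a cloud
# warehouse; table and column names are illustrative.
import sqlite3

raw_rows = [
    ("2024-01-05", "  ACME Corp ", "149.99"),
    ("2024-01-06", "Globex", "89.50"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_date TEXT, customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# The "T" happens after the "L": a SQL pushdown cleans and casts the data.
conn.execute("""
    CREATE TABLE orders AS
    SELECT order_date,
           TRIM(customer)       AS customer,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
""")

print(conn.execute("SELECT customer, amount FROM orders ORDER BY customer").fetchall())
```

In a real ELT flow, the `CREATE TABLE ... AS SELECT` step would run inside Snowflake, Redshift, or a similar platform against much larger raw tables.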
ETL and ELT both transfer data from third-party systems, such as business applications (HubSpot, Zendesk, Salesforce) and/or databases (Oracle, MySQL), into target data warehouses. But with reverse ETL, the data warehouse is the source rather than the target. The target is a third-party system.
What is reverse ETL?
Reverse ETL is the exact inverse of the ETL process covered above. Simply put, it is the process of copying data from the data warehouse to the SaaS products an organization uses.
Why would you move data out of the warehouse? Companies today increasingly engage in operational analytics – an approach that makes data accessible to “operational” teams (sales, marketing, and so on) for operational use cases. It differs from the more classical approach of using warehouse data only for reporting and business intelligence. Instead of using data to influence long-term strategy, operational analytics informs the day-to-day operations of the business. To put it simply, it puts the company’s data to work so everyone in your organization can make smarter decisions.
Reverse ETL is the tooling that supports the new practice of operational analytics. It can be seen as a bridge between your data warehouse and your cloud applications: reverse ETL tools move data out of the warehouse into the SaaS products your teams love and use. For example, the Sales team wants the list of webinar attendees imported as leads into Salesforce, the Support team wants to see data about accounts with premium support in Zendesk, and the Finance team wants a CSV of rolled-up transaction data to use in Excel or Google Sheets.
The data needed is already available in the data warehouse, so with reverse ETL all you really need to do is extract it from the warehouse and sync it to your external tools – the simplest possible solution.
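The extract-and-sync idea can be sketched as follows. This is a hedged illustration, not an Etlworks API: SQLite stands in for the warehouse, and `push_to_crm()` is a hypothetical stand-in for a real SaaS client call (for example, a Salesforce lead import).

```python
# Hedged sketch of a reverse ETL sync: read a modeled table from the
# warehouse and push each row to a SaaS API. SQLite stands in for the
# warehouse; push_to_crm() is a hypothetical stand-in for a real
# SaaS client (e.g. a Salesforce lead import).
import sqlite3

def push_to_crm(lead, sink):
    # In a real pipeline this would be an authenticated HTTP call to
    # the SaaS product's API; here we just collect the payloads.
    sink.append({"email": lead[0], "company": lead[1]})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webinar_attendees (email TEXT, company TEXT)")
conn.executemany(
    "INSERT INTO webinar_attendees VALUES (?, ?)",
    [("ada@example.com", "ACME"), ("grace@example.com", "Globex")],
)

synced = []
for row in conn.execute("SELECT email, company FROM webinar_attendees"):
    push_to_crm(row, synced)

print(len(synced), "leads pushed")
```

A production reverse ETL tool adds the hard parts on top of this loop: incremental syncs, rate limiting, retries, and field mapping between warehouse columns and SaaS objects.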
Etlworks and reverse ETL
Creating a reverse ETL pipeline from scratch is a complex undertaking: a business has to invest significant resources to develop it, and then ensure it can keep up with growing data volumes and schema variations.
Etlworks is a cloud-native data integration platform that allows you to perform ELT or reverse ETL with the help of your cloud data platform.
Etlworks helps you transfer data directly from a source of your choice, such as Snowflake or Amazon Redshift, to SaaS applications – CRMs such as Salesforce and HubSpot, support tools such as Zendesk and Jira – in a fully automated and secure manner, without writing the same code over and over. It makes your life easier and data migration hassle-free.
Want to take Etlworks for a ride? Sign up for a 14-day free trial and see the difference! Check out the pricing model to get a better understanding of which plan suits you best.
In-house data solutions – otherwise known as the small thing someone built ages ago that has somehow grown into a critical part of your data infrastructure, and which you’d rather not touch because who knows how it all works? Some companies also fear the risks of migration, so they pursue an “if it ain’t broke, don’t fix it” strategy, which is not always the most efficient or cost-effective one.
It’s a very common scenario. But those in-house solutions can quickly become harder and harder to manage and extend to support increasingly complicated requirements.
At some point you face the decision: do you keep improving an in-house solution or do you reassess your needs and look around for a possible off-the-shelf solution?
If you are reading this article, you are probably in a position where you:
Are tired of dealing with complicated inputs and outputs and don’t want to write custom code anymore.
Need a way to connect to all your data, regardless of its format and location.
Need a simple way to transform your data from one format to another.
Need to track changes in your transactional database and push them to your data warehouse.
Need to connect to the external and internal APIs with different authentication schemas, requests, and responses.
Want to create new APIs with just a few mouse clicks, without writing any code.
Prefer not to write any code at all.
Don’t want to worry about backups, performance, or job monitoring.
Want someone to manage the data integration tool for you.
By selecting Etlworks as your data integration platform, you will be able to implement complex data integration flows faster and with fewer steps.
The key advantages of Etlworks:
It can read data from all your sources and load it into all your destinations, including most databases, file storage systems and more than 150 SaaS applications.
It automatically parses even the most complicated JSON and XML documents (as well as other formats) and can connect to all your APIs and databases.
It is built for the cloud but works equally well when installed on-premises.
You can visualize and explore all your data, regardless of the format and location, before creating any data integration flows.
You can use SQL to extract and flatten data from nested documents.
Etlworks is a full-fledged enterprise service bus (ESB) so you can create data integration APIs with just a few mouse clicks.
Etlworks can integrate data behind the corporate firewall when used together with the data integration agent.
We provide world-class support for all customers.
Our service is so affordable that you won’t need board approval to use it.
No sales call necessary! Sign up and start using it right away.
In the new post-COVID reality, the pressure to do more with less is higher than ever before. By leveraging modern managed integration solutions, your company gets a chance not only to save money but also to gain a competitive advantage.
Etlworks has been solving data integration challenges since 2016. We work with companies large and small around the world, in industries such as finance, healthcare, entertainment, and consulting, helping them build and manage better data pipelines and deliver better data outcomes.
To gain business insights for competitive advantage, every business these days is seeking ways to integrate data from multiple sources. Data and analytics are critical to business operations, so it’s important to engineer and deploy strong, maintainable data pipelines.
A data pipeline is a set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis. Most of the time, though, a data pipeline also performs some sort of processing or transformation on the data to enhance it.
Data pipelines often deliver mission-critical data for important business decisions, so ensuring their accuracy and performance is required whether you implement them through data integration and ETL platforms, data-prep technologies, or real-time data streaming architectures.
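The "set of actions" definition above can be made concrete with a toy pipeline. All of the function and field names here are illustrative – the point is the shape: extract, transform (enrichment), and load composed into one flow.

```python
# A data pipeline as a set of actions: ingest raw records, enrich
# them, and deliver them to a destination. All names are illustrative.
def extract():
    # Stand-in for reading from a source system.
    return [{"sku": "A1", "qty": 2, "unit_price": 5.0},
            {"sku": "B2", "qty": 1, "unit_price": 12.5}]

def transform(records):
    # Enrichment step: compute a derived field for each record.
    return [{**r, "total": r["qty"] * r["unit_price"]} for r in records]

def load(records, destination):
    # Stand-in for writing to a warehouse, bucket, or webhook target.
    destination.extend(records)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```

Real pipelines differ in scale and reliability machinery (scheduling, retries, monitoring), but they are built from these same three kinds of steps.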
How is a data pipeline different from ETL?
You may commonly hear the terms ETL and data pipeline used interchangeably. ETL stands for Extract, Transform, and Load. The key difference is that ETL focuses entirely on one system: extracting, transforming, and loading data into a particular data warehouse. A data pipeline is the broader concept, and ETL is just one kind of component that can fall under it.
ETL pipelines move data in batches to a specified system at regular intervals. Data pipelines, by comparison, have broader applicability and can also transform and process data through streaming, in real time.
Data pipelines do not necessarily have to load data into a database or data warehouse. The data might be loaded to any number of targets, such as an Amazon S3 bucket or a data lake, or the pipeline might even trigger a webhook on another system to kick off a specific business process.
Data pipeline solutions
The nature and function of a data pipeline varies, from cloud tools that migrate data to pipelines built outright for real-time solutions. The following list shows the most popular types of pipelines. Note that these types are not mutually exclusive: for example, a data pipeline can be optimized for both cloud and real-time processing.
Cloud-Based
The cost-benefit ratio of using cloud-based tools to integrate data is quite high. These tools are hosted in the cloud, allowing you to save money on infrastructure and expert resources because you can rely on the infrastructure and expertise of the vendor hosting your pipeline.
Batch Processing
Batch processing allows you to transport large amounts of data at regular intervals when real-time visibility is not required. It is a good fit for analysts who combine a multitude of data points – marketing data, for instance – to arrive at a decisive result or pattern.
Real-Time or Streaming
Real-time or streaming processing is useful when an organization processes data from a streaming source, such as the data from financial markets or internet of things (IoT) devices and sensors. Real-time processing captures data as it comes off the source systems in real time, performs rudimentary data transformations (filters, samples, aggregates, calculates averages, determines min/max values) before firing off data to the downstream process.
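The "rudimentary data transformations" mentioned above – filtering, aggregating, tracking min/max and averages as events arrive – can be sketched as follows. The sensor values and range bounds are invented for the example.

```python
# Sketch of rudimentary streaming transformations: filter a stream of
# sensor readings, then keep running count/min/max/average as events
# arrive, before handing results to a downstream process.
def stream_stats(readings, low, high):
    count, total = 0, 0.0
    lo = hi = None
    for value in readings:
        if not (low <= value <= high):   # filter out-of-range samples
            continue
        count += 1
        total += value
        lo = value if lo is None else min(lo, value)
        hi = value if hi is None else max(hi, value)
    return {"count": count, "avg": total / count, "min": lo, "max": hi}

stats = stream_stats([21.5, 99.0, 22.0, -5.0, 20.5], low=0, high=50)
print(stats)
```

In a real streaming pipeline these aggregates would typically be computed over time windows rather than over the whole stream, but the per-event logic is the same.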
Open Source
Open-source tools are ideal for businesses that want lower costs and less reliance on commercial vendors. However, using such tools effectively requires in-house expertise, because the underlying technology is publicly available and meant to be modified or extended by its users.
Data pipeline use cases
Data Migration
Data pipelines are used to perform data migration tasks. These might involve moving data from databases, e.g. MongoDB, Oracle, Microsoft SQL Server, PostgreSQL, and MySQL, into the cloud. Cloud databases are scalable and flexible and enable easier creation of other data pipelines that use real-time streaming.
Data Warehousing and Analysis
Probably the most common destination for a data pipeline is a dashboard or suite of analytical tools. Raw data that is structured via ETL can be loaded into databases for analysis and visualization. Data scientists can then create graphs, tables and other visualizations from the data. This data can then be used to inform strategies and guide the purpose of future data projects.
AI and Machine Learning Algorithms
ETL and ELT pipelines can move data into machine learning and AI models. Machine learning algorithms can learn from the data, performing advanced parsing and wrangling techniques. These ML models can then be deployed into various software. Machine learning algorithms fed by data pipelines can be used in marketing, finance, science, telecoms, etc.
Internet of Things (IoT)
Data pipelines are frequently used in IoT systems, which rely on networks of sensors for data collection. Data ingested from various sources across a network can be transformed into data that is ready for analysis. For example, a pipeline may perform numerous calculations on huge quantities of delivery tracking information, vehicle locations, and delay expectations to form a rough ETA.
Getting started with a data pipeline
Setting up a reliable data pipeline doesn’t have to be complex and time-consuming. Etlworks can help you solve your biggest data collection, extraction, transformation, and transportation challenges. Sign up for Etlworks for free and get the most from your data pipeline, faster than ever before.
Change data capture (CDC) is becoming a necessity. With CDC, changes are synced instantly or near-instantly. In practice, CDC is often used to replicate data between databases in real time, making it a perfect solution for non-intrusive, low-impact real-time data integration.
I will illustrate how the Etlworks change data capture solution works for an eLearning company that needs to load data from 1600+ MySQL databases into a Snowflake data warehouse. Thanks to Etlworks, the company can now make everyday operational data available to non-technical business users. The data warehouse in Snowflake is updated every 5 minutes, so users gain quick, current business insights.
The company’s requirements
Set up a data pipeline to load incremental updates from 1600+ MySQL databases into the Snowflake data warehouse.
The data pipeline must support INSERTs, UPDATEs, and DELETEs.
The data pipeline must be able to automatically pick up new databases and adjust the destination schema if new columns are added to the source tables.
The expected volume of data: hundreds of gigabytes, billions of records on initial load, tens of millions updates every day.
The expected number of tables across all MySQL databases: 55000.
The number of tables in Snowflake: 35.
The data pipeline must be extremely resilient to the extract and load errors.
The data pipeline is expected to work in a fully automated mode.
Step 1: Setting up MySQL read replica instance in RDS
The Etlworks team recommended streaming CDC events from a MySQL read replica. Our customer set up the MySQL instance in Amazon RDS and configured native MySQL replication from the production instance to the replica. Creating a read replica is an optional step, but it significantly lessens the replication load on your MySQL production instance.
The following permissions need to be configured for the MySQL user in the read replica MySQL instance (these are the standard permissions a MySQL CDC connector requires):
SELECT – enables the connector to select rows from tables in databases.
RELOAD – enables the connector to use the FLUSH statement to clear or reload internal caches, flush tables, or acquire locks.
SHOW DATABASES – enables the connector to see database names by issuing the SHOW DATABASES statement.
REPLICATION SLAVE – enables the connector to connect to and read the MySQL server binlog.
REPLICATION CLIENT – enables the connector to use the following statements: SHOW MASTER STATUS, SHOW SLAVE STATUS, and SHOW BINARY LOGS.
On Amazon RDS or Amazon Aurora, which do not allow a global read lock, table-level locks are used to create a consistent snapshot.
Binlog replication for the MySQL read replica instance needs to be enabled by setting the appropriate Amazon RDS parameters – typically, binlog_format set to ROW in the parameter group, with automated backups enabled (a backup retention period greater than zero), which RDS requires in order to keep the binlog.
Step 2: Setting up flows to extract data from MySQL using CDC
In Etlworks, there are two options for extracting data using CDC and loading it into Snowflake:
Create one flow to extract data from the database and stage CSV files in local storage, and another flow to load the CSV files into Snowflake.
Create one flow to extract data from the database and ingest it into a message queue, such as Kafka, and another flow to load data from the queue into Snowflake.
For this project, we decided to use option 1. Option 2 requires a complicated setup with a separate message queue such as Kafka or Azure Event Hubs.
A typical CDC flow can extract data from multiple tables in multiple databases, but having a single flow pulling data from 55,000+ tables would be a major bottleneck: it would be limited to a single blocking queue with limited capacity, and it would create a single point of failure.
A better approach would be to create multiple parallel extract flows, each pulling data from all 35 tables in a single database. However, considering that we must extract data from 1600 MySQL databases, this would be impractical and very resource-demanding – we would have to run 1600+ parallel extract flows.
Instead, we decided to create 35 extract flows, where each flow pulls data from a single table across all 1600 databases.
There are other possible topologies to consider as well, for example:
group tables and databases alphabetically;
create separate extract flows for the large and high-traffic tables;
for the rest of the tables, split the flows into chunks, each extracting data from a significant number of tables across multiple databases (for example, 1000);
We recommend selecting the topology that works the best for you, keeping in mind the performance and the maintenance overhead.
Given this design choice, we created a table in Snowflake that holds the list of databases to extract data from. This table is used to automatically populate the list of included tables, in the format database1.table_abc,database2.table_abc,...database1500.table_abc, and the list of included databases, in the format database1,database2,...database1500.
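Generating those include lists from a database registry is straightforward string work. A minimal sketch, with invented database and table names (the registry query itself is omitted):

```python
# Sketch of generating the comma-separated include lists from the
# database registry table described above. Names are illustrative.
def build_include_lists(databases, table):
    included_tables = ",".join(f"{db}.{table}" for db in databases)
    included_databases = ",".join(databases)
    return included_tables, included_databases

dbs = ["database1", "database2", "database3"]
tables, databases = build_include_lists(dbs, "table_abc")
print(tables)     # database1.table_abc,database2.table_abc,database3.table_abc
print(databases)  # database1,database2,database3
```

Because the lists are derived from a table rather than hard-coded, adding a new database to the registry is enough for the pipeline to pick it up on its next restart.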
Step 3: Setting up flows to load data in Snowflake
Once again, there are multiple options to consider.
Option 1. Initially, we set up a flow that loads the staged CSV files into Snowflake using the CDC MERGE action, which applies INSERTs, UPDATEs, and DELETEs in the order in which the CDC events originated in the source database.
For each CSV file the flow does the following:
Creates a temporary table in Snowflake.
Executes COPY INTO command to load the file “as is” into the temp table.
Uses Snowflake MERGE command to merge data in the temp table with the data in the actual table.
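The ordering guarantee is the essential property of the steps above: applying the same events out of order would produce a different table. A minimal sketch of the merge semantics, using a dict keyed by primary key as a stand-in for the Snowflake table (the event shapes and operation codes are illustrative):

```python
# Sketch of CDC MERGE semantics: events are applied to the target in
# the order they originated in the source. A dict keyed by primary
# key stands in for the Snowflake table; event shapes are illustrative.
def apply_cdc_events(target, events):
    for op, key, row in events:
        if op in ("c", "u"):       # create or update -> upsert the row
            target[key] = row
        elif op == "d":            # delete -> remove the row
            target.pop(key, None)
    return target

table = {}
events = [
    ("c", 1, {"name": "alice"}),
    ("c", 2, {"name": "bob"}),
    ("u", 1, {"name": "alicia"}),
    ("d", 2, None),
]
print(apply_cdc_events(table, events))  # {1: {'name': 'alicia'}}
```

Snowflake's MERGE statement performs the same upsert-or-delete decision set-wise, matching temp-table rows against the actual table on the primary key.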
This approach guarantees the smallest latency but is more resource-consuming and requires more Snowflake credits (it can be more expensive from the Snowflake standpoint).
Option 2. Skip DELETE events and apply only INSERTs and UPDATEs. This can be a bit faster than option 1.
Option 3. Always INSERT data into staging tables in Snowflake, then periodically execute a SQL script that populates the actual tables by de-duplicating the staging tables and removing ‘d’ (delete) records.
This option is very popular when customers are OK with longer delays between data being updated in the source and finally available in the data warehouse for consumption by the BI tools.
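The de-duplication in option 3 boils down to "keep the latest event per key, then drop the deletes." A minimal sketch, with SQLite standing in for Snowflake and invented table/column names (a sequence number plays the role of the CDC event timestamp):

```python
# Sketch of option 3: every CDC event is appended to a staging table;
# a periodic SQL script rebuilds the actual table by keeping the
# latest event per key and dropping 'd' (delete) records. SQLite
# stands in for Snowflake; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INT, name TEXT, op TEXT, seq INT)")
conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", [
    (1, "alice",  "c", 1),
    (2, "bob",    "c", 2),
    (1, "alicia", "u", 3),
    (2, None,     "d", 4),
])

# De-duplicate: keep only the latest event per id, then drop deletes.
conn.execute("""
    CREATE TABLE actual AS
    SELECT id, name FROM staging s
    WHERE seq = (SELECT MAX(seq) FROM staging WHERE id = s.id)
      AND op <> 'd'
""")
print(conn.execute("SELECT id, name FROM actual").fetchall())
```

Because the heavy lifting happens in one periodic query instead of per-event merges, this approach trades latency for lower warehouse cost – exactly the trade-off described above.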
After extensive testing, our customer decided to use option 3. The main reason was that the near-real-time approach (option 1) uses Snowflake cloud services and consumes extra Snowflake credits.
Step 4. Scheduling extract and load flows
The Extract Flows
In Etlworks, it is possible to schedule a flow to run continuously until there is nothing left to do, stop for a configurable number of seconds, and then restart. We recommended this schedule type for the 35 extract flows: they run until they are automatically stopped, which lets the system pick up new databases. The customer set the delay between restarts to 2 hours.
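The schedule described above can be sketched as a simple loop. This is an illustration of the pattern, not Etlworks scheduler code; the queue contents and parameters are invented.

```python
# Sketch of the "run continuously, then pause and restart" schedule:
# the flow drains its work queue, sleeps for a configurable delay (so
# newly registered databases can be picked up on the next cycle), and
# runs again. Names and parameters are illustrative.
import time

def run_extract_flow(work_queue, delay_seconds, max_cycles):
    processed = []
    for _ in range(max_cycles):
        while work_queue:              # run until there is nothing to do
            processed.append(work_queue.pop(0))
        time.sleep(delay_seconds)      # pause between restarts
    return processed

done = run_extract_flow(["db1.t", "db2.t", "db3.t"], delay_seconds=0, max_cycles=2)
print(done)
```

In the customer's setup the delay was 2 hours and the loop runs indefinitely; bounding the cycles here just keeps the sketch finite.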
The load flow
The load flow loads files into Snowflake in batches. We recommended running it every few minutes so it clears the queue as often as possible; the customer set it to run every 5 minutes.
On average the pipelines load tens of millions of records into Snowflake daily, but on some days the number jumps to hundreds of millions. As configured, the pipelines can easily handle extracting and loading billions of records a day.
The Etlworks platform enables Snowflake users to leverage the low-impact, real-time change data capture from all major databases: MySQL, Postgres, SQL Server, Oracle, DB2, MongoDB.
Moving data continuously as new database transactions occur makes it possible for Snowflake users to maintain real-time data pipelines that feed Snowflake’s fast and flexible storage and analytics solutions.
If you would like to see CDC to Snowflake in action, please schedule a brief demo.
An API integration is the connection between two or more applications, via their APIs, that lets those systems exchange data. API integrations power processes throughout many high-performing businesses that keep data in sync, enhance productivity, and drive revenue.
An API is like a building block of programming that helps programmers avoid writing code from scratch each time it is needed. APIs are organized sets of commands, functions, and protocols that programmers use to develop software.
Consider building something with a box of Lego bricks. Instead of creating or carving a new block each time you need one, you simply choose from an assortment of blocks that are prepared and ready to be plugged into your project. Each block is designed to connect with other blocks in order to speed up the building process. In essence, that’s how an API works. APIs streamline and boost efficiency everywhere they are used.
The most commonly-used API for web services: REST
REST stands for Representational State Transfer. Where a regular API is essentially a set of functions that allow communication between applications, a REST API is an architectural style for connected applications on the web.
In fact, today 70% of public APIs depend on REST services. REST APIs offer more flexibility, a gentler learning curve, and work directly from an HTTP URL rather than relying on XML.
How do REST APIs work?
In their simplest form, REST APIs for web services usually involve the following parties:
Your web-based, API-enabled application
The remote server
A specific data request
While there are many different flavors of software and many different flavors of server, REST APIs act as a standardized wrapper to help your API-enabled applications successfully communicate with online servers to make information requests.
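The anatomy of such a request – a resource URL, an HTTP method, and headers – can be shown with Python's standard library. The request is built but never sent, so the endpoint URL below is purely illustrative.

```python
# Sketch of the anatomy of a REST request: a resource URL, an HTTP
# method, and headers. The request object is built but not sent, so
# the example.com endpoint is purely illustrative.
from urllib.request import Request

req = Request(
    "https://api.example.com/v1/customers/42",
    method="GET",
    headers={"Accept": "application/json"},
)
print(req.get_method(), req.full_url)
```

In a real integration, sending this request (and parsing the JSON response) is all it takes to read a resource – which is exactly why REST's reliance on plain HTTP URLs gives it a gentler learning curve than XML-based protocols.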
Now, how do API integrations work?
The term API integration refers to the process of enabling two or more APIs to share data and communicate with each other without human intervention. It involves using APIs to enable communication between web tools or applications, allowing organizations to automate their systems, enhance the seamless sharing of data, and integrate their current applications.
API integration has become pivotal in the modern world due to the explosion of cloud-based products and apps. As such, organizations need to create a connected system where data will be relayed flawlessly between various software tools without the need to do it manually. API integration has proved to be the much-needed solution as it allows the sharing of process and enterprise data among applications in a given ecosystem. It improves the flexibility of information and service delivery, as well as makes the embedding of content from different sites and apps easy. An API acts as the interface that permits the integration of two applications.
Building API integration into your application or workflow allows you to make the most of your data by employing it where it will be most useful. Whether this means using it for analytics or as content within your software or applications, integration keeps you connected to your business information.
Getting started with APIs
Data integration using API technology doesn’t have to be daunting. Etlworks Integrator uses API-led connectivity to create an application network of apps, data, and devices, both on-premises and in the cloud. It provides developers with a complete toolkit to enhance scalability, simplify mapping and security, and manage, prepare, and profile all of your company’s data. Whether you want to build event-driven data integration flows or incorporate Integrator into your own automation workflow, its developer features can help you achieve your goals.