What is a Data Pipeline?

Etlworks is the leading provider of cloud-based managed data pipelines.

To gain business insights for competitive advantage every business these days is seeking ways to integrate data from multiple sources. Data and data analytics are critical to business operations, so it’s important to engineer and deploy strong and maintainable data pipelines.

data pipeline is a set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis. Most of the time, though, a data pipeline is also to perform some sort of processing or transformation on the data to enhance it.

Data pipelines often deliver mission-critical data and for important business decisions, so ensuring their accuracy and performance is required whether you implement them through data integration and ETL platforms, data-prep technologies, or real time data streaming architectures.

How is a data pipeline different from ETL?

You may commonly hear the terms ETL and data pipeline used interchangeably. ETL stands for Extract, Transform, and Load. The major dissimilarity of ETL is that it focuses entirely on one system to extract, transform, and load data to a particular data warehouse. Alternatively, ETL is just one of the components that fall under the data pipeline.

ETL pipelines move the data in batches to a specified system with regulated intervals. Comparatively, data pipelines have broader applicability to transform and process data through streaming or real-time.

Data pipelines do not necessarily have to load data to a database or data warehouse. It might be loaded to any number of targets, such as an AWS bucket or a data lake, or it might even trigger a webhook on another system to kick off a specific business process.

Data pipeline solutions

The nature and functional response of data pipeline would be different from cloud tools to migrate data to outright use it for a real-time solution. The following list shows the most popular types of pipelines available. Note that these systems are not mutually exclusive. For example, you might have a data pipeline that is optimized for both cloud and real-time.

Cloud-Based

The cost-benefit ratio of using cloud-based tools to integrate data is quite high. These tools are hosted in the cloud, allowing you to save money on infrastructure and expert resources because you can rely on the infrastructure and expertise of the vendor hosting your pipeline.

Batch

Batch processing allows you to easily transport a large amount of data at interval without having to necessitate real-time visibility. The process makes it easier for analysts who combine a multitude of marketing data to form a decisive result or pattern.

Real-Time or Streaming

Real-time or streaming processing is useful when an organization processes data from a streaming source, such as the data from financial markets or internet of things (IoT) devices and sensors. Real-time processing captures data as it comes off the source systems in real time, performs rudimentary data transformations (filters, samples, aggregates, calculates averages, determines min/max values) before firing off data to the downstream process.

Open Source

Open Source tools are ideal for small business owners who want lower cost and over-reliance on commercial vendors. However, the usefulness of such tools requires expertise to use the functionality because the underlying technology is publicly available and meant to be modified or extended by users.

Data pipeline use cases

Data Migration

Data pipelines are used to perform data migration tasks. These might involve moving data from databases, e.g. MongoDB, Oracle, Microsoft SQL Server, PostgreSQL, and MySQL into the cloud. Cloud databases are scalable and flexible and enable for easier creation of other data pipelines that use real-time streaming.

Data Warehousing and Analysis

Probably the most common destination for a data pipeline is a dashboard or suite of analytical tools. Raw data that is structured via ETL can be loaded into databases for analysis and visualization. Data scientists can then create graphs, tables and other visualizations from the data. This data can then be used to inform strategies and guide the purpose of future data projects.

AI and Machine Learning Algorithms

ETL and ELT pipelines can move data into machine learning and AI models. Machine learning algorithms can learn from the data, performing advanced parsing and wrangling techniques. These ML models can then be deployed into various software. Machine learning algorithms fed by data pipelines can be used in marketing, finance, science, telecoms, etc.

IoT Integration

Data pipelines are frequently used in IoT systems that use networks of sensors for data collection. Data inducted from various sources across a network can be transformed into data available for ready analysis. For example, an ETL pipeline may perform numerous calculations on huge quantities of delivery tracking information, vehicle locations, delay expectations, etc, to form a rough ETA estimate.

Getting started with a data pipeline

Setting up a reliable data pipeline doesn’t have to be complex and time-consuming. Etlworks can help you solve your biggest data collection, extraction, transformation, and transportation challenges. Sign up for Etlwoks for free and get the most from your data pipeline, faster than ever before.

Real-World Change Data Capture Example

Change data capture (CDC) is a necessity. With CDC changes are synced instantly or near-instantly. In practice, CDC is often used to replicate data between databases in real-time. Essentially, CDC it is a perfect solution for non-intrusive, low-impact real-time data integration.

I will illustrate how the Etlworks change data capture solution works for eLearning company that needs to load data from 1600+ MySQL databases into Snowflake data warehouse. Thanks to Etlworks, company now can make everyday operations data available to non-technical business users. Data warehouse in Snowflake is updated every 5 minutes so users can gain quick, current business insights.

The company requirements

  • Setup data pipeline to load incremental updates from 1600+ MySQL databases into the Snowflake data warehouse.
  • The data pipeline must support INSERTs, UPDATEs, and DELETEs.
  • The data pipeline must be able to automatically pickup new databases and adjust the destination schema if new columns are added to the source tables.
  • The expected volume of data: hundreds of gigabytes, billions of records on initial load, tens of millions updates every day. 
  • The expected number of tables across all MySQL databases: 55000.
  • The number of tables in Snowflake: 35
  • The data pipeline must be extremely resilient to the extract and load errors.
  • The data pipeline is expected to work in a fully automated mode.

Etlworks Solution

Step 1: Setting up MySQL read replica instance in RDS

Etlworks team recommended to stream CDC events from the MySQL read replica.  Our customer setup the MySQL instance in Amazon RDS and configured native MySQL replication from the production instance to the replica. Creating a read replica is an optional step, but it is significantly lessen a load of replication on your MySQL production instance.

The following permissions need to be configured for the MySQL user in the read replica MySQL instance.

PermissionDescription
SELECTIt enables the connector to select rows from tables in databases.
RELOADIt enables the connector the use of the FLUSH statement to clear or reload internal caches, flush tables, or acquire locks.
SHOW DATABASESIt enables the connector to see database names by issuing the SHOW DATABASE statement.
REPLICATION SLAVEIt enables the connector to connect to and read the MySQL server binlog.
REPLICATION CLIENTIt enables the connector the use of following statements: SHOW MASTER STATUS, SHOW SLAVE STATUS, and SHOW BINARY LOGS
LOCK_TABLES Amazon RDS or Amazon Aurora that do not allow a global read lock, table-level locks are used to create a consistent snapshot

The binlog replication for MySQL read replica instance needs to be enabled by setting the following Amazon RDC parameters:

binlog_format: ROW
log_bin_use_v1_row_events: 1
net_read_timeout: 3600
net_write_timeout: 3600
wait_timeout: 86400
Step 2: Setting up flows to extract data from MySQL using CDC

In Etlworks there are two options for extracting data using CDC and loading into the Snowflake:

  1. Create a separate flow to extract data from the database and create CSV files in the local storage and another flow to load CSV files into the Snowflake.
  2. Create a separate flow to extract data from the database and ingest into the messaging queue, such as Kafka.  Create another flow to load data from the queue into the Snowflake.

For this project, we decided to use option 1. Option 2 requires a complicated setup with a separate message queue such as Kafka or Azure Event Hubs.

Typical CDC flow can extract data from multiple tables in multiple databases but having a single flow pulling data from 55000+ tables would be a major bottleneck as it would be limited to a single blocking queue with a limited capacity. It would also create a single point of failure.

The better approach would be to create multiple parallel extract flows, each pulling data from all 35 Snowflake tables in a single database. However, considering that the flow must extract data from 1600 MySQL databases it would be impractical and very resource demanding – we would have to run 1600+ parallel extract flows. 

We decided to create 35 extract flows where each flow is pulling data from a single table in 1600 databases

There are other possible topologies to consider as well, for example:

  • group tables and databases alphabetically;
  • create separate extract flows for the large and high-traffic tables;
  • for the rest of the tables split flows in chunks, each extracting data from a significant number of tables across multiple databases (for example 1000) ;

We recommend selecting the topology that works the best for you, keeping in mind the performance and the maintenance overhead. 

Considering the design choice, we created a table in the Snowflake which has a list of databases to extract data from. This table is used to automatically populate the list of the included tables in a format: database1.table_abc,database2.table_abc,...database1500.table_abc and included databases in a format database1,database2,...database1500.

The basic idea is that we can use a single MySQL CDC connection where the Included Databases and Included Tables are set as {tokens}, populated at run-time by JavaScript.

Step 3: Setting up flows to load data in Snowflake

Once again, there are multiple options to consider.

Option 1. Initially we setup a flow that loads the staged CSV files into Snowflake by using the CDC MERGE action which applies INSERTs/UPDATEs/DELETEs in order in which CDC events were originated in the source database.

For each CSV file the flow does the following:

  1. Creates a temporary table in Snowflake.
  2. Executes COPY INTO command to load the file “as is” into the temp table.
  3. Uses Snowflake MERGE command to merge data in the temp table with the data in the actual table.

This approach guarantees the smallest latency but is more resource consuming and requires more Snowflake credits (can be more expensive from the Snowflake standpoint). 

Option 2. Skip Delete events and only apply INSERTs and UPDATEs. It can be a bit faster compared to option 1. 

Option 3. Always INSERT data into the staging tables in Snowflake, then periodically execute a SQL script to populate the actual tables by de-duplicating the staging tables and removing ‘d’ (delete) records. 

This option is very popular when customers are OK with longer delays between data being updated in the source and finally available in the data warehouse for consumption by the BI tools. 

After extensive testing, our customer decided to use option 3. The main reason was the fact that [almost] real-time approach (option 1) uses Snowflake cloud services and consumes extra Snowflake credits.

Step 4. Scheduling extract and load flows

The Extract Flows

In Etlworks, it is possible to schedule a flow to run continuously until there is nothing to do, then stop for a configurable number of seconds and restart. We recommended this schedule type for 35 extract flows.  The extract flows are running until they are automatically stopped to let the system add new databases. The customer set the delay between restarts to 2 hours. 

The load flow

The load flow is loading files into Snowflake in batches. We recommended running it every few minutes so it could clear the queue as often as possible. The customer set it to run every 5 minutes. 

On average the pipelines load tens of millions of records into Snowflake daily, but there are days when the number of records jumps to hundreds of millions. As configured the pipelines can easily handle extracting and loading billions of records a day.

Conclusion

The Etlworks platform enables Snowflake users to leverage the low-impact, real-time change data capture from all major databases: MySQL, Postgres, SQL Server, Oracle, DB2, MongoDB.

Moving data continuously, as new database transactions occur makes it possible for Snowflake users to maintain the real-time data pipelines that feed Snowflake’s fast and flexible storage and analytics solutions.

If you would like a brief demo of CDC to Snowflake, please schedule a demo.

What is an API integration?

An API integration is the connection between two or more applications, via their APIs, that lets those systems exchange data. API integrations power processes throughout many high-performing businesses that keep data in sync, enhance productivity, and drive revenue.

What is an API?

An API is a set of definitions and protocols for building and integration application software. API stands for application programming interface.

An API is like a building block of programming that helps programmers avoid writing code from scratch each time it is needed. APIs are organized sets of commands, functions, and protocols that programmers use to develop software.

Consider building something with a box of Lego bricks. Instead of creating or carving a new block each time you need one, you simply choose from an assortment that are prepared and ready to be plugged into your project. Each block is designed to connect with other blocks in order to speed up the building process. In essence, that’s how an API works. APIs streamline and boost efficiency everywhere they are used.

The most commonly-used API for web services: REST

REST stands for Representational State Transfer. Where a regular API is a essentially a set of functions that allow communication between applications, a REST API is a architectural style for connected applications on the web.

In fact, today 70% of public APIs depend on REST services. REST APIs offer more flexibility, a gentler learning curve, and work directly from an HTTP URL rather than relying on XML.

How do REST APIs work?

At their simplest form, REST APIs for web services usually involve the following parties:

  • Your web-based, API-enabled application
  • Remote server
  • Specific data request
  • Returned data/function

While there are many different flavors of software and many different flavors of server, REST APIs act as a standardized wrapper to help your API-enabled applications successfully communicate with online servers to make information requests.

Now, how do API integrations work?

The term API integration can be defined as the process of creating a means for two or more APIs to share data and communicate with each other without human interruption. It involves the use of APIs to enable communication between two web tools or applications. It allows organizations to automate their systems, enhance the seamless sharing of data, and integrate current applications.

API integration has become pivotal in the modern world due to the explosion of cloud-based products and apps. As such, organizations need to create a connected system where data will be relayed flawlessly between various software tools without the need to do it manually. API integration has proved to be the much-needed solution as it allows the sharing of process and enterprise data among applications in a given ecosystem. It improves the flexibility of information and service delivery, as well as makes the embedding of content from different sites and apps easy. An API acts as the interface that permits the integration of two applications.

Building API integration into your application or workflow allows you to make the most of your data by employing it where it will be most useful. Whether this means using it for analytics or as content within your software or applications, integration keeps you connected to your business information.

Getting started with APIs

Data integration using API technology doesn’t have to be daunting. Etlworks Integrator uses API-led connectivity to create an application network of apps, data, and devices, both on-premises and in the cloud. It provides developers with a complete toolkit to enhance scalability, simplify mapping and security, and for managing, preparing, and profiling all of your company’s data. Whether you want to build event-driven data integration flows or incorporate Integrator into your own automation workflow, you can use the developer features below to achieve your goals.

See how Etlworks can help you make the most of API with a free demo of Data Integration Platform.

What is Data Integration?

In 2020, it is definitive to argue that one of a company’s most significant assets is its Data.

Data integration involves collecting data from different sources, combining this data into a unified set that can easily be used for a predetermined business function, and enabling you to query and manipulate all of your data from a single interface and derive analytics and statistics. While the sources and types of data continue to grow, it becomes increasingly important to be able to perform quality analysis on that data.

Benefits of Data integration

Data integration is becoming more and more common, as numerous apps and companies race to meet consumer demand to have all of their data collected in one place and in a useful format.

From small businesses that want to spend less employee time creating reports, to large corporations taking a deep dive into an array of user statistics, data integration is indispensable for businesses that want to compete in today’s economy.

Here are the primary benefits of data integration offered to businesses.

Integrate data from multiple sources

Businesses are using tons of applications, systems, and data warehouses within your organization. Nevertheless, these data sources are disparate and siloed. A data silo, is a repository of data that is isolated. Generally, in businesses, this means that the information is under the control of a business unit or department and is not available across the organization. When systems containing valuable data are integrated across an organization, then the data gathered by one app or department can benefit the company as a whole, not just a team or individual.

Inter-system cooperation ensures that your company is free from information silos that benefit the few, and allow data gathered across the organization to be viewed and used by anyone who needs access to it.

For example, integrating data from multiple online stores can give you a more complete understanding of customer behavior and payment processing preferences.

Data integrity and data quality

Data integrity is an important element of data integrations. Data integrity is the assurance of the consistency and quality of the data through its entire lifecycle.
Nevertheless, most of the data is still of poor quality. Eliminating errors in data sets used for business intelligence and decision making is one of the primary advantages of data integration.

When data is gathered manually, then every tool, database and user account must be properly accounted for and set up prior to collecting this data, meaning that any data source overlooked or added last minute results in an incomplete data set as the end result. This also means that reporting protocols must be re-worked whenever a new data source is added.

When data systems are integrated properly, however, errors from overlooked sources do not occur and reports can be run and accessed in real time.

Save time and boost efficiency

When a company takes measures to integrate its data properly, it cuts down significantly on the time it takes to prepare and analyze that data. The automation of unified views cuts out the need for manually gathering data, and employees no longer need to build connections from scratch whenever they need to run a report or build an application.

Additionally, using the right tools, rather than hand-coding the integration, returns even more time (and resources overall) to the dev team.
When systems are properly integrated, collecting data and converting it into its final, usable format becomes a quick, easy task instead of a treasure hunt across your company’s data assets.

All the time saved on these tasks can be put to other, better uses, with more hours earmarked for analysis and execution to make an organization more productive and competitive.

Increase competitiveness

This is the most straightforward from all the other benefits of data integration.

Having a data integration strategy in place can help you to plan what actions you need to take to improve the accessibility of data both externally and internally, you will be able to influence many vital parts of your business. Yes, the overall goal is to generate more profit. An essential part of that is delighting your customers, so offering better services than your competition cannot be overlooked.

Tools for Data Integration

There are numerous tools available in the market that would help us query out the Data effectively since our Data is not going to integrate itself. To name a few, we have some Open Source Data Integration Tools, Cloud-based Data Integration Tools and the On-premises data integration tools.

Again, the question is that how to and which one to choose among those various tools available in the market.  The features you should look for in a data integration tool are:

A lot of connectors. There are many systems and applications in the world; the more pre-built connectors your Data Integration tool has, the more time your team will save.

Portability. It is important, as companies increasingly move to hybrid cloud models, to be able to build your data integrations once and run them anywhere

Ease of use. Data integration tools should be easy to learn and easy to use with a GUI interface to make visualizing your data pipelines simpler.

A transparent price model. Your data integration tool provider should not ding you for increasing the number of connectors or data volumes.

Cloud compatibility. Your data integration tool should work natively in a single cloud, multi-cloud, or hybrid cloud environment.

Business intelligence, analytics, and competitive edges are all at stake when it comes to data integration. That is why it is critical for your company to have full access to every data set from every source.

Simple data integration with Etlworks

Etlworks is an all-in-one, any-to-any data integration platform for all your ETL projects, regardless of the complexity, data location, format and volume. There is no code to write — with just a few clicks, Etlworks will extract your data from wherever it lives and get it ready to be analyzed, understood, and acted upon.

Etlworks offers a free 14-day trial. Give it a try today!

Version Control and Why You Need It

data-integration-essential-questions

Data integration is becoming a hot topic.

Business intelligence applications can make use of a comprehensive set of information provided through data integration to derive important business insights from a company’s historic and current data. By providing executives and managers with an in-depth understanding of the company’s current operations, as well as the opportunities and risks it faces in the marketplace, data integration can have a direct bottom-line impact.

In the past, the job of integrating systems was a complex, often costly, frequently cumbersome process, involving on-premise integration tools. These days a new breed of applications, SaaS (Software-as-a-service), is taking over as the integration platform of choice for new integration projects. There is a lot to consider when the developers and IT management approach a data integration project.

The features like ease of use, scalability and flexibility, real-time integration, security, and compliance are essential. This blog will talk about the importance of one feature that often gets overlooked – version control.

Why the version control is important?

Because the data integration tool is a software that is used in performing a data integration process on the data source moving the data to the destination. And, as you are planning to work on a big software project that consists of technical concepts, requires a collaboration of different team members and needs frequent changes, you need to use a version control system (VCS).

Version control is essential to track, organize and control changes over source code and avoid confusion, especially for large, fast-changing projects, like data integration.

Etlworks is a modern, cloud-first, any-to-any data integration platform that supports version control for connections, formats, listeners, flows & macros.

Etlworks Integrator automatically tracks changes for all artifacts: connections, formats, listeners, flows, macros, and schedules. There is nothing to configure except the retention policy.

Version control is supported for the following artifacts: flows, connections, formats, listeners, schedules, macros.

Using Etlwork’s built-in version control you can:

  • View the history of changes – who-changed-what-and-when.
  • Compare any two versions.
  • Revert to any previous version.
  • Add comments to the commit when saving the artifacts.

Etlworks version control 

To access the version control UI click the [@] button at the bottom left corner of each screen.

mceclip0

Then, click the [@] button at the bottom left corner of each screen to view the history of changes.

mceclip4

You can compare any two versions by selecting versions and click [View Diff between Selected]. mceclip5

The changes are displayed in a two-panel or unified view. From this view, you can navigate back and forward in the history of changes and also revert to any version.mceclip12

To view the specific version and compare it with a previous one click the [eye] button.mceclip6

To revert to any previous version:

Step 1. Click the [Revert] button.mceclip7

Step 2. Confirm that you want to revert to the previous version by clicking the Revert button.

Step 3. Click [Save] or [Save with message] to complete the revert.mceclip9

When saving any change, to any artifact you have an option to add a message (comments) to the commit. To add a message to the commit click the [Save with message] button.mceclip10

The commit message is displayed in the Change History popup.mceclip11

In Conclusion

After reading this blog, hopefully, you have a better understanding of version control. Version control is very helpful for organizing and backing up artifacts you are working with. It is also helpful for multiple people working on one artifact. Version Control is a great way to keep your artifacts organized and backed up in case of the worse.

Begin your data journey, get 14-day free trial!