Cloud Data Integration


In this blog post, I will discuss the definition of cloud data integration and what makes it truly useful.  

Before we start, let’s get on the same page and define what cloud data integration is.

According to Wikipedia, cloud data integration software must have the following features:

  • Deployed on a multi-tenant, elastic cloud infrastructure.
  • Subscription model pricing: operating expense, not capital expenditure.
  • No software development: required connectors should already be available.
  • Users do not perform deployment or manage the platform itself.
  • Presence of integration management & monitoring features.

While I agree with the definition, there’s something is missing:

where is the data we are suppose to be integrating?

If you are ahead of the curve, all your data is already stored in the cloud. While I think we all will be here eventually, as of today, a typical enterprise – from two guys working out of a garage to multinational corporations – owns and operates multiple data silos. I would add diverse and isolated data silos:

  • Cloud databases.
  • On-premise databases, available from the Internet.
  • On-premise databases, not available from the Internet.
  • Public APIs.
  • Private APIs, not available from the Internet.
  • Cloud-based third-party applications.
  • Locally hosted third-party applications.
  • Legacy applications.
  • Files stored locally.
  • Files stored in cloud data storage.
  • Office 365 and Google Docs documents.

Can your favorite data integration platform handle the vast array of data sources? If the answer is “Yes it can! We just need to deploy it in our corporate network and it will be able to connect to all our data,” then it is not a cloud data integration anymore. Don’t get me wrong, there is nothing wrong with the ETL tool deployed locally. It gets the job done, but you are not getting the benefits of the true cloud-based platform, specifically this one:

users do not perform deployment or manage the platform itself.

If this is not a showstopper, my advice is to find and stick to the tool which has all the required connectors and is easy to program and operate. Sure, you will need a competent DevOps group on payroll (in addition to the ETL developers), who will be managing and monitoring the tool, installing upgrades, performing maintenance, etc., but hey…it works.

Keep reading if you want to focus on breaking the data silos in your organization instead of managing the data integration platform. The solution to the problem at hand is so-called hybrid data integration.

Hybrid data integration is a solution when some of the connectors can run on-premise, behind the corporate firewall, while others, and the platform itself runs on the cloud.

We, at Etlworks, believe that no data silo should be left behind, so in addition to our best in class cloud data integration service we offer fully autonomous, zero-maintenance data integration agents which can be installed on-premise, behind the corporate firewall.  Data integration agents are essentially connectors installed locally and seamlessly integrated with a cloud-based Etlworks service.

Let’s consider these typical data integration scenarios:

Source and destination are in the cloud

Example: the source is an SQL Server database in Amazon RDS and the destination is a Snowflake data warehouse.

In this case, no extra software is required. Etlworks can connect to the majority of the cloud-based databases and APIs directly. Check out available connectors.

The source is on-premise, behind the firewall and the destination is in the cloud

Example: the source is locally hosted PostgreSQL database, not available from the Internet, and the destination is Amazon Redshift.

In this scenario, you will need to install a data integration agent as a Windows or Linux service in any available computer in your corporate network (you can install multiple agents in multiple networks if needed). The agent includes a built-in scheduler so it can run periodical extracts from your on-premise database and push changes either directly to the cloud data warehouse or to the cloud-based data storage ( Amazon S3, for example).  You can then configure a flow in Etlworks, which will be pulling data from the cloud data storage and loading into the cloud-based data warehouse.  The flow can use the extremely fast direct data upload into the Redshift available as a task in Etlworks.

The source is in the cloud and the destination is on-premise, behind the firewall

Example: the source is a proprietary API, available from the Internet and the destination is a database in the Azure cloud.

Data Integration Agent can work both ways: extracting data from the sources behind the firewall and loading data into the local databases. In this scenario, the flow in Etlworks will be extracting data from the cloud-based API, then transforming and storing it in the cloud-based storage, available to the agent. The data integration agent will be loading data from the cloud-based storage directly into the local database.

The source is in the cloud and a destination is a third-party software hosted locally

Example: the source is a proprietary API, available from the Internet and the destination is locally hosted data analytics software.

If the third-party software can load data from services such as Google Sheets, you can configure a flow in Etlworks, which will be extracting and transforming data from the API and loading into the Google Sheets. The third-party software will then be loading data directly from the specific worksheet. You can always find a format which is understood by Etlworks and a third-party software.

Source and destination are on-premise, behind the firewall

Example: the source is a proprietary API, not available from the Internet and the destination is another proprietary API, again not available from the Internet.

In this case, you probably don’t need cloud-based software at all. Look at Etlworks on-premise subscription options as well as an option to buy a perpetual license.