

In this blog post, I will talk about building a reliable data ingestion pipeline for Snowflake.

Snowflake is a data warehouse built for the cloud. It works across multiple clouds and combines the power of data warehousing, the flexibility of big data platforms, and the elasticity of the cloud.

According to the Snowflake documentation, loading data is a two-step process:

1. Upload (i.e. stage) one or more data files into either an internal stage (i.e. within Snowflake) or an external location.
2. Use the COPY INTO command to load the contents of the staged file(s) into a Snowflake database table.
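As a rough illustration of these two steps, the sketch below builds the corresponding SQL statements as plain Python strings. The table name, stage name, file path, and pattern are hypothetical, and nothing is executed here: running the statements would require a Snowflake connection. It also assumes an internal stage, since PUT does not upload to external locations such as S3 or Azure.

```python
def build_put_statement(local_path: str, stage: str) -> str:
    """Step 1: stage a local file into an internal Snowflake stage."""
    # Assumption: the file is already gzipped, so Snowflake's automatic
    # compression on upload is switched off.
    return f"PUT file://{local_path} @{stage} AUTO_COMPRESS = FALSE"


def build_copy_statement(table: str, stage: str, pattern: str) -> str:
    """Step 2: load every staged file whose name matches a regex."""
    # PATTERN takes a regular expression, not a shell-style wildcard.
    return (
        f"COPY INTO {table} FROM @{stage} "
        f"PATTERN = '{pattern}' "
        "FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP SKIP_HEADER = 1)"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_put_statement("/tmp/orders_001.csv.gz", "my_stage"))
    print(build_copy_statement("orders", "my_stage", ".*orders.*[.]csv[.]gz"))
```

The pattern uses `[.]` rather than a backslash to match a literal dot, which avoids having to double-escape backslashes inside a SQL string literal.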

One step is obviously missing from this list: preparing the data files to be loaded into Snowflake. With that step included, there are three.

If steps 1-3 do not look complicated to you, let’s add more details.

Typically, developers tasked with loading data into any data warehouse have to deal with the following issues:

How to build a reliable ingestion pipeline that loads hundreds of millions of records every day.
How to load only recent changes (incremental replication).
How to transform data before loading into the data warehouse.
How to transform data after loading into the data warehouse.
How to deal with changed metadata (table structure) in both the source and in the destination.
How to load data from nested datasets, typically returned by web services (in addition to loading data from relational databases).

This is just a short list of hand-picked problems. The good news is that Snowflake is built from the ground up to help with bulk-loading data, thanks to the very robust COPY INTO command, and with continuous loading using Snowpipe.

Any Snowflake ingestion pipeline should use at least the COPY INTO command and, possibly, Snowpipe.

The simplest ETL process that loads data into Snowflake will look like this:

1. Extract data from the source and create CSV (or JSON, XML, and other formats) data files.
2. Compress the files using gzip.
3. Copy the data files into the Snowflake stage in an Amazon S3 bucket (Azure Blob storage and the local file system are also options).
4. Execute the COPY INTO command using a wildcard file mask to load the data into the Snowflake table.
5. Repeat steps 1-4 for each data source. Ingestion ends here.
6. If needed, execute SQL statements in the Snowflake database to transform the data. For example, populate dimensions from the staging tables.
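The first two steps of this list (create a CSV file, then gzip it) can be combined so that the uncompressed file never touches the disk. The following is a minimal sketch using only the Python standard library; the function name, file name, and sample rows are made up for illustration, and a real extract would stream rows from a database cursor or an API response instead of a literal list.

```python
import csv
import gzip
from pathlib import Path
from typing import Iterable, Sequence


def write_gzip_csv(
    path: Path, header: Sequence[str], rows: Iterable[Sequence[object]]
) -> int:
    """Write rows to a gzip-compressed CSV file and return the row count."""
    count = 0
    # "wt" opens the gzip stream in text mode; newline="" leaves line
    # endings to the csv module, as the csv documentation recommends.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for row in rows:  # rows are consumed one at a time, not buffered
            writer.writerow(row)
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real source extract.
    sample = [(1, "alice"), (2, "bob")]
    written = write_gzip_csv(Path("orders_001.csv.gz"), ["id", "name"], sample)
    print(f"wrote {written} rows")
```

Because rows are written as they arrive, memory use stays flat regardless of how many records the source returns.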

The part where you need to build a “reliable data ingestion pipeline” typically includes:

Performance considerations and data streaming.
Error-handling and retries.
Notifications on success and failure.
Reliability when moving files to the staging area in S3 or Azure.

The COPY INTO command can load data from files compressed with gzip, so it makes sense to compress all the data files before copying or moving them to the staging area.

Cleaning up: what to do with all these data files after they have been loaded (or not loaded) into Snowflake.
Dealing with changing table structure in the source and in the destination.

Snowflake supports transforming data while loading it into a table using the COPY INTO <table> command, but it will not allow you to load data with an inconsistent structure.
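To make the "error-handling and retries" item from the list above more concrete, here is one possible shape for a retry helper with exponential backoff. It is a generic sketch, not something specific to Snowflake or Etlworks: the function name and defaults are invented, and `action` stands in for whatever step may fail transiently, such as uploading a file to the stage.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(); after a failure wait base_delay * 2**n, then retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")  # the loop always returns or raises


if __name__ == "__main__":
    # Hypothetical flaky step that succeeds on the third call.
    state = {"calls": 0}

    def upload() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("temporary network failure")
        return "uploaded"

    print(run_with_retries(upload, base_delay=0.1))
```

The `sleep` parameter is injectable so that the waiting can be replaced in tests. In practice `retry_on` would be narrowed to the errors that are actually transient, so that a malformed file does not get retried pointlessly.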

Add the need to handle incremental updates in the source (change replication) and you have a [relatively] complicated project on your hands.
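One common way to approach incremental updates is a high watermark: remember the largest value of a monotonically increasing column (a modification timestamp or an auto-increment ID) from the previous run, and extract only the rows above it. The sketch below illustrates that idea under several assumptions: the table and column names are hypothetical, the `%s` placeholder style depends on the database driver, and identifiers are trusted rather than user-supplied. Note that this technique does not detect deleted rows.

```python
from typing import Iterable, Optional, Sequence, Tuple


def build_incremental_query(
    table: str, watermark_column: str, last_value: Optional[object]
) -> Tuple[str, tuple]:
    """Return (sql, params) selecting rows changed since the last run."""
    if last_value is None:
        # First run: no watermark stored yet, so extract the whole table.
        return f"SELECT * FROM {table} ORDER BY {watermark_column}", ()
    return (
        f"SELECT * FROM {table} WHERE {watermark_column} > %s "
        f"ORDER BY {watermark_column}",
        (last_value,),
    )


def next_watermark(
    rows: Iterable[Sequence[object]], column_index: int, previous: object
) -> object:
    """Highest watermark in the batch, or the previous one if it is empty."""
    return max((row[column_index] for row in rows), default=previous)


if __name__ == "__main__":
    # Hypothetical table, column, and stored watermark.
    sql, params = build_incremental_query("orders", "updated_at", "2024-01-01 00:00:00")
    print(sql, params)
```

The new watermark should be persisted only after the load has succeeded, so that a failed run is repeated from the same point rather than skipped.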

As always, there are two options:

Develop a home-grown ETL using a combination of scripts and in-house tools.
Develop a solution using a third-party ETL tool or service.

Assuming that you are ready to choose option 2 (if not, go to paragraph one), let’s discuss

The requirements for the right ETL tool for the job

When selecting an ETL tool or service, the questions you should ask yourself are:

How much are you willing to invest in learning?
Do you prefer the code-first or the drag-and-drop approach?
Do you need to extract data from semi-structured and unstructured data sources (typically web services), or is all your data in relational databases?
Are you looking for point-to-point integration between well-known data sources (for example, Salesforce->Snowflake) with minimal customization, or do you need to build a custom integration?
Do you need your tool to support change replication?
How about real-time or almost real-time ETL?
Are you looking for a hosted and managed service running in the cloud, or for an on-premises solution?

Why is Etlworks the best tool for loading data into Snowflake?

First, just like Snowflake, Etlworks is a cloud-first data integration service. It works perfectly well when installed on-premises, but it really shines in the cloud. When subscribing to the service, you can choose the region that is closest to your Snowflake instance, which makes a real difference to how fast data loads. Also, you won’t have to worry about managing the service.

Second, in Etlworks you can build even the most complicated data integration flows and transformations using a simple drag-and-drop interface. There is no need to learn a new language and no complicated build-test-deploy process.

Third, if you are dealing with heterogeneous data sources, web services, semi-structured or unstructured data, or transformations which go beyond the simple point-to-point, pre-baked integrations – you are probably limited to just a few tools. Etlworks is one of them.

Last but not least, if you need your tool to support native change (incremental) replication from relational databases or web services, Etlworks can handle this as well. No programming required. And it is fast.

How it works

In Etlworks, you can choose from several highly configurable data integration flows, optimized for Snowflake:

Extract data from databases and load in Snowflake.
Extract data from data objects (including web services) and load in Snowflake.
Extract data from well-known APIs (such as Google Analytics) and load in Snowflake.
Load existing files in Snowflake.
Execute any SQL statement or multiple SQL statements.

Behind the scenes, the flows perform complicated transformations and create data files for Snowflake, compress the files with gzip before copying them to the Snowflake stage in the cloud or in server storage, automatically create and execute the COPY INTO <table> command, and much more. For example, a flow can automatically create a table in Snowflake if it does not exist, or it can purge the data files in case of error (Snowflake can automatically purge the files in case of success).
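For readers curious what the Snowflake side of "create the table if it does not exist" and "purge on success" can look like, here is a hedged sketch that builds the two statements as Python strings. It is not Etlworks' implementation, which is not shown in this post; the table, stage, and column definitions are hypothetical. It relies on two documented Snowflake features: CREATE TABLE IF NOT EXISTS and the PURGE copy option.

```python
from typing import Mapping


def build_create_table(table: str, columns: Mapping[str, str]) -> str:
    """Build a CREATE TABLE statement that is a no-op if the table exists."""
    column_list = ", ".join(
        f"{name} {sql_type}" for name, sql_type in columns.items()
    )
    return f"CREATE TABLE IF NOT EXISTS {table} ({column_list})"


def build_copy_with_purge(table: str, stage: str) -> str:
    """Build a COPY INTO that removes staged files after a successful load."""
    return (
        f"COPY INTO {table} FROM @{stage} "
        "FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP) "
        # ABORT_STATEMENT stops the load on the first error; PURGE = TRUE
        # deletes the staged files only when they were loaded successfully.
        "ON_ERROR = ABORT_STATEMENT PURGE = TRUE"
    )


if __name__ == "__main__":
    # Hypothetical table definition and stage name.
    columns = {"id": "NUMBER", "name": "VARCHAR", "updated_at": "TIMESTAMP_NTZ"}
    print(build_create_table("orders", columns))
    print(build_copy_with_purge("orders", "my_stage"))
```

Since PURGE only covers the success path, files left in the stage after a failed load still have to be removed (or kept for inspection) by the pipeline itself.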

You can find the actual, step-by-step instructions on how to build Snowflake data integration flows in Etlworks in our documentation.

An extra bonus is that in Etlworks you can connect to the Snowflake database, discover the schemas, tables, and columns, run SQL queries, and share queries with the team, all without ever using the Snowflake SQL workbench. Even better, you can connect to all your data sources and destinations, regardless of format and location, to discover the data and the metadata. Learn more about Etlworks Explorer.