ETL/ELT all your data into Amazon Redshift DW

amazon_integration

Amazon Redshift is fast, scalable, and easy-to-use, making it a popular data warehouse solution. Redshift is straightforward to query with SQL, efficient for analytical queries and can be a simple add-on for any organization operating its tech stack on AWS.

Amazon Web Services have many benefits. Whether you choose it for the pay as you go pricing, high performance, and speed or its versatile and flexible services provided, we are here to present you the best data loading approaches that work for us.

Etlworks allows users to load your data from cloud storages and APIs, SQL and NoSQL databases, web services to Amazon Redshift data warehouse in a few simple steps. You can configure and schedule the flow using intuitive drag and drop interface and let Etlworks do the rest.

Etlworks supports not just one-time data loading operation. It can help you to integrate your data sources with Amazon Redshift and automate updating your Amazon Redshift with fresh data with no additional effort or involvement!

Today we are going to examine how to load data into Amazon Redshift.

A typical Redshift flow performs the following operations:

  • Extract data from the source.
  • Create CSV files.
  • Compress files using the gzip algorithm.
  • Copy files into Amazon S3 bucket.
  • Check to see if the destination Amazon Redshift table exists, and if it does not – creates the table using metadata from the source.
  • Execute the Amazon Redshift COPY command.
  • Clean up the remaining files.

There are some prerequisites have to be met, before you can design a flow that loads data into Amazon Redshift:

Now, you are ready to create a Redshift flow. Start by opening the Flows window, clicking the + button, and typing redshift into the search field:

redshift-flows

Continue by selecting the flow type, adding source-to-destination transformations and entering the transformation parameters:

redshift-transformation

You can select one of the following sources (FROM) for the Redshift flow:

  • API – use any appropriate string as the source (FROM) name
  • Web Service – use any appropriate string as the source (FROM) name
  • File – use the source file name or a wildcard filename as the source (FROM) name
  • Database – use the table name as the source (FROM) name
  • CDC – use the fully qualified table name as the source (FROM) name
  • Queue – use the queue topic name as the source (FROM) name

For most of the Redshift flows, the destination (TO) is going to be Amazon S3 connection. To configure the final destination, click the Connections tab and select the available Amazon Redshift connection.

redshift-connection

Amazon Redshift can load data from CSVJSON, and Avro formats but Etlwoks supports loading only from CSV so you will need to create a new CSV format and set it as a destination format. If you are loading large datasets into Amazon Redshift, consider configuring a format to split the document into smaller files. Amazon Redshift can load files in parallel, also transferring smaller files over the network can be faster.

If necessary, you can create a mapping  between the source and destination (Redshift) fields.

Mapping is not required, but please remember that if a source field name is not supported by Redshift, it will return an error and the data will not be loaded into the database. For example, if you are loading data from Google Analytics, the output (source) is going to include fields with the prefix ga: ( ga:user, ga:browser, etc. ). Unfortunately, Amazon Redshift does not support fields with a : , so the data will be rejected. If that happens, you can use mapping to rename the destination fields.

ELT for Amazon Redshift

Amazon Redshift provides affordable and nearly unlimited computing power which allows loading data to Amazon Redshift as-is, without pre-aggregation, and processing and transforming all the data quickly when executing analytics queries. Thus, the ETL (Extract-Transform-Load) approach transforms to ELT (Extract-Load-Transform). This may simplify data loading to Amazon Redshift greatly, as you don’t need to think about the necessary transformations.

Etlworks supports executing complex ELT scripts directly into Amazon Redshift which greatly improves performance and reliability of the data injection.

I hope this has been helpful. Go forth and load large amounts of data.

Author: Olena S.

Data Architect, University of Pittsburgh

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s