Etlworks CEO: Efficient data structures and real-time data streaming


Search Business Analytics recently interviewed Etlworks CEO Maksym Sherbinin about the importance of creating efficient data structures and real-time data streaming.

Here are his insights:

Q: To what extent does defaulting to JSON as a data storage mechanism contribute to decreased analytics performance and larger file sizes?

A: There are two factors that contribute to JSON (as well as CSV and XML) not being an optimal format for storing and processing large datasets:

  • The lack of schema enforcement, which results in the same data being ingested and stored multiple times in the data warehouse, has a direct impact on performance, quality of service, and cost.
  • JSON-encoded datasets cannot be stored in a parallel manner, which decreases their usefulness for systems built on top of Hadoop and other frameworks used for distributed processing.

Q: What are the existing alternatives for developers to create more efficient data structures for different kinds of analytics workloads in terms of data/file formats?

A: There are two very distinct alternatives:

  • Column-based databases optimized for data warehousing and massively parallel processing, specifically Snowflake and Amazon Redshift.
  • Formats optimized for use in Hadoop clusters and data streaming: Optimized Row Columnar (ORC), Avro, and Parquet.

Q: What is the current state of tools that can automate some of this process (i.e. are there tools that can look at how an app works with data and recommend a better file format or data structure, and perhaps compile or transform it into these more efficient formats like traditional programming language compilers)?

A: When it comes to encoding data in any of the Hadoop-optimized formats, such as Avro and Parquet, there are great open-source libraries available for practically any programming language known to humankind. Converting data stored somewhere else (for example, in a database) to Avro or Parquet datasets and storing them in a distributed file system can be done with home-grown tools or (preferably) a good ETL tool that can connect to any source and map, transform, and store data in the Hadoop cluster. Below are some of the any-to-any ETL tools which support various sources, Avro and Parquet as destination formats, and the Hadoop Distributed File System (HDFS) as a destination:

  • Etlworks
  • Talend
  • Informatica
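
For developers who prefer the library route mentioned above, encoding a dataset as Parquet takes only a few lines of code. Below is a minimal sketch in JavaScript that assumes the open-source parquetjs package and a made-up two-column schema; treat it as a prototype, not a production pipeline.

// Minimal sketch: write a Parquet file with the open-source parquetjs package (assumed API).
// The schema and file name below are made up for illustration.
const parquet = require('parquetjs');

async function writeSample() {
  const schema = new parquet.ParquetSchema({
    customer_id: { type: 'INT64' },
    total_spent: { type: 'DOUBLE' }
  });

  const writer = await parquet.ParquetWriter.openFile(schema, 'customers.parquet');
  await writer.appendRow({ customer_id: 1, total_spent: 99.95 });
  await writer.appendRow({ customer_id: 2, total_spent: 12.50 });
  await writer.close(); // flushes row groups and writes the footer
}

writeSample();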

The other trend is real-time data streaming using distributed platforms such as Apache Kafka. The Alooma enterprise pipeline is a good example of a service that can be used to stream large datasets to analytical platforms, with the datasets encoded as Avro or Parquet.

Additionally, there is a growing list of tools which can be used to ingest data into column-based data warehouses built for the cloud, such as Snowflake and Amazon Redshift. Depending on your requirements, you might consider the following tools:

  • Matillion ETL
  • Etlworks Integrator
  • AWS Glue

Finally, selecting the best format for the job is not something any tool should do for you automatically, but having multiple options is a must. The ability to experiment and prototype with different formats and storage options, as well as ingestion techniques (batch load, streaming, SQL, etc.), is why a good ETL tool goes a long way. It saves you time and money and provides peace of mind.


Q: What are your thoughts on the best practices for finding the balance between identifying the most efficient data structures for a given analytics use case, and quickly experimenting with different analytics use cases?

A: This article describes how Uber went from a general-purpose database, to JSON-encoded datasets, to a Hadoop-based big data platform. Granted, not all businesses require something that extreme, so common sense is your best friend. A typical relational database such as MySQL or PostgreSQL is perfectly capable of storing and processing relatively large analytical datasets (millions of records). Models with hundreds of millions or billions of records require specialized solutions, using either column-based data warehouses or Hadoop-like clusters with a parallel file system.

Connecting to Marketo API


Marketo exposes a REST API which allows for remote execution of many of the system’s capabilities.  From creating programs to bulk lead import, there are a large number of options which allow fine-grained control of a Marketo instance.

Read about Marketo REST API.

Prerequisites

Assuming that you already have a Marketo account, in order to connect to the Marketo REST API you must have the following:

  • The base URL for the Marketo instance, for example, 377-XYZ-015.mktorest.com
  • Client ID
  • Client Secret

This article explains how to obtain the Client ID and the Client Secret.

Creating a connection to the Marketo REST API

In this section, we will show you how to create a connection to the Marketo REST API endpoint which returns the available fields for leads.

Step 1. Add a new HTTP connection in Etlworks.

Step 2. Enter the following URL in the “URL” field.

https://base-url/rest/v1/leads/describe.json?access_token={access_token}

where the base-url is the URL from the “Prerequisites” section, for example, 377-XYZ-015.mktorest.com.

Notice the leads/describe.json part, which is the API endpoint that returns the fields for the leads.

Step 3. Select GET as a “Method”.

Step 4. Select OAuth2 for “Authentication”. In the “User” field enter the “Client ID” from the “Prerequisites” section and in the “Password” field enter the “Client Secret” from the same section.

Step 5. In the “Authentication URL” field enter the following URL:

https://base-url/identity/oauth/token?grant_type=client_credentials&client_id={user}&client_secret={password}

where the base-url is the URL from the “Prerequisites” section.

Step 6. Select POST for “HTTP Method for Token and OAuth2 Authentication”.

Step 7. Select application/json as an “Authentication Request Content Type”.

Step 8. Give the connection a name and save it.
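
To sanity-check the endpoints outside Etlworks, here is a minimal sketch in plain JavaScript (Node.js 18+ with the built-in fetch; the base URL and credentials are placeholders) that mirrors the connection configured above: it requests an access token from the identity endpoint and then calls leads/describe.json.

// Minimal sketch of the same two calls the connection above makes (placeholder credentials).
const BASE_URL = '377-XYZ-015.mktorest.com';   // base URL from the "Prerequisites" section
const CLIENT_ID = 'your-client-id';            // placeholder
const CLIENT_SECRET = 'your-client-secret';    // placeholder

async function describeLeads() {
  // Token request (the connection above is configured to use POST for the token call)
  const tokenUrl = `https://${BASE_URL}/identity/oauth/token` +
    `?grant_type=client_credentials&client_id=${CLIENT_ID}&client_secret=${CLIENT_SECRET}`;
  const tokenResponse = await fetch(tokenUrl, { method: 'POST' });
  const { access_token } = await tokenResponse.json();

  // Endpoint that returns the available fields for the leads
  const describeUrl = `https://${BASE_URL}/rest/v1/leads/describe.json?access_token=${access_token}`;
  const describeResponse = await fetch(describeUrl);
  return describeResponse.json();
}

describeLeads().then((fields) => console.log(fields));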

Testing Marketo REST API connection

The following shows you how to test the connection using Explorer.

Step 1. If you don’t have one already, create a JSON format. Use all the default settings.

Step 2. Go to Explorer, select the connection created in the previous section and click the “expand” (down-arrow) button. Link the format created in step 1 to the connection.

Step 3. Click the “View Data” button to view the data in a grid format or “Raw Data” to view the data in the JSON format.

Connecting to Microsoft Graph API


Microsoft Graph provides a unified programmability model that you can use to take advantage of the tremendous amount of data in Microsoft 365, Azure Active Directory, and other Microsoft services.

Read about Microsoft Graph and Microsoft Graph API.

This blog post provides step-by-step instructions for creating a connection to the Microsoft Graph API in Etlworks.

Registering your app with Azure AD endpoint

Assuming that you already have an Office 365 or Azure account, the first step is to create an app and register it with the Azure AD endpoint.

Step 1. Log in to the Azure Portal and select “Azure Active Directory” in the left side-bar.

Step 2. Select “App Registration” and click “New application registration”.

Step 3. Enter the name of the application, for example, Office 365 connector, select Web app / API in the “Application type” and enter https://api.etlworks.com in the “Sign-on URL”.

Step 4. Click “Create” and wait while the application is being created.

Step 5. Click “Settings” and then expand “Properties”. Copy and save the “Application ID” and “App ID URI” somewhere. We will use both later when creating a connection to the Microsoft Graph API.

Step 6. Expand “Required permissions” and click “Add”.

Step 7. Click “Select an API” and add all required APIs. For this example we will need Microsoft Graph.

Step 8. Click “Select permissions” and select Read all users’ full profiles.

Step 9. Click “Select” and “Done”.

Step 10. Expand “Keys”.

Step 11. Enter the key parameters, for example, a description (api key or graph api key) and an expiration (Never expires).

Step 12. Click “Save”. Copy and save the key value. It is not going to be available after you leave this screen.

Creating a connection to the Microsoft Graph API

In this section, we will show you how to create a connection to the Graph API endpoint which returns the users in the Azure Active Directory for the specific Office 365 account.

Step 1. Add a new HTTP connection in Etlworks.

Step 2. Enter the following URL in the “URL” field.

https://graph.microsoft.com/v1.0/users

Step 3. Select GET as a “Method” and application/json as a “Content Type Header”.

Step 4. Add the following “Header” (name&value)

name=content-type
value=application/json;odata.metadata=minimal;odata.streaming=true;IEEE754Compatible=false;charset=utf-8

Step 5. Select OAuth2 for “Authentication”. In the “User” field enter the “Application ID” from step 5 of the “Registering your app with Azure AD endpoint” section and in the “Password” field enter the key value from step 12.

Step 6. In the “Authentication URL” field enter the “App ID URI” from step 5 of the “Registering your app with Azure AD endpoint” section.

Step 7. Select POST for “HTTP Method for Token and OAuth2 Authentication”.

Step 8. Enter the following string in the “Authentication Request Payload” field:

&client_id={user}
&scope=https%3A%2F%2Fgraph.microsoft.com%2F.default
&client_secret={password}
&grant_type=client_credentials

Step 9. Select application/x-www-form-urlencoded as an “Authentication Request Content Type”.

Step 10. Give the connection a name and save it.
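
For reference, the same token exchange and Graph call can be sketched outside Etlworks in plain JavaScript (Node.js 18+ with the built-in fetch). Note that this sketch uses the Microsoft identity platform token endpoint with a tenant ID placeholder rather than the “App ID URI”; the application ID, key value, and tenant ID are placeholders.

// Minimal sketch of the client_credentials flow against Microsoft Graph (placeholders throughout).
const TENANT_ID = 'your-tenant-id';          // placeholder: your Azure AD tenant
const CLIENT_ID = 'your-application-id';     // "Application ID" from the app registration
const CLIENT_SECRET = 'your-key-value';      // key value from step 12

async function listUsers() {
  // Request a token with the .default scope, as in the "Authentication Request Payload" above
  const tokenResponse = await fetch(
    `https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/token`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body: new URLSearchParams({
        client_id: CLIENT_ID,
        client_secret: CLIENT_SECRET,
        scope: 'https://graph.microsoft.com/.default',
        grant_type: 'client_credentials'
      })
    }
  );
  const { access_token } = await tokenResponse.json();

  // Call the endpoint that returns the users in Azure Active Directory
  const usersResponse = await fetch('https://graph.microsoft.com/v1.0/users',
    { headers: { Authorization: `Bearer ${access_token}` } });
  return usersResponse.json();
}

listUsers().then((users) => console.log(users));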

Testing Microsoft Graph API connection

The following shows you how to test the connection using Explorer.

Step 1. If you don’t have one already, create a JSON format. Use all the default settings.

Step 2. Go to Explorer, select the connection created in the previous section and click the “expand” (down-arrow) button. Link the format created in step 1 to the connection.

Step 3. Click the “View Data” button to view the data in a grid format or “Raw Data” to view the data in the JSON format.

Connecting to Salesforce API


Updated (5/7/2019). Etlworks now includes a native connector for Salesforce but this article is still relevant if you need to access various Salesforce APIs (for example a streaming REST API) not supported by the native connector.

In Etlworks, it is possible to connect to practically any HTTP-based API, and the Salesforce API is no exception. This blog post provides step-by-step instructions for creating a connection to the Salesforce REST API in Etlworks.

Creating a Connected App in Salesforce

Assuming that you already have a Salesforce account, the first step is to create a connected app in Salesforce.

Step 1. Log in to Salesforce and click the Setup icon (it looks like gears) in the top navigation banner.

Step 2. Search for “apps” and select the “Apps/App Manager” link in the left side-bar.

Step 3. Click the “New Connected App” button in the top right corner.

Step 4.  Enter all required parameters and check Enable OAuth settings.

Step 5. In the “Selected OAuth Scopes” settings, add the Full Access (full) scope.

Step 6. Enter the following URL in the “Callback URL” field:

https://app.etlworks.com/salesforce/callback

Step 7. Click the “Save” button and continue to the next screen (there will be a message saying that you need to wait for a few minutes before you can start using the app).

Step 8. On the next screen, under “API (Enable OAuth Settings)”, there will be a “Consumer Key” and a “Consumer Secret”. Copy and save them somewhere; we will need them later when creating a connection to the Salesforce API.

Creating a connection to Salesforce API

In this section, we will show you how to connect to the Query API, which takes a SQL-like query as a URL parameter and returns a JSON document for the requested object.

Step 1. Add a new HTTP connection in Etlworks.

Step 2. Enter the following URL in the “URL” field.

https://your_salesforce_instance.lightning.force.com/services/data/v20.0/query?q=SELECT+name+,+email+,+Username+,+LastName+,+FirstName+,+IsActive+,UserRole.Name+,+Id+from+User

As you can see, we are using a query API to get all the users under your Salesforce account, together with roles.

Step 3. Select GET as a “Method” and application/x-www-form-urlencoded as a “Content Type Header”.

Step 4. Select OAuth2 for “Authentication”. Enter the URL-encoded username in the “User or Access Key” field and the password in the “Password” field. For example, if the username is first.last@company.com, the encoded value is going to be first.last%40company.com.

Step 5. Enter the following string in the “Authentication URL” field:

https://login.salesforce.com/services/oauth2/token

Select POST in the “HTTP Method for Token and OAuth2 Authentication”.

Step 6. Enter the following string in the “Authentication Request Payload” field:

client_id=client_id&client_secret=client_secret&grant_type=password&username={user}&password={password}

where the client_id is the “Consumer Key” and the client_secret is the “Consumer Secret” from step 8 in the “Creating a Connected App in Salesforce” section.

Step 7. Select application/x-www-form-urlencoded as an “Authentication Request Content Type”.

Step 8. Give the connection a name and save it.
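
For reference, the same username-password OAuth flow can be sketched outside Etlworks in plain JavaScript (Node.js 18+ with the built-in fetch). All credentials below are placeholders; depending on your Salesforce security settings, the password may need the user’s security token appended to it.

// Minimal sketch of the username-password OAuth flow and the query call (placeholders throughout).
const CLIENT_ID = 'consumer-key';         // "Consumer Key" from the connected app
const CLIENT_SECRET = 'consumer-secret';  // "Consumer Secret" from the connected app
const USERNAME = 'first.last@company.com';
const PASSWORD = 'password';              // may need the security token appended

const QUERY = 'SELECT+name+,+email+,+Username+,+LastName+,+FirstName+,+IsActive+,UserRole.Name+,+Id+from+User';

async function queryUsers() {
  // Token request, mirroring the "Authentication Request Payload" above
  const tokenResponse = await fetch('https://login.salesforce.com/services/oauth2/token', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({
      grant_type: 'password',
      client_id: CLIENT_ID,
      client_secret: CLIENT_SECRET,
      username: USERNAME,
      password: PASSWORD
    })
  });
  const { access_token, instance_url } = await tokenResponse.json();

  // Query API call for the users and their roles
  const queryResponse = await fetch(`${instance_url}/services/data/v20.0/query?q=${QUERY}`,
    { headers: { Authorization: `Bearer ${access_token}` } });
  return queryResponse.json();
}

queryUsers().then((result) => console.log(result));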

Testing Salesforce connection

The following shows you how to test the connection to the Salesforce API using Explorer.

Step 1. If you don’t have one already, create a JSON format. Use all the default settings.

Step 2. Go to Explorer, select the connection created in the previous section and click the “expand” (down-arrow) button. Link the format created in step 1 to the Salesforce connection.

Step 3. Click the “View Data” button to view the data in a grid format or “Raw Data” to view the data in the JSON format.

Working with paginated API

In this blog post, I will discuss how to work with paginated APIs in Etlworks. 

Most (but not all) APIs put a limit on the amount of data that can be returned or uploaded in a single request. They provide a pagination mechanism by responding with information on how to access the next “page” (hence pagination).

In Etlworks, retrieving data from a paginated API can be implemented using a nested flow with a loop, together with a technique called dynamic URLs.

Common pagination patterns

Let’s look at the most common pagination patterns:

  • The API where the amount of data is controlled by the parameters in the request and the loop stops when there is no more data to return. I will be using the Jira API to illustrate the example.
  • The API where the server controls the amount of data by providing the “next marker” in the response, so the loop stops when there is no “next marker”. I will be using an API provided by an organization called UKRI Gateway to illustrate the example.

API with the page number (JIRA)

The URL for a typical Jira API endpoint looks like this:

https://company.atlassian.net/rest/api/3/search?jql=query
&startAt=pointer&maxResults=10

When called it returns the following JSON:

{"expand": "schema,names",
    "startAt": 0,
    "maxResults": 10,
    "total": 2823,
    "results": [...]}

As you can see, the request contains the starting pointer and the maximum number of results. The response includes the total number of results.

So, the algorithm to iterate through the pages will look like this:

  1. Make the first call with startAt=0 and maxResults=1.
  2. Extract total from the response and calculate the total number of calls which need to be made to the endpoint with maxResults=100 (100 is the maximum allowed number of results).
  3. Create a loop to iterate through the pages. The loop ends when the number of calls made to the endpoint reaches the total number of calls calculated in step 2 (a standalone sketch of this loop follows the list).
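
For example, with total=2823 and maxResults=100, the loop makes ceil(2823 / 100) = 29 calls. Outside Etlworks, the same algorithm could be sketched in plain JavaScript (Node.js 18+ with the built-in fetch); the site name, JQL query, and credentials are placeholders:

// Conceptual sketch of the startAt/maxResults loop (placeholders throughout).
const BASE = 'https://company.atlassian.net/rest/api/3/search?jql=project%20%3D%20TEST';
// JIRA Cloud uses preemptive basic authentication (email + API token)
const AUTH = 'Basic ' + Buffer.from('first.last@company.com:api-token').toString('base64');
const PAGE_SIZE = 100;

async function fetchPage(startAt, maxResults) {
  const response = await fetch(`${BASE}&startAt=${startAt}&maxResults=${maxResults}`,
    { headers: { Authorization: AUTH } });
  return response.json();
}

async function fetchAllIssues() {
  // Step 1: first call with maxResults=1, just to read the total
  const first = await fetchPage(0, 1);

  // Step 2: total number of calls with maxResults=100
  const totalPages = Math.ceil(first.total / PAGE_SIZE);

  // Step 3: loop through the pages
  const issues = [];
  for (let page = 0; page < totalPages; page++) {
    const body = await fetchPage(page * PAGE_SIZE, PAGE_SIZE);
    issues.push(...body.issues);
  }
  return issues;
}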

Now, let’s put it all together.

In step 1, we will need to create connections for the example.

  1. HTTP connection for the JIRA API. 
  2. Database connection to the PostgreSQL database.

The URL for the HTTP connection which returns JIRA tickets for the project TEST will look like below (we are using the version of the API for the JIRA cloud, so substitute company in the URL with your actual company name). Note that the JIRA API uses preemptive basic authentication.

https://company.atlassian.net/rest/api/3/search?jql=project%20%3D%20TEST
&startAt={startAt}&maxResults=100

Notice that the value of the startAt parameter in the URL includes the token {startAt}. The idea is that in the loop we will replace the {startAt} token with the actual values: 0, 100, 200, 300, etc.

For this example, I will be using a database loop so we will need one more connection to the temporary database, which will be used to drive the loop.

In step 2, we will be creating the flows.

1. Create an SQL flow which is used to create a temporary loop table using the SQL below:

create table pages (startAt integer)

2. Create a JavaScript flow to populate the loop table:

importPackage(com.toolsverse.etl.core.engine);
importPackage(com.toolsverse.util);
importPackage(com.toolsverse.config);
importPackage(com.toolsverse.etl.sql.util);
importPackage(java.util);

var pageSize = 100;

// Read the total number of records returned by the first call to the endpoint
var params = Extractor.lookup(etlConfig, scenario, "First", "params", "select total");
var totalNumberOfRecords = params.getFieldValue(params.getRecord(0), "total");

// Calculate the number of pages, adding one more page for the remainder
var numberOfPages = Math.floor(totalNumberOfRecords / pageSize);
var remainder = totalNumberOfRecords % pageSize;
numberOfPages = totalNumberOfRecords == 0 ? 0 :
    (numberOfPages == 0 ? 1 :
        (remainder > 0 ? numberOfPages + 1 : numberOfPages));

// Insert one row per page into the loop table; each row holds the startAt offset
var connection = etlConfig.getConnectionFactory().getConnection("Temporary database");
var sql = "insert into pages (startAt) values (?)";
var startAt = 0;

for (var page = 0; page < numberOfPages; page++) {
    SqlUtils.executeSql(connection, sql, startAt);
    startAt = startAt + pageSize;
}

3. Create a flow “Extract data from the Web Service and load into the database” where the source is a web service (Jira endpoint above) and the destination is a PostgreSQL database.

4.  Create a nested flow which combines flows 1, 2, and 3.

5.  Modify step 3 by configuring a database loop where the connection is a Temporary database and the driving SQL is:

select startAt as "startAt" from pages

Download the JIRA example from etlworks.com. You can import the example into your Etlworks account and try it out. Don’t forget to edit the PostgreSQL and JIRA connections before executing the flow.

Download a fully functional example for another paginated API with a page number query parameter from etlworks.com. You can import the example into your Etlworks account and try it out without changing anything. This flow pulls the data from the paginated API and loads it into a shared Google Sheet.

API with the next marker (UKRI Gateway)

For this example, I will be using the UKRI Gateway API endpoint with the following URL:

https://www.ebi.ac.uk/europepmc/webservices/rest/search
?query=italy&cursorMark={NEXT_MARKER}

If the value of the cursorMark query parameter is * (star character), the API returns the first page, which contains the data and, possibly, the next value for the cursorMark.

<response>
<nextCursorMark>535543435GUI999==</nextCursorMark>
<data>
...
</data>
</response>

So, the algorithm to iterate through the pages will look like this:

  1. Make the first call with cursorMark=*.
  2. Get the value for the cursorMark from the response by extracting the value of the nextCursorMark field.
  3. Create a loop to iterate through the pages. The loop ends when the response doesn’t have the nextCursorMark (a standalone sketch of this loop follows the list).
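
Outside Etlworks, the same algorithm could be sketched in plain JavaScript (Node.js 18+ with the built-in fetch); the marker is pulled out of the XML with a naive regular expression purely for illustration:

// Conceptual sketch of the cursorMark loop; stops when no next marker comes back.
const BASE = 'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=italy';

async function fetchAllPages() {
  const pages = [];
  let cursor = '*'; // the first call uses the star character

  while (cursor) {
    const response = await fetch(`${BASE}&cursorMark=${encodeURIComponent(cursor)}`);
    const xml = await response.text();
    pages.push(xml);

    // Extract nextCursorMark from the response shown above (naive parsing, for illustration only)
    const match = xml.match(/<nextCursorMark>([^<]*)<\/nextCursorMark>/);
    const next = match ? match[1] : null;

    // Stop when there is no marker, or when it stops changing (end of the result set)
    cursor = next && next !== cursor ? next : null;
  }

  return pages;
}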

Now, let’s put it all together.

In step 1, we will need to create connections for the example.

  1. HTTP connection for the UKRI Gateway API endpoint above.   
  2. Connection to the shared Google Sheet.

Notice that the value of the cursorMark query parameter is the token {NEXT_MARKER}. The idea is to replace the token with the next marker, make the API call, and repeat until there is no next marker anymore.

We will be loading data into the shared Google Sheet by calling the API endpoint multiple times so we will need a connection to the temporary staging database to dump the data from the API before loading the whole thing into the Google Sheet.

In step 2, we will be creating the flows.

1. Create the JavaScript flow to set the initial value of the {NEXT_MARKER} token to * (star character).

importPackage(com.toolsverse.config);
SystemConfig.instance().getProperties().put("NEXT_MARKER", "*");

2. Create a flow “Extract data from the Web Service and load into the database” where the source is a web service and the destination is a temporary database.

3. Click the “Mapping->Additional Transformations” and add the following JavaScript as a value for the After Extract transformation:

importPackage(com.toolsverse.util);
importPackage(com.toolsverse.config);

// Read the next marker from the first record of the extracted dataset
var nextCursorMark = dataSet.getFieldValue(dataSet.getRecord(0), "nextCursorMark");

if (Utils.isNothing(nextCursorMark)) {
    nextCursorMark = "";
}

// Store the marker so the loop script can pick it up on the next iteration
SystemConfig.instance().getProperties().put("NEXT_MARKER", nextCursorMark);

4. Create a flow “Extract data from database, push into the well known API” where the source is a temporary database and the destination is a shared Google Sheet.

5. Create a nested flow which combines flows 1, 2, 3, and 4.

Modify step 2 by configuring a JavaScript loop, where the JavaScript for the loop is:

importPackage(com.toolsverse.config);
importPackage(com.toolsverse.util);

// Continue the loop only while there is a next marker
var nextCursorMark = SystemConfig.instance().getProperties().get("NEXT_MARKER");

value = !Utils.isNothing(nextCursorMark) ? nextCursorMark : null;

Download a fully functional example for a paginated API with a next marker from etlworks.com. You can import the example into your Etlworks account and try it out without changing anything. This flow pulls the data from the paginated API and loads it into a shared Google Sheet.