Etlworks Marketo Integration

What is Marketo?

Marketo is a cloud-based lead management and marketing solution. Its products are offered on a subscription basis and cover Lead Management, Sales Insights, Revenue Cycle Analytics, and Social Marketing applications. It helps organizations automate and measure marketing engagement, tasks, and workflows, including those for email, mobile, social, and digital ads.

What is Etlworks?

Etlworks is a cloud-native integration platform that helps businesses automate manual data management tasks, improve data accuracy, accelerate business workflows, and gain greater operational visibility.

After a few minutes of setup, Etlworks replicates all your applications, databases, events, and files into a high-performance data warehouse such as Snowflake or Amazon Redshift, so that you can use your favorite BI or analytics tools. Create reports, monitor custom dashboards, and more, directly from the cloud.

Connect Marketo to Anything

Etlworks offers connectivity to Marketo’s APIs, enabling you to work with key Marketo entities including Lead, Activity, List, Opportunity, and OpportunityRole, as well as Custom Objects. Etlworks exposes both the SOAP and REST APIs for Marketo, so you can handle a wide range of integration tasks.

Use the Etlworks Marketo connector for data integration between Marketo and your CRM system, such as Salesforce, MS Dynamics, SugarCRM, HubSpot, and NetSuite; collaboration or survey tools; webinar platforms; data services; marketing databases; and more.

The Etlworks Marketo connector frees you to focus on insights, so your company can optimize its marketing performance and improve its campaigns’ ROI faster and more efficiently.

Etlworks partnered with CData to provide access to the Marketo API using the industry-standard JDBC protocol.

Let’s do it!

Connecting to Marketo

Step 1. Obtaining the OAuthClientId and OAuthClientSecret values. To obtain the OAuthClientId and OAuthClientSecret, navigate to the LaunchPoint option in the Admin area. Click the View Details link for the desired service. A window containing the authentication credentials is displayed.

Step 2. Obtaining the REST Endpoint URL. The RESTEndpoint can be found in your Marketo Admin area under Integration -> Web Services, in the REST API section. Note that the Identity Endpoint is not needed.

Step 3. Enable the Marketo connector for your Etlworks account. Contact support@etlworks.com to enable the connector.

Step 4. Create a Marketo connection to work with data in Marketo.
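
For readers curious about what sits underneath the connection form: the values collected in steps 1 and 2 end up as properties of a JDBC connection. The sketch below assembles them into a semicolon-separated URL. The jdbc:marketo: prefix and the property layout are assumptions based on the usual CData driver convention, and the endpoint and credentials are placeholders. In Etlworks you simply enter these values on the connection screen.

```python
def build_marketo_jdbc_url(rest_endpoint, client_id, client_secret):
    """Assemble a JDBC-style URL from the three values gathered above.

    The "jdbc:marketo:" prefix and the key=value; layout are assumptions,
    not something confirmed by this article.
    """
    properties = {
        "OAuthClientId": client_id,
        "OAuthClientSecret": client_secret,
        "RESTEndpoint": rest_endpoint,
    }
    return "jdbc:marketo:" + "".join(f"{key}={value};" for key, value in properties.items())


# Placeholder values for illustration only.
url = build_marketo_jdbc_url(
    rest_endpoint="https://example.mktorest.com/rest",
    client_id="my-client-id",
    client_secret="my-client-secret",
)
print(url)
```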

Stored Procedures

Stored procedures complement the data available from the REST Data Model. Sometimes the data exposed by a view cannot be updated directly in a table-like, two-way manner. In these situations, the data is retrieved using the appropriate view or table, while the update is done by calling a stored procedure. Stored procedures take a list of parameters and return a dataset containing the collection of tuples that constitute the response.

To call a stored procedure from the SQL flow or from Before/After SQL, use the EXEC sp_name param=value syntax. Example:

EXEC SelectEntries ObjectName = 'Account'
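
If you generate such calls from a script, a small helper can keep the quoting consistent. This is only a sketch: the build_exec function is hypothetical, and separating multiple parameters with commas is an assumption that the article does not confirm.

```python
def build_exec(procedure, **params):
    """Build an EXEC statement in the form shown above.

    String values are wrapped in single quotes (embedded quotes are doubled);
    numbers are written as-is. The comma between parameters is an assumption.
    """
    def literal(value):
        if isinstance(value, (int, float)):
            return str(value)
        return "'" + str(value).replace("'", "''") + "'"

    args = ", ".join(f"{name} = {literal(value)}" for name, value in params.items())
    return f"EXEC {procedure} {args}".rstrip()


statement = build_exec("SelectEntries", ObjectName="Account")
print(statement)
```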

Extracting data from Marketo

Note: extracting data from Marketo is similar to extracting data from a relational database.

Step 1. Create a Marketo connection which will be used as a source (FROM).

Step 2. Create a destination connection, for example, a connection to a relational database, and, if needed, a format (a format is not needed if the destination is a database or a well-known API).

Step 3. Create a flow where the source is a database and the destination is the connection created in step 2, for example, a relational database.

Step 4. Add a new source-to-destination transformation.

Step 5. Select the Marketo connection created in step 1 as the source connection and select the Marketo object you are extracting data from.

Step 6. Select the TO connection, the format (if needed), and the object (for example, a database table) to load data into.

Step 7. Click MAPPING and optionally enter a Source Query (you don’t need a query if you are extracting data from the Marketo object unconditionally).

Step 8. Optionally define the per-field mapping.

Step 9. Add more transformations if needed.

Loading data in Marketo

Note: loading data into Marketo is similar to loading data into a relational database.

Step 1. Create a source connection and a format (if needed).

Step 2. Create a destination Marketo connection.

Step 3. Create a flow where the destination is a database.

Step 4. Add a new source-to-destination transformation.

Step 5. Select the FROM and TO connections and objects (and a FROM format if needed).

Step 6. Optionally define the per-field mapping.

Step 7. Add more transformations if needed.

Browsing data in Marketo

You must have a Marketo connection to browse objects and run SQL queries.

Use Explorer to browse data and metadata in Marketo, as well as to execute DML and SELECT queries against the Marketo connection.

Ready to get started?

Contact Etlworks today to connect your Marketo instance with Etlworks and unlock the ability to read and replicate many of the objects to your data destination.

ETL/ELT all your data into Amazon Redshift DW

Amazon Redshift is fast, scalable, and easy-to-use, making it a popular data warehouse solution. Redshift is straightforward to query with SQL, efficient for analytical queries and can be a simple add-on for any organization operating its tech stack on AWS.

Amazon Web Services has many benefits. Whether you choose it for the pay-as-you-go pricing, the high performance and speed, or its versatile and flexible services, we are here to present the data loading approaches that work best for us.

Etlworks allows you to load data from cloud storage, APIs, SQL and NoSQL databases, and web services into an Amazon Redshift data warehouse in a few simple steps. You can configure and schedule the flow using an intuitive drag-and-drop interface and let Etlworks do the rest.

Etlworks supports more than one-time data loading. It can integrate your data sources with Amazon Redshift and automatically keep Amazon Redshift updated with fresh data, with no additional effort or involvement.

Today we are going to examine how to load data into Amazon Redshift.

A typical Redshift flow performs the following operations:

  • Extract data from the source.
  • Create CSV files.
  • Compress files using the gzip algorithm.
  • Copy files into an Amazon S3 bucket.
  • Check whether the destination Amazon Redshift table exists and, if it does not, create the table using metadata from the source.
  • Execute the Amazon Redshift COPY command.
  • Clean up the remaining files.
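
To make these operations more concrete, here is a minimal Python sketch of the local part of such a flow: it writes sample rows to a gzip-compressed CSV file, builds the COPY statement that would load it, and cleans up. It is not how Etlworks implements the flow internally. The table name, bucket, and IAM role are placeholders, and the S3 upload and the table check are only marked with comments.

```python
import csv
import gzip
import os
import tempfile


def write_gzip_csv(rows, header, path):
    """Write rows to a gzip-compressed CSV file and return its path."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def build_copy_command(table, s3_uri, iam_role):
    """Build a Redshift COPY statement for a gzipped CSV file with a header row."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' "
        "CSV GZIP IGNOREHEADER 1;"
    )


# Extract (hard-coded sample rows here), create the CSV file, and gzip it.
work_dir = tempfile.mkdtemp()
local_file = write_gzip_csv(
    rows=[(1, "alice"), (2, "bob")],
    header=["id", "name"],
    path=os.path.join(work_dir, "customers.csv.gz"),
)

# Read the file back only to show what was written.
with gzip.open(local_file, "rt", newline="", encoding="utf-8") as f:
    written_rows = list(csv.reader(f))

# Uploading local_file to S3 and checking/creating the table would happen here.
copy_sql = build_copy_command(
    table="public.customers",  # placeholder table
    s3_uri="s3://example-bucket/customers.csv.gz",  # placeholder bucket
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftRole",  # placeholder role
)
print(copy_sql)

# Clean up the remaining files.
os.remove(local_file)
os.rmdir(work_dir)
```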

Some prerequisites have to be met before you can design a flow that loads data into Amazon Redshift.

Now, you are ready to create a Redshift flow. Start by opening the Flows window, clicking the + button, and typing redshift into the search field:

Continue by selecting the flow type, adding source-to-destination transformations and entering the transformation parameters:

You can select one of the following sources (FROM) for the Redshift flow:

  • API – use any appropriate string as the source (FROM) name
  • Web Service – use any appropriate string as the source (FROM) name
  • File – use the source file name or a wildcard filename as the source (FROM) name
  • Database – use the table name as the source (FROM) name
  • CDC – use the fully qualified table name as the source (FROM) name
  • Queue – use the queue topic name as the source (FROM) name

For most Redshift flows, the destination (TO) is going to be an Amazon S3 connection. To configure the final destination, click the Connections tab and select the available Amazon Redshift connection.

Amazon Redshift can load data from CSV, JSON, and Avro formats, but Etlworks supports loading only from CSV, so you will need to create a new CSV format and set it as the destination format. If you are loading large datasets into Amazon Redshift, consider configuring the format to split the document into smaller files. Amazon Redshift can load files in parallel, and transferring smaller files over the network can be faster.
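
To picture what splitting into smaller files means, the sketch below cuts a stream of rows into several CSV documents, each with its own header row. In Etlworks this is a setting on the format, not code you write; the split_csv function and the part_NNNN.csv naming are hypothetical.

```python
import csv
import io
from itertools import islice


def split_csv(rows, header, max_rows_per_file):
    """Yield (file_name, csv_text) pairs, each holding at most max_rows_per_file rows."""
    iterator = iter(rows)
    part = 0
    while True:
        chunk = list(islice(iterator, max_rows_per_file))
        if not chunk:
            break
        part += 1
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)  # every part repeats the header
        writer.writerows(chunk)
        yield f"part_{part:04d}.csv", buffer.getvalue()


# Five rows split into files of at most two rows each.
parts = list(split_csv([(i,) for i in range(5)], ["id"], max_rows_per_file=2))
for name, text in parts:
    print(name, repr(text))
```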

If necessary, you can create a mapping between the source and destination (Redshift) fields.

Mapping is not required, but remember that if a source field name is not supported by Redshift, Redshift returns an error and the data is not loaded into the database. For example, if you are loading data from Google Analytics, the output (source) is going to include fields with the prefix ga: (ga:user, ga:browser, etc.). Amazon Redshift does not support field names containing a colon (:), so the data will be rejected. If that happens, you can use mapping to rename the destination fields.
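
The renaming itself is mechanical. As a sketch of the idea (the mapping is normally configured in the Etlworks UI), the hypothetical helper below replaces every character outside letters, digits, and underscores with an underscore. This is a deliberately conservative rule, not the full Redshift identifier grammar.

```python
import re


def to_redshift_column(name):
    """Turn a source field name into a conservative column name."""
    cleaned = re.sub(r"[^0-9a-zA-Z_]", "_", name).lower()
    if re.match(r"[0-9]", cleaned):
        cleaned = "_" + cleaned  # avoid a leading digit
    return cleaned


def build_mapping(source_fields):
    """Map each source field to its cleaned-up destination name."""
    return {field: to_redshift_column(field) for field in source_fields}


mapping = build_mapping(["ga:user", "ga:browser", "sessions"])
print(mapping)
```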

ELT for Amazon Redshift

Amazon Redshift provides affordable and nearly unlimited computing power, which allows you to load data into Amazon Redshift as-is, without pre-aggregation, and to process and transform it quickly when executing analytics queries. Thus, the ETL (Extract-Transform-Load) approach becomes ELT (Extract-Load-Transform). This can greatly simplify loading data into Amazon Redshift, as you don’t need to work out the necessary transformations up front.

Etlworks supports executing complex ELT scripts directly in Amazon Redshift, which greatly improves the performance and reliability of data ingestion.
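
The load-first, transform-later pattern can be illustrated with a small script. In this sketch, Python's built-in sqlite3 module stands in for Amazon Redshift, and the raw_orders and daily_totals tables are made up: the raw rows are loaded untouched, and the aggregation then runs as SQL inside the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse

# Load: raw rows go in as-is, with no pre-aggregation.
conn.execute("CREATE TABLE raw_orders (order_day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("2019-03-01", 10.0), ("2019-03-01", 5.0), ("2019-03-02", 7.5)],
)

# Transform: the aggregation is executed by the database itself.
conn.execute(
    "CREATE TABLE daily_totals AS "
    "SELECT order_day, SUM(amount) AS total FROM raw_orders GROUP BY order_day"
)

totals = conn.execute(
    "SELECT order_day, total FROM daily_totals ORDER BY order_day"
).fetchall()
print(totals)
```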

I hope this has been helpful. Go forth and load large amounts of data.

Data Replication Methods

Data replication takes data from your source databases — Oracle, MySQL, Microsoft SQL Server, PostgreSQL, MongoDB, etc. — and copies it into your destination data warehouse. After you have identified the data you want to bring in, you need to determine the best way to replicate the data so it meets your business needs.

Choosing the right method

The method you choose impacts the end state of your data. Fortunately, there are data replication methods built to integrate with today’s data warehouses and suit a variety of use cases. At Etlworks, we believe in providing users with as much flexibility as possible. Let’s discuss each of the five methods of data replication and outline the option that may be best for you.

High Watermark (HWM)

The term watermark comes from the aftermath of a flood, when you look at the water stains on a wall to figure out how high the water got. That is essentially what we want to do: figure out which item was updated last and continue from there. A watermark is therefore a tool that simplifies querying for updated objects, which is a very common use case when synchronizing data.

Pros:

  • fast
  • works for all data sources, including all databases, files, and APIs

Cons:

  • does not support deletes
  • requires a dedicated high watermark field in each table
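
A minimal sketch of the high watermark technique, using Python's built-in sqlite3 module as a stand-in source database. The customers table and its updated_at column (the dedicated watermark field) are hypothetical. Each run reads only the rows changed since the previous watermark and then advances it.

```python
import sqlite3


def extract_changes(conn, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "alice", "2019-03-01 10:00:00"), (2, "bob", "2019-03-02 09:30:00")],
)

# First run: the empty watermark sorts before every timestamp, so all rows are read.
first_batch, watermark = extract_changes(conn, "")

# A row changes in the source; the next run picks up only that row.
conn.execute("UPDATE customers SET name = 'bobby', updated_at = '2019-03-03 08:00:00' WHERE id = 2")
second_batch, watermark = extract_changes(conn, watermark)
print(second_batch, watermark)
```

Note that a row deleted from the source simply stops appearing in the query results, which is why this method cannot replicate deletes.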

Change Data Capture (CDC)

CDC is an approach to data integration that is based on the identification, capture, and delivery of the changes made to the source database and stored in the database ‘redo log’, also called ‘transaction log’. CDC or Log Replication is the fastest and most reliable way to replicate. It involves querying your database’s internal change log every few seconds, copying the changes into the data warehouse, and incorporating them frequently. CDC is the best method for databases that are being updated continually and fully supports deletes.

Pros:

  • fast
  • no polling from database tables – uses database redo log instead
  • supports deletes
  • enables near real-time replication

Cons:

  • currently supports only Postgres, MySQL, SQL Server, and Oracle
  • some older versions of the databases above do not support CDC
  • requires extra setup in the source database

Database Triggers

Trigger-based change replication can be implemented in many ways, but the basic idea is that each table that participates in change replication as a source has triggers for INSERT, UPDATE, and optionally DELETE. The triggers update the shadow table (or tables). The shadow tables may store the entire row to keep track of every single column change, or only the primary key and the operation type (insert, update, or delete).

Pros:

  • works for any source database which has triggers
  • no extra requirements for the specific version of the database or extra field in each table

Cons:

  • requires adding triggers to all database tables
  • triggers can negatively impact performance
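
Here is a small, self-contained sketch of the trigger approach, using SQLite through Python's built-in sqlite3 module. The customers table and the customers_changes shadow table are hypothetical, and the shadow table stores only the primary key and the operation type. Trigger syntax differs between databases, so treat this as an illustration of the idea only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

-- Shadow table: primary key of the changed row plus the operation type.
CREATE TABLE customers_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    id INTEGER,
    operation TEXT
);

CREATE TRIGGER customers_ins AFTER INSERT ON customers
BEGIN
    INSERT INTO customers_changes (id, operation) VALUES (NEW.id, 'insert');
END;

CREATE TRIGGER customers_upd AFTER UPDATE ON customers
BEGIN
    INSERT INTO customers_changes (id, operation) VALUES (NEW.id, 'update');
END;

CREATE TRIGGER customers_del AFTER DELETE ON customers
BEGIN
    INSERT INTO customers_changes (id, operation) VALUES (OLD.id, 'delete');
END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'alice')")
conn.execute("UPDATE customers SET name = 'alicia' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

# The replication job would read this table in order and apply the changes.
changes = conn.execute(
    "SELECT id, operation FROM customers_changes ORDER BY change_id"
).fetchall()
print(changes)
```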

Real-time CDC with Kafka

Apache Kafka is a popular technology to share data between systems and build applications that respond to data events. Etlworks completes Apache Kafka solutions by delivering high-performance real-time data integration.

Etlworks parses the CDC events emitted to the Kafka topic, automatically transforms events to the DML SQL statements (INSERT/UPDATE/DELETE), and executes SQL statements in the target database in the order they were created. It also handles the collisions and errors, ensuring that the solution is 100% reliable.

Pros:

  • fast
  • no polling from database tables
  • supports deletes
  • supports real-time replication

Cons:

  • complex setup (requires Kafka, Zookeeper, Kafka Connect, and Debezium)
  • supports only Postgres, MySQL, SQL Server, Oracle, and MongoDB
  • some older versions of the databases above do not support CDC
  • requires extra setup in the source database

Full refresh

Sometimes the simplest approach is the best. The full refresh replication method is best for small tables, static data, or one-time imports. Because a full refresh takes time to perform, it is typically slower than the other methods.

Pros:

  • the simplest to set up
  • can be quite fast for relatively small datasets (<100K records)
  • works for all data sources

Cons:

  • not recommended for large datasets
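
A full refresh can be sketched in a few lines: delete everything in the destination table and reload it from the source inside a single transaction, so that readers never see a half-empty table. The example uses Python's built-in sqlite3 module as a stand-in destination, and the countries table is hypothetical.

```python
import sqlite3


def full_refresh(conn, table, columns, rows):
    """Replace the entire contents of table with rows in one transaction.

    The table and column names are interpolated into the SQL, so they must
    come from trusted configuration, never from user input.
    """
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # commits on success, rolls back on error
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO countries VALUES ('XX', 'stale row')")

full_refresh(conn, "countries", ["code", "name"], [("US", "United States"), ("CA", "Canada")])
current = conn.execute("SELECT code, name FROM countries ORDER BY code").fetchall()
print(current)
```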

Want to learn more about our replication options and what’s best for your data? Talk to us!

Real-time change replication with Kafka and Debezium

In this article, I demonstrate how to implement [near] real-time change replication based on Change Data Capture (CDC) for the most popular databases using the following technologies:

  • Native CDC for each source database
  • Apache Kafka
  • Debezium
  • Etlworks Kafka connector with built-in support for Debezium

Overview

Change Data Capture (CDC), as its name suggests, is a design pattern that captures individual data changes instead of dealing with the entire dataset. Instead of dumping your entire database, with CDC you capture just the data changes made to the master database and apply them to the BI databases to keep both databases in sync.

Debezium is a CDC tool that can stream changes from Microsoft SQL Server, MySQL, MongoDB, Oracle and PostgreSQL into Kafka, using Kafka Connect.

Kafka Connect is a tool for streaming data between Apache Kafka and external systems. It is used to define connectors that move large collections of data into and out of Kafka.

Etlworks Integrator parses the CDC events emitted to the Kafka topic, automatically transforms events to the DML SQL statements (INSERT/UPDATE/DELETE), and executes SQL statements in the target database in the order they were created. It also handles the collisions and errors, ensuring that the solution is 100% reliable.

Anything can be a destination: SQL and NoSQL data sources, online data warehouses such as Snowflake and Redshift, files, API endpoints, etc. The following databases are supported as a source:

  • Microsoft SQL Server
  • MongoDB
  • MySQL
  • Oracle
  • PostgreSQL

Prerequisites

The solution requires installing and configuring Apache Kafka, Kafka Connect, and Debezium.

Installing and configuring Kafka and Kafka Connect (the Kafka component required for CDC) is not part of this tutorial. In most cases, installing Kafka is as easy as downloading the latest version of the standalone or dockerized Kafka and Zookeeper. Kafka Connect is typically included in all distribution packages, so there is nothing to install. Debezium can be installed as a plugin for Kafka Connect by copying the required libraries to the KAFKA_HOME/plugin folder.

For customers on Enterprise plans, Etlworks installs all required components. We also assist self-hosted customers with installing the components.

Please contact Etlworks support if you want to enable a real-time change replication for your account.

Solution

The CDC events are serialized as JSON or Avro documents and can be transformed using any of the transformations available in Etlworks.

The basic setup for real-time CDC-based change replication in Etlworks is very easy:

  1. Set up CDC for the source database.
  2. Configure Debezium to capture CDC events and publish them to the Kafka topic(s).
  3. Create a change replication flow where the source is a Kafka topic and the destination is a target database table.
  4. Schedule the flow.

1. Set up CDC for the source database

Enabling CDC is different for each database. Please use the following tutorials:

2. Configure Debezium to capture CDC events and publish them to the Kafka topic(s)

Assuming that Debezium is already installed as a Kafka Connect plugin and is up and running, we will configure a connector to the source database using the Kafka Connect REST API. In this tutorial, we will be using Microsoft SQL Server, but configuring connectors to other databases is equally simple.

Official tutorials:

The Debezium connectors are created using the Kafka Connect REST API, so make sure either curl or Postman is installed on your development machine. In this tutorial, we will be using Postman.

Step 1. Verify that Kafka Connect is installed and running.

The default port for Kafka Connect API is 8083. Assuming that it runs on localhost the URL for the API endpoint which returns configured connectors is:

http://localhost:8083/connectors
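
If you prefer a script to Postman or curl, the same check can be made with Python's standard library. This is a sketch that assumes Kafka Connect is reachable at the default localhost:8083 address used in this tutorial; the function names are hypothetical.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # default Kafka Connect address from this tutorial


def list_connectors_request(base_url=CONNECT_URL):
    """Build the GET request for the endpoint that lists configured connectors."""
    return urllib.request.Request(
        f"{base_url}/connectors",
        method="GET",
        headers={"Accept": "application/json"},
    )


def list_connectors(base_url=CONNECT_URL, timeout=5):
    """Send the request and return the connector names as a Python list."""
    with urllib.request.urlopen(list_connectors_request(base_url), timeout=timeout) as response:
        return json.load(response)


request = list_connectors_request()
print(request.get_method(), request.full_url)
```

Calling list_connectors() against a running Kafka Connect instance returns a JSON array of connector names, which is empty if no connectors have been created yet.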

Step 2. Create a new connector for Microsoft SQL Server.

ENDPOINT URL: http://localhost:8083/connectors 

METHOD: POST

PAYLOAD (example):

{
  "name": "sqlserver-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "tasks.max": "1",
    "database.hostname": "localhost",
    "database.port": "1433",
    "database.user": "sa",
    "database.password": "password",
    "database.dbname": "database_name",
    "database.server.name": "database_server_name",
    "table.whitelist": "comma separated list of fully_qualified_table_names",
    "database.history.kafka.bootstrap.servers": "localhost:9092",
    "database.history.kafka.topic": "dbhistory.database_server_name.database_name",

    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.unwrap.operation.header": "true",

    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "include.schema.changes": "false"
  }
}
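
The same POST can be scripted. The sketch below only builds the request with Python's standard library and does not send it; the function name is hypothetical, and the config shown is an abbreviated subset of the payload above, so copy in the full set of properties before using it.

```python
import json
import urllib.request


def create_connector_request(name, config, base_url="http://localhost:8083"):
    """Build the POST request that registers a connector with Kafka Connect."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/connectors",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Abbreviated config for illustration; the real payload needs all properties shown above.
request = create_connector_request(
    "sqlserver-connector",
    {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "localhost",
        "database.port": "1433",
    },
)
print(request.get_method(), request.full_url)

# To actually send it against a running Kafka Connect instance:
#     urllib.request.urlopen(request, timeout=10)
```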

CDC Event Flattening

The payload contains two important sections required for integrating the Etlworks Kafka connector with Debezium:

1. The unwrap transformation, which creates a flat version of the CDC event:

"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.operation.header": "true",

2. The settings which remove the schema information from the CDC event:

"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",

The naming convention for Kafka topics

Debezium stores CDC events in a separate topic for each table. If the connector was configured using the following parameters:

"database.dbname": "database_name",
"database.server.name": "database_server_name",

the CDC events for the table cdc_test will be stored in a Kafka topic database_server_name.database_name.cdc_test.
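
The convention is easy to express in code. The sketch below follows the naming described in this article (server name, database name, table name, joined by dots) and also shows how a wildcard pattern ending in .* matches every table topic. The helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def topic_for_table(server_name, database_name, table_name):
    """Compose the topic name using the convention described above."""
    return f"{server_name}.{database_name}.{table_name}"


def table_from_topic(topic):
    """Recover the table name, which is the last dot-separated part of the topic."""
    return topic.rsplit(".", 1)[-1]


topic = topic_for_table("database_server_name", "database_name", "cdc_test")
print(topic)

# A wildcard pattern selects the topics of all tables at once.
matches = fnmatchcase(topic, "database_server_name.database_name.*")
print(matches, table_from_topic(topic))
```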

Examples of the generated CDC events

Etlworks automatically parses CDC events stored in a Kafka topic, so you don’t have to deal with this, but it is still a good idea to learn exactly what the generated events look like for the different DML statements.

Assuming that we are using the JSON format for serialization of the CDC events (the default), and that the source table was created using the following SQL:

CREATE TABLE dbo.cdc_test(id INT,
 NAME VARCHAR(255),
 changed DATETIME,
 PRIMARY KEY (id))

the generated CDC events will look like the following:

INSERT INTO dbo.cdc_test (id,name,changed) values (1,'test1',CURRENT_TIMESTAMP)

Key: {"id":1}

Value: {"id":1,"name":"test1","changed":1552064244733}

Header: __debezium-operation=c

UPDATE dbo.cdc_test SET name = 'updated-test1-1', changed=CURRENT_TIMESTAMP 
WHERE id = 1

Key: {"id":1}

Value: {"id":1,"name":"updated-test1-1","changed":1552064295845}

Header: __debezium-operation=u

DELETE FROM dbo.cdc_test WHERE id = 1

Key: {"id":1}

Value: none

Header: __debezium-operation=DELETE
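
To show how flattened events like these can be turned back into DML, here is a minimal sketch. It is an illustration of the idea, not the Etlworks implementation: the event_to_sql function is hypothetical, a missing value is treated as a delete regardless of the header, and the statements use ? placeholders with a separate list of bind values.

```python
def event_to_sql(table, key, value, operation):
    """Translate one flattened CDC event into (sql, bind_values).

    key and value are the parsed JSON documents; operation is the value of
    the __debezium-operation header. Table and column names are assumed to
    come from trusted metadata, since they are interpolated into the SQL.
    """
    where = " AND ".join(f"{column} = ?" for column in key)
    if value is None:  # delete events carry no value
        return f"DELETE FROM {table} WHERE {where}", list(key.values())
    if operation == "c":
        columns = ", ".join(value)
        marks = ", ".join("?" for _ in value)
        return f"INSERT INTO {table} ({columns}) VALUES ({marks})", list(value.values())
    if operation == "u":
        changed = {c: v for c, v in value.items() if c not in key}
        assignments = ", ".join(f"{column} = ?" for column in changed)
        return (
            f"UPDATE {table} SET {assignments} WHERE {where}",
            list(changed.values()) + list(key.values()),
        )
    raise ValueError(f"unsupported operation: {operation!r}")


insert_sql = event_to_sql(
    "dbo.cdc_test", {"id": 1}, {"id": 1, "name": "test1", "changed": 1552064244733}, "c"
)
update_sql = event_to_sql(
    "dbo.cdc_test", {"id": 1}, {"id": 1, "name": "updated-test1-1", "changed": 1552064295845}, "u"
)
delete_sql = event_to_sql("dbo.cdc_test", {"id": 1}, None, "DELETE")
for statement in (insert_sql, update_sql, delete_sql):
    print(statement)
```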

Configuring serialization format for DATE/TIME fields

As you probably noticed, the value of the timestamp field changed is generated as Unix epoch time in milliseconds. You can convert it to a human-readable (and database-readable) format using the TimestampConverter transformation:

"transforms":"unwrap,convert",
 
"transforms.convert.type":"org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.convert.target.type":"string",
"transforms.convert.field":"changed",
"transforms.convert.format":"yyyy-MM-dd HH:mm:ss"

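To see what this conversion does to a value, the sketch below performs the equivalent calculation in Python for the two timestamps from the example events. It assumes the result is rendered in UTC, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def epoch_millis_to_string(millis):
    """Format Unix epoch milliseconds as 'yyyy-MM-dd HH:mm:ss' in UTC."""
    moment = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


inserted_at = epoch_millis_to_string(1552064244733)
updated_at = epoch_millis_to_string(1552064295845)
print(inserted_at)  # 2019-03-08 16:57:24
print(updated_at)   # 2019-03-08 16:58:15
```
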
3. Create a change replication flow

Step 1. Create a source Kafka connection. When creating the connection, select Debezium as the CDC provider. Enter a wildcard topic name in the format database_server_name.database_name.*.

Step 2. Create a JSON format with all default settings.

Step 3. Create a connection to the destination database. Disable auto-commit.

Step 4. Create a new flow by selecting Stream data from queue to database from the gallery.

Step 5. Add a new source-to-destination transformation where:

  • the FROM connection is the connection to Kafka created in step 1.
  • the FROM format is the JSON format created in step 2.
  • the FROM is the topic name for the specific table.
  • the TO is a destination table name.
  • the TO connection is the destination database connection created in step 3.

Step 6. Click the MAPPING button, select the Parameters tab, and configure the following parameters:

  • Stream Data – enabled
  • Use Bind Variables – enabled
  • Create Tables and Adjust Fields – enabled
  • Action – Record
  • Lookup Fields – the unique field(s) for the record. In our example the unique field is ID.

Step 7. If the foreign key constraints are disabled (or do not exist) in the destination database, you can enable processing of each transformation in a parallel thread.

Step 8. Add the source-to-destination transformations for all tables with enabled CDC. You can duplicate the existing transformation and change the topic name (the format is server_name.database_name.table_name), the destination table name, and the lookup field(s).

4. Schedule the flow

Schedule the flow to be executed in real-time.