Etlworks CEO: Efficient data structures and real-time data streaming


Search Business Analytics recently interviewed Etlworks CEO Maksym Sherbinin about the importance of creating efficient data structures and real-time data streaming.

Here are his insights:

Q: To what extent does defaulting to JSON as a data storage mechanism contribute to decreased analytics performance and larger file sizes?

A: There are two factors that contribute to JSON (as well as CSV and XML) not being an optimal format for storing and processing large datasets:

  • The lack of schema enforcement, which results in the same data being ingested and stored multiple times in the data warehouse, has a direct impact on performance, quality of service, and cost.
  • JSON-encoded datasets cannot be split and stored in a parallel manner, which decreases their usefulness for systems built on top of Hadoop and other distributed processing frameworks (a short sketch comparing file sizes follows this list).
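
To make the size and performance point concrete, here is a minimal sketch in Python, using the pandas and pyarrow libraries, that writes the same invented sample dataset as newline-delimited JSON and as compressed Parquet and compares the resulting file sizes. The column names and row count are made up for illustration; real numbers will vary with the data.

    import os
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # An invented sample dataset; in practice this would come from your application.
    n = 1_000_000
    df = pd.DataFrame({
        "event_id": range(n),
        "user_id": [i % 10_000 for i in range(n)],
        "event_type": ["click", "view", "purchase", "view"] * (n // 4),
        "amount": [round(i * 0.01, 2) for i in range(n)],
    })

    # Write the same records as newline-delimited JSON and as compressed Parquet.
    df.to_json("events.json", orient="records", lines=True)
    pq.write_table(pa.Table.from_pandas(df), "events.parquet", compression="snappy")

    # The columnar, compressed Parquet file is typically several times smaller
    # than the text-based JSON encoding of the same rows.
    print("JSON:   ", os.path.getsize("events.json"), "bytes")
    print("Parquet:", os.path.getsize("events.parquet"), "bytes")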

Q: What are the existing alternatives for developers to create more efficient data structures for different kinds of analytics workloads in terms of data/file formats?

A: There are two very distinct alternatives:

  • Column-based databases optimized for data warehousing and massively parallel processing, specifically Snowflake and Amazon Redshift.
  • File formats optimized for use in Hadoop clusters and data streaming: Optimized Row Columnar (ORC), Avro, and Parquet (a short Avro example follows this list).
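
As a rough illustration of what these formats add over plain JSON, here is a minimal sketch using the fastavro Python library; the schema and records are invented for the example. Each record is validated against the declared schema as the file is written, which is exactly the enforcement JSON lacks.

    from fastavro import parse_schema, writer

    # An explicit schema, something plain JSON files do not carry on their own.
    schema = parse_schema({
        "type": "record",
        "name": "Event",
        "fields": [
            {"name": "event_id", "type": "long"},
            {"name": "user_id", "type": "long"},
            {"name": "event_type", "type": "string"},
        ],
    })

    records = [
        {"event_id": 1, "user_id": 42, "event_type": "click"},
        {"event_id": 2, "user_id": 43, "event_type": "view"},
    ]

    # writer() validates each record against the schema while encoding it,
    # so malformed records fail here instead of landing in the warehouse.
    with open("events.avro", "wb") as out:
        writer(out, schema, records, validator=True)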

Q: What is the current state of tools that can automate some of this process (i.e. are there tools that can look at how an app works with data and recommend a better file format or data structure, and perhaps compile or transform it into these more efficient formats like traditional programming language compilers)?

A: When it comes to encoding data in any of the Hadoop-optimized formats, such as Avro and Parquet, there are great open-source libraries available for practically any programming language known to humankind. Converting data stored elsewhere (for example, in a database) to Avro or Parquet datasets and storing them in a distributed file system can be done with home-grown tools or, preferably, a good ETL tool that can connect to any source, then map, transform, and store the data in the Hadoop cluster. Below are some of the any-to-any ETL tools that support various sources, Avro and Parquet as destination formats, and the Hadoop Distributed File System (HDFS) as a destination (a minimal sketch of the conversion itself follows the list):

  • Etlworks
  • Talend
  • Informatica
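
For reference, the conversion described above can be sketched by hand in Python with pandas and pyarrow. The source database, table, column, and HDFS host below are placeholders, and a production pipeline (or one of the ETL tools above) would add incremental loading, error handling, and scheduling; pyarrow's HDFS client also requires the native libhdfs library to be available.

    import sqlite3  # stand-in for any relational source (MySQL, PostgreSQL, etc.)
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyarrow import fs

    # 1. Extract: pull rows from the source database (placeholder database and table).
    conn = sqlite3.connect("source.db")
    df = pd.read_sql_query("SELECT * FROM sales", conn)

    # 2. Transform: light cleanup before encoding (placeholder column name).
    df = df.dropna(subset=["order_id"])

    # 3. Load: write a Parquet file directly into HDFS (placeholder namenode address).
    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)
    pq.write_table(pa.Table.from_pandas(df), "/warehouse/sales/sales.parquet",
                   filesystem=hdfs)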

The other trend is real-time data streaming using distributed platforms such as Apache Kafka. The Alooma enterprise pipeline is a good example of a service that can stream large datasets to analytical platforms with the data encoded as Avro or Parquet.
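
A minimal sketch of that streaming pattern, assuming a local Kafka broker and using the kafka-python and fastavro libraries (the broker address, topic name, and schema are placeholders):

    import io
    from fastavro import parse_schema, schemaless_writer
    from kafka import KafkaProducer

    schema = parse_schema({
        "type": "record",
        "name": "Event",
        "fields": [
            {"name": "event_id", "type": "long"},
            {"name": "event_type", "type": "string"},
        ],
    })

    producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder broker

    def send_event(event):
        # Encode the record as Avro (schema-less, no container header) and publish it.
        buf = io.BytesIO()
        schemaless_writer(buf, schema, event)
        producer.send("events", value=buf.getvalue())  # placeholder topic name

    send_event({"event_id": 1, "event_type": "click"})
    producer.flush()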

Additionally, there is a growing list of tools that can be used to ingest data into column-based data warehouses built for the cloud, such as Snowflake and Amazon Redshift. Depending on your requirements, you might consider the following tools (a sketch of the underlying load statement follows the list):

  • Matillion ETL
  • Etlworks Integrator
  • AWS Glue
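
For comparison, the load these tools orchestrate often boils down to a single COPY statement. Here is a minimal sketch that loads Parquet files from S3 into Amazon Redshift using the psycopg2 driver; the cluster endpoint, credentials, bucket, and IAM role are placeholders.

    import psycopg2

    # Placeholder connection details for a Redshift cluster.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="loader",
        password="change-me",
    )

    # COPY ... FORMAT AS PARQUET lets Redshift load the files from S3 in parallel.
    copy_sql = """
        COPY analytics.sales
        FROM 's3://my-bucket/warehouse/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        FORMAT AS PARQUET;
    """

    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)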

Finally, selecting the best format for the job is not something any tool should do for you automatically, but having multiple options is a must. The ability to experiment and prototype with different formats, storage options, and ingestion techniques (batch load, streaming, SQL, etc.) is why a good ETL tool goes a long way. It saves you time and money and provides peace of mind.


Q: What are your thoughts on the best practices for finding the balance between identifying the most efficient data structures for a given analytics use case, and quickly experimenting with different analytics use cases?

A: This article describes how Uber went from a general-purpose database, to JSON-encoded datasets, to a Hadoop-based big data platform. Granted, not all businesses require something that extreme, so common sense is your best friend. A typical relational database such as MySQL or PostgreSQL is perfectly capable of storing and processing relatively large analytical datasets (millions of records). Datasets with hundreds of millions or billions of records require specialized solutions: either column-based data warehouses or Hadoop-like clusters with a parallel file system.