For many years, ETL daily batch job was the dominant way to perform data transformations before loading in Data Warehouse.

These days requirements are quite different starting with the most important one which is to ensure that new data has to be available for AI/ML and analysis near real time.

Moreover, classical DWH databases are these days partly  obsolete (not capable to scale out, support for various data formats) as well as Hadoop Big Data stack (complex, difficult to maintain, expensive in case of commercial variant, old technology with one exception – Apache Spark), since New SQL databases have jumped in with with resilience, linear scale-out, high availability and support for many data formats.

Lastly, classical ETL tools where based on relational database or logic in application server to perform data transformation which is also not acceptable due to variety of data types/formats and large data volumes.

In this post I’ll show how modern approach should look like.

The pipeline consists of two parts that are based on different technology.

In the first pipeline I’m using StreamSets Data Collector to get data out from Oracle into the Kafka in real-time by using CDC (Change Data Capture) technology.

This is the most efficient way of pooling the data out, since it react only when something has change (DML operation on the table under the CDC).

In the background, StreamSets Data Collector leverages Oracle LogMiner to read the data from the REDO log files, instead of database files, which results in the lowest possible impact on database server which is under the 5% (in most cases impact will be even lower such as less than 1%).

For this demo I have created two tables:


and RETAIL table

Here is how the first pipeline looks like:

StreamSets Data Collector real-time CDC pipeline

More about StreamSets Data Collector you can find on the following page:

Now when we have Data pipeline which replicates data from STORE_DETAILS and RETAIL in a real time, in the second pipeline I’ll use another product called StreamSets Transformer.

Transformer is an execution engine which processes data by sending jobs to existing Apache Spark cluster.

Since one of the main function of Spark engine is to transform large quantities of data, hence the name Transformer.

More about Transformer you can find on the following page:

Main benefit of using Transformer is to create Spark based pipelines very fast, in many cases even without writing a single line of code.

It is not necessary to have extensive knowledge of the low level API in most cases.

In addition there is a monitoring facility, alerting and many other goodies included in Transformer.

I’ll write about Transformer in the future.

For this post I want to show an end-to-end business transformation in action.

On the following picture you can see an example of quite complex Transformer pipeline.

StreamSets Transformer complex pipeline

Pipeline is reading from the Kafka topic which is destination of the first pipeline (Oracle to Kafka), but although pipeline is working in streaming node, Transformer pipeline are not for real-time transformations since the technology (Apache Spark) is for massive data processing which can take a while, it is not for real time processing.

Hence, processing will start as soon as new messages arrive will start to arrive in the Kafka topic.

Besides Kafka topic, I’m also reading data from one Oracle table directly, just to show that it is possible to combine data from different, completely incompatible data sources, and pipeline will their data and execute Spark job to perform various transformations in the background.

In the processing part of the pipeline I’m using many transformations, various calculations (Avg Lead Time), sorting, repartition etc.

As a target I’m using text file for this demo, but it can be database, Kafka, Hive, Cloud or one of the many supported targets.

Please note that if existing components (Stages) are still not good enough to perform all required computations, you have two additional two stages on your disposal in which you can write your own code in Scala or Python for custom processing or to add missing functionality.

This is all far beyond the scope of this post.

As of monitoring, apart of monitoring provided by the Transformer itself

Transformer pipeline monitoring

you can get all the details from the Spark engine about the running pipeline.

The next couple of pictures describes the level of details you can get.

Spark monitoring – completed jobs
Spark execution plan
Spark execution plan – part 2
Spark RDDs
Spark execution metrics
Spark active tasks

In the next blog, I’ll describe how to apply real-time AI/ML in the Transformer pipeline to get the best what today’s technology has to offer.

Get notified when a new post is published!



There are no comments yet. Why not start the discussion?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.