Working with deltas Prague, 11-08-2014
by Tomas Trnka
So, you have a whole ETL chain set up in Keboola Connection:
1) You are extracting data
2) You are transforming data
3) You are merging and saving the final data into one output table
And that can be done either incrementally or as a full batches.
BUT your project sponsor wants just the increments (deltas) as (an on-demand) outputs.
That means - Sponsor wants just the new records that has not been processed; already in the output file.
So.. How to do that?
You will need 3 transformations and 3 tables in total.
1) Original transformation
That transformation writes (exports) the “final” data into the “all” table.
So far, nothing needs to be done!
2) Difference transformation
This transformation compares the “all” table and the “previous” table and creates the “difference” table. (note. for the first time only, you need to create blank “previous” table manually with the same header as “all”)
So once you run this transformation, it loads table “all” and table “previous”. The code then compares the two by left-joining “previous” table to the main “all” table WHERE your primary key is NULL. This results into having just the NEW, not processed rows from table “all” that will be saved/exported into the “difference” table.
example with the full mysql query and custom primary key (`E_ID`)
3) Previous transformation
From step 1) and 2) we have tables: “all”, “previous” and “difference”.
So its ok, we have our delta (“difference”) and can go home, or can`t we?
What if we run the ETL again with updated “all” table with newer data? The “difference” will contain exact copy of “all” because the “previous” table is still empty.
So here comes the “previous” transformation, that takes the content of the “difference” table and writes INCREMENTALLY the same data into “previous” table AND at the same time it empties the difference table.
That means that the “previous” table is still growing and basically “trying to catch” the full “all” table. The “difference” table is either empty (in that case “all” = “previous” == nothing changed) or contains something and in that case the difference between these two tables is sitting in the “difference” table.
The beauty here is in the “idiot proof” concept, because you can run extraction and transformation 1) and 2) as many times you want to and the only thing that would happen is the growing “all” table and growing “difference” table. Without any fear that you accidentally ran the transformation twice or that the orchestrator was initiated multiple times.
So its a good idea to put the third “3) Previous transformation” outside the orchestrator and give client the tokens and API call to run the third transformation “on-demand”.
The grande finale then looks like this:
1) Orchestrator is handling everything: data extraction, transforming them, merging them, saving them. Multiple times, no worries here.
2) Client`s side gets the data by downloading the “difference” table AND calls the API call to run the “3) Previous transformation”
3) The “3) Previous transformation” clears the “difference” and moves the data to “previous” table.
4) Here we go, ready for next round!
*Of course the setting can be modified and for example whole chain including the “3) Previous transformation” can be contained in the orchestrator’s tasks + for example S3 writer so that its us who pushes the data instead of client downloading them.