To create your first transformation and to see how Transformations are an integral part of the Keboola Connection workflow, go to our Getting Started tutorial.
Transformations allow you to manipulate data in your project. They are the tasks you want to perform (Marketing data preaggregation, Tableau denormalizer, Integrity checker, Join marketing channels and sales, etc.), and they are grouped into folders called transformation buckets. Each transformation bucket can contain any number of individual transformations and should represent a logical set of operations you want to perform together.
The schema below shows the important components of the Transformation engine:
A transformation is represented by a Transformation Script (SQL, Python, R, or OpenRefine backend) which you use to manipulate your data. To ensure the safety of the data in Storage, each transformation operates in a completely separate, provisioned workspace created just for it. When a transformation runs, it takes the required data from Storage and copies it to the workspace. This process is called Input Mapping. The transformation engine then runs the Transformation Script in the workspace. Depending on the transformation backend, the input mapping loads the data either into database tables in the workspace (SQL backends) or into CSV files (script backends).
When the transformation is finished, the Output Mapping process moves the transformation results back to the designated tables in Storage. The Input and Output mappings ensure complete safety of the transformation processes – the transformation always operates in an isolated workspace.
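For illustration, here is a minimal sketch of what an SQL (Snowflake) transformation script might look like inside the workspace; all table and column names are hypothetical:

```sql
-- "customers" was created in the workspace by the input mapping
-- (hypothetical table and column names).
CREATE TABLE "active_customers" AS
    SELECT "id", "name", "email"
    FROM "customers"
    WHERE "status" = 'active';
-- The output mapping picks up "active_customers" and writes it
-- back to the designated table in Storage.
```

Note that the script only ever touches the workspace copies; Storage is read at the start and written at the end.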
Each transformation has its own workspace. A transformation sandbox uses the same workspace as the corresponding transformation, while a plain sandbox uses a separate one. Component and analytical workspaces have live access to production data integrated by other data pipeline flows, so experiments and modelling in your data science workspaces always work with fresh data.
No transformation can be created without:
1) Input and Output Mapping — separates the source data from your transformation. The mapping creates a secure workspace with data copied from the tables specified in the input mapping. Database table names and CSV file names in a transformation are completely unrelated to the names of tables in Storage (see the sketch after this list).
2) Transformation Script — SQL, Python, R, or OpenRefine code that defines what happens with the data: it takes the tables from the Input Mapping, modifies them, and produces the tables referenced in the Output Mapping.
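As a quick sketch of the naming decoupling (hypothetical names throughout): if the input mapping copies the Storage table `in.c-main.customers` into the workspace under the name `customers`, the script only ever references the workspace name.

```sql
-- The input mapping copied the Storage table in.c-main.customers
-- into the workspace as "customers"; the script never sees the
-- Storage table ID, only the workspace name chosen in the mapping.
SELECT COUNT(*) FROM "customers";
```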
When writing transformation scripts, keep in mind that the table names referenced by the mappings are automatically quoted by Keboola Connection. This is especially important for Snowflake, which is case-sensitive.
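For example (hypothetical table and column names): because the workspace table is created as a quoted, lower-case identifier, an unquoted reference in Snowflake is folded to upper case and will not match it.

```sql
-- Works: the quoted identifier matches the lower-case table name
-- created by the input mapping.
SELECT "id" FROM "customers";

-- Fails in Snowflake: unquoted identifiers are folded to upper case,
-- so this looks for CUSTOMERS.ID, which does not match "customers"."id".
SELECT id FROM customers;
```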
A backend is the engine running the transformation script. It is either a database server (Redshift, Snowflake, Synapse on Azure Stack), a language interpreter (Julia, Matlab, OpenRefine, Python, R), or an analytics engine (Spark).
How do you decide which backend is appropriate for a task? As a rule of thumb, SQL performs better at joining tables, filtering data, grouping, and simple aggregations. Scripting languages are more suitable for processing data one row at a time, raw data processing, or custom analytical tasks.
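A short sketch of the kind of task where SQL shines (hypothetical table and column names): joining two mapped tables and aggregating the result takes a few lines, whereas the equivalent row-by-row script would be considerably longer.

```sql
-- Join marketing channels to sales and aggregate revenue per channel.
CREATE TABLE "revenue_per_channel" AS
    SELECT c."channel", SUM(s."amount") AS "total_revenue"
    FROM "sales" s
    JOIN "channels" c ON s."channel_id" = c."id"
    GROUP BY c."channel";
```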
Each transformation within a bucket can use a different backend, so every task can be performed with the most suitable tool and programming language. As some tasks are difficult to solve in SQL, feel free to step in with Python and then finish the work in SQL again.
The following backends are currently available:

- Database server — Snowflake, Redshift, and Microsoft Synapse on Azure Stack.
- Language interpreter — Julia, Matlab, OpenRefine, Python, and R.
- Analytics engine — Apache Spark.
Each change in the transformation configuration creates a new version of the whole transformation bucket. You can easily access previous versions of all transformations in a bucket and see what has changed.
You can copy an individual transformation into a transformation bucket (either the same or a different one).
The copied transformation is independent of the original.
You can easily develop Snowflake and Redshift transformations using a Sandbox, a separate database storage. As a safe workspace with the required data, it allows you to run arbitrary SQL scripts on copies of your tables without affecting the data in your Storage or your transformations. For Julia, Python, and R transformations, the Sandbox is an isolated Jupyter notebook instead.
Phases in older projects allow multiple transformation steps to run within a single workspace. Multiple steps with the same input mapping (sharing data) might save a bit of processing time, but everything becomes less clear and less isolated. To save time, run multiple orchestration tasks in parallel instead.
Dependencies allow you to chain transformation steps in older projects. A given transformation is executed only after all of its required steps have finished.