R is designed for advanced statistical computations. Apart from ready-to-use implementations of state-of-the-art algorithms, another great asset of R is its vector and matrix computations. R transformations complement Python and SQL transformations (MySQL or Redshift) where computations or other operations are too difficult. Common data operations such as joining, sorting, and grouping are still easier and faster to do in SQL transformations.
The R script runs in an isolated Docker environment; the current R version is 3.3.2.
The Docker container running the R transformation has 8 GB of allocated memory, and the maximum running time is 6 hours.
The R script itself will be compiled to /data/script.R. To access input and output tables, use relative (in/tables/file.csv, out/tables/file.csv) or absolute (/data/in/tables/file.csv, /data/out/tables/file.csv) paths. To access downloaded files, use the in/user subdirectory.
If you want to dig really deep, have a look at the full Common Interface Specification.
Temporary files can be written to the /tmp/ folder. Do not use the /data/ folder for files you do not wish to exchange with KBC.
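For instance, a scratch file can be created under /tmp/ like this (a minimal sketch; the file name and contents are arbitrary):

```r
# Create a temporary file under /tmp/ so it is not exchanged with KBC
tmp <- tempfile(tmpdir = "/tmp", fileext = ".csv")
write.csv(data.frame(x = 1:3), file = tmp, row.names = FALSE)

# ... work with the file ...

unlink(tmp)  # clean up when done
```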
The R script to be run within our environment must meet the following requirements:
The R transformation can use any package available on CRAN. To install a package, list its name in the package section; the package and its dependencies will then be installed automatically. The package is also loaded automatically, so you do not have to load it with library() (though doing so does no harm). The latest versions of packages are always installed.
Tables from Storage are imported into the R script as CSV files, which can be read by standard R functions.
Generally, a table can be read with the default R settings. In case R gets confused, specify the exact format:
sep=",", quote="\"". For example:
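A minimal sketch, assuming an input table mapped to in/tables/source.csv:

```r
# Read an input table with an explicit separator and quote character
data <- read.csv("in/tables/source.csv", sep = ",", quote = "\"")
```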
Do not use the row index in the output table (row.names=FALSE when writing with write.csv).
The row index produces a new unnamed column in the CSV file, which cannot be imported into Storage.
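For instance (the table name out/tables/result.csv is illustrative):

```r
# Write the output table without the row index column,
# so the CSV can be imported into Storage
write.csv(data, file = "out/tables/result.csv", row.names = FALSE)
```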
We have set up our environment to be a little zealous: all warnings are converted to errors, and they cause the transformation to fail.
If you have a piece of code in your transformation which may emit warnings, and you really want to ignore them, wrap the code in a tryCatch call.
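A minimal sketch of such a wrapper (the coercion below is just an example of warning-emitting code):

```r
# Catch a warning that would otherwise fail the transformation
result <- tryCatch(
    {
        as.numeric("not-a-number")  # emits a "NAs introduced by coercion" warning
    },
    warning = function(w) {
        # Ignore the warning and return a fallback value instead
        NA
    }
)
```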
To develop and debug R transformations, you can replicate the execution environment on your local machine. To do so, you need to have R installed, preferably the same version we use (R 3.3.2). It is also helpful to use an IDE, such as RStudio.
To simulate the input and output mapping, all you need to do is create the right directories with the right files. The following image shows the directory structure:
The script itself is expected to be in the
data directory; its name is arbitrary. It is possible to use relative directories,
so that you can move the script to a KBC transformation with no changes. To develop an R transformation which processes
a sample CSV file locally, take the following steps:
Put the input tables in the in/tables subdirectory of the working directory.
Put the input files in the in/user subdirectory of the working directory, and make sure that their names are without any extension.
Use this sample script:
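A sketch of such a script (the column name "number" is illustrative and the actual script in data.zip may differ; adjust it to your sample data):

```r
# Read the sample input table
data <- read.csv("in/tables/source.csv")

# A trivial computation: add a column with doubled values
# (assumes the input has a numeric column named "number")
data$double_number <- data$number * 2

# Write the result without the row index
write.csv(data, file = "out/tables/result.csv", row.names = FALSE)
```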
A complete example of the above is attached below in data.zip. Download it and test the script in your local R installation. The
result.csv output file will be created. This script can be used in your transformations without any modifications.
All you need to do is
set the input mapping to source.csv (expected by the R script), and
set the output mapping from result.csv (produced by the R script) to a new table in your Storage.
The above steps are usually sufficient for daily development and debugging of moderately complex R transformations, although they do not reproduce the transformation execution environment exactly. To create a development environment with the exact same configuration as the transformation environment, use our Docker image.
There are more in-depth examples dealing with