To create your first Sandbox and see how it is an integral part of the KBC workflow, go to our Getting Started tutorial.
A sandbox is a safe environment for you to
You can fill a sandbox with any data from Storage. However, to simplify transformation development, KBC provides a specific loader for your transformations. It automatically fills the sandbox with relevant tables and takes the Input Mapping of the transformation into account.
Each user has one sandbox per project and backend (Snowflake, Redshift, RStudio for R and Jupyter for Python) to their disposal. Sandboxes with different backends are very similar; but there are few specifics, mostly related to access management — see below.
Important: The backend of the sandbox does not have to match the backend of the original data. For example, you can load data from Snowflake into a Redshift sandbox.
Also, your sandbox might be deleted after 14 days (5 days for RStudio and Jupyter) unless extended; make sure not to use it as a permanent data storage!
Data can be loaded into your sandbox in two different ways:
If you, while developing your transformation, need to add data to the tables already specified in the input mapping, the two ways can be combined. However, in that case, perform the plain load after the transformation load, because the transformation load deletes all data in the sandbox.
Go to the Transformations section and click the Sandbox button in the upper right corner.
If no sandbox exists, create it first by clicking Create Sandbox for the desired backend:
Then click the Load data button to load tables or whole buckets into the sandbox. You can limit the number of rows that are loaded; this is useful for sampling a large table.
The imported tables will have full tables names — including both a bucket and table name.
By default, the sandbox content is deleted before loading new data. Use the Preserve existing data option to keep its content and add new data to it (this option is not available for RStudio and Jupyter sandboxes yet).
This type of loading is intended for gradual development and debugging of your transformation code. Data loaded through the transformation loading is tied to a specific transformation bucket and input mapping. Only data specified in the input mapping is loaded into your transformation sandbox. You can choose whether the transformation itself is performed on the load or not.
To create a sandbox for a specific transformation, go to the Transformations section and select the respective transformation:
The Sandbox backend is defined by the transformation backend. In the transformation detail, click the Create Sandbox button:
Clicking the Create button will get you the connection credentials:
Choose how the data will be loaded and processed:
Once the sandbox is ready, you will get a notification. Or, watch the progress on the Jobs page.
Important: Transformation loading always deletes the contents of the sandbox first.
Except loading data, sandbox supports several other basic actions. To access them, go to the Transformations section and click the Sandbox button at the top.
In the same place you can also see the sandbox connection credentials. To copy & paste individual values, use the copy icon:
Even though sandboxes with different backends are very similar, let’s take a look at a few specifics that are mostly related to access management.
Credentials for Redshift sandboxes are not shared between projects. Each project, including the sandbox, sits on a different cluster. Therefore, all your projects have their own set of credentials.
Almost any PostgreSQL client can connect to an AWS Redshift cluster. We have tested Navicat and DBeaver (free), and they work fine. You can use both a Redshift driver and a PostgreSQL driver.
If using the PostgreSQL driver, do not forget to change the connection port to 5439. For an SSL secure connection, follow this guide. To establish a secure connection to a Redshift sandbox, follow the official instructions from Amazon.
In a Redshift sandbox, you have native access to all Redshift buckets in the project.
You can easily access a table in Storage using schema/bucket namespacing.
SELECT * FROM "in.c-main"."mytable".
Use double quotes, as the schema (= bucket) name always contains a dot.
We do not recommend working with Storage tables directly in your SQL code. Always use the input mapping to bring tables to your schema. This adds another level of security and features to your transformation.
A Redshift sandbox always uses the latest Redshift version available on the cluster.
Important: Currently, this feature is in beta.
Depending on the size of the input tables, the sandbox creation may take some time. You can review its progress in Jobs. When the sandbox is created and you connect to it, you will be taken to a web version of RStudio. The loaded tables will be available at the same locations as in R Transformations. The R version is also the same as in R Transformations.
Note: Although it is possible to upload files directly into the RStudio sandbox, we highly recommend that you use only input mapping to load data into the sandbox. It is a more reliable and traceable method of loading data.
The RStudio Sandbox has the following limitations:
When your R script exceeds the memory limit, it may freeze or crash with the following message:
The previous R session was abnormally terminated due to an unexpected crash.
If it freezes, use the Stop button to stop it:
You can then restart the R session from the menu Session – Restart R. To verify that your script is crashing
on the amount of used memory, you can use the
Important: Currently, this feature is in beta.
The Python environment for Jupyter Notebook is the same as in Python Transformations. To list the available table data files, use the Jupyter file browser or scripts such as this one:
The Jupyter Notebook Sandbox has the following limitations:
When your Python script exceeds the memory limit, it may crash with the following message:
The kernel appears to have died. It will restart automatically.
You’ll then see Dead kernel warning in the top right corner:
If the warning does not go away, you can then restart the kernel manually from the menu Kernel – Restart.
To verify that your script is crashing on the amount of used memory, you can use the
For Jupyter and RStudio sandboxes, code templates can be defined. They can be set for a given user or for the entire project. A Jupyter template is a notebook file (.ipynb). An RStudio template is a simple text file. If a sandbox is loaded from a transformation, the transformation code will be appended after the template code.
To create a project template, go to Storage – Files and upload the template with
_r_sandbox_template_. Don’t forget to mark the file
as permanent. Before uploading the template, make sure it doesn’t contain any sensitive data
(Kernel – Restart & Clear Output might be a good idea). The file name is arbitrary.
To create a user template, upload a file with the tag
_r_sandbox_template_ and add another tag with the value of
the user email.
If there are multiple templates, the newest one is used. A user template always takes precedence over the project template for the given user. If you have both a project and a user template for a given user and you update the project template, the user won’t see the update until their user template is removed.