If you operate Keboola in Bring-Your-Own-Database (BYODB) mode using your own data warehouse, the data in the warehouse won’t automatically be visible or accessible within Keboola. To address this, we offer the External Datasets feature.
The implementation of External Datasets requires the BYODB to be enabled first. Unless specified otherwise, this description refers to the implementation of Snowflake and BigQuery.
Storage in Keboola is organized into buckets. An external dataset is a special type of bucket wherein Keboola does not manage its content. It can be located anywhere in the storage backend used by your Keboola project (Snowflake or BigQuery) and is a virtual bucket connected to a Snowflake schema or BigQuery dataset, respectively.
All table-like objects (such as tables, views, and external tables) inside the schema (in Snowflake) or dataset (in BigQuery) are mapped to tables in the bucket. Access to the bucket is read-only; you cannot write to the bucket from Keboola. A single schema can be registered simultaneously with multiple projects in Keboola.
An external dataset can be registered in the Storage of a project. Go to Storage > Register External Dataset. The dialog will differ based on the backend you are using.
Fill in the name of the new bucket, database name, and schema name. Click Next Step. Keboola will then generate a code that you can use to grant Keboola correct access to the schema in your Snowflake. Once access has been granted, click Register Dataset to start using it.
Note: This set of permissions grants the Keboola service account read-only access to the data.
Fill in the name of the new dataset and dataset name. Click Next Step. Keboola will generate a code that you can use to grant Keboola correct access to the dataset in BigQuery. Once access has been grated, click Register Dataset to start using it.
Note: By adding the Keboola service account as a subscriber, you enable read-only access to the data.
Keboola generaly does not support external tables, except for BigLake tables. Please ensure that any table you are using is of this type. External tables of other types will not work in transformations and workspaces due to permission issues.
Please ensure that you can perform a SELECT * FROM <table> LIMIT 1
query on your created BigLake table. Keboola checks this during the registration process.
If the SELECT
fails, the table will be skipped.
The only exceptions are tables configured with require_hive_partition_filter=true
. Such configurations of BigLake tables are supported by Keboola, but
SELECT
operations (like Data Preview) will fail. This is expected behavior. You can still use these tables in your workspaces and transformations, but
appropriate WHERE
conditions are necessary.
When you register an external dataset, we analyze the metadata of the objects in it and register all tables and views as tables in the Keboola Storage bucket. If you later add additional tables or views, you must manually refresh the Storage bucket using the Refresh action in the bucket detail to make them visible in Keboola.
External datasets cannot be used in an input mapping They are accessible via the read-only input mapping. Keep in mind that external buckets cannot be used in an output mapping as they are not writable.
External datasets cannot be used in an input mapping as they are not copied into your transformation workspace. You need to reference them in your transformation using a fully qualified name.
In the following example, it is assumed that you’ve created an external dataset called users-reporting
that references the sales_emea
schema in the database
REPORTING
. The schema contains a table called users
. Now you want to create a new table MQL_USERS
that contains only users sourced from
marketing-qualified leads. You can do that using the following SQL:
CREATE TABLE "MQL_USERS" AS SELECT *
FROM "REPORTING"."sales_emea"."users"
WHERE "source" = 'mql';
Note: The query uses the fully qualified name of the table in the FROM
clause.
For BigQuery, an external dataset is mapped to an actual dataset, users_reporting
(the name you filled in the dialog), in your project—in this case, project sapi-9752
. You can reference the contents of the dataset in your SQL transformation using a fully qualified name:
CREATE TABLE `MQL_USERS` AS SELECT *
FROM `users_reporting`.`users`
WHERE `source` = "mql";
Note: The dataset name is the name of the bucket you provided in the dialog (users-reporting
), not the original dataset created in BigQuery
(sales_emea
). However, there are no technical limitations; they can have the same name.
It is possible to share a Snowflake external bucket using the same process as any other Storage bucket. Once the bucket is shared, the refresh operation is only available in the source project. Currently, it is possible to share entire buckets, not specific tables within them.
Note: At this moment, it is not possible to share BigQuery or BigLake datasets. If this is a relevant use case for you, please create a support ticket.
Removing an external dataset is as simple as removing any other Storage bucket. Simply delete it in the UI or via API. The Storage bucket will be removed from the project, but the schema in the database will remain untouched. Any rights that you have granted to Keboola during the registration will be revoked.
If you wish to remove the schema, you must do so manually in your warehouse.
GRANT SELECT ON FUTURE TABLES IN SCHEMA "REPORTING"."sales_schema" TO ROLE KEBOOLA_8_RO