DuckDB is an in-process analytical database designed for fast SQL analytics. It brings several advantages to Keboola transformations:
Note: DuckDB Transformation is currently in BETA. Breaking changes may occur.
To create a new DuckDB transformation, click New Transformation in the Transformations section and select DuckDB Transformation.

Name your transformation, optionally add a description and folder, and click Create Transformation.

The configuration page allows you to set up input/output mappings, write SQL queries, and configure transformation settings.

On the right side panel, you can configure:
You can select the DuckDB version used to run the transformation. Use latest (default) to always run on the most
recent supported version, or pin to a specific version (e.g., 1.5.2, 1.4.4) for stability. Each supported version
runs in its own isolated environment.

DuckDB transformations use block-based orchestration for organizing and executing SQL code:
This means you can organize your transformation into logical blocks and let the system handle parallel execution where possible.
DuckDB transformations provide four sync actions for debugging and visualization without running the full transformation:
syntax_check) — validates your SQL syntax without executing any queries. Useful for catching errors before running the transformation.lineage_visualization) — generates a markdown diagram of data dependencies, showing how tables flow through your transformation.execution_plan_visualization) — shows the planned execution order with blocks and batches, illustrating how the automatic DAG organizes your queries.expected_input_tables) — displays the list of input tables that the transformation expects based on the SQL analysis.These actions are available from the transformation configuration page and are helpful for understanding and debugging complex transformations.
You can change the backend size to allocate more memory for your transformation. The following sizes are available:
| Backend Size | Memory | Recommended For |
|---|---|---|
| XSmall | 8 GB | Small datasets, testing |
| Small (default) | 16 GB | Most use cases |
| Medium | 32 GB | Large datasets (5 GB+) |
| Large | 113.6 GB | Very large datasets (10 GB+) |
Start with the Small backend and scale up as needed based on your dataset size and query complexity.
Note: Dynamic backends are not available if you are on the Free Plan (Pay As You Go).
DuckDB automatically detects the available CPU and memory resources. You can also manually configure resource limits using the threads and max_memory_mb parameters in the transformation configuration.
By default, input tables are loaded as CSV files. You can enable Parquet format for significantly better performance, especially with larger datasets.
Advantages of Parquet:
Recommendation: Always use Parquet for datasets larger than 1 GB.
To enable Parquet, toggle the Use parquet for input tables option in the transformation settings.
When working with non-typed (string-based) Storage tables, you can enable the Infer input table data types option. This feature instructs DuckDB to infer the actual data types of the input columns, so you can work with numeric, date, and boolean types directly in your SQL queries without manual casting.

Why is this useful?
Keboola Storage tables can be non-typed (all columns stored as VARCHAR). Without type inference enabled,
all values in input tables are treated as strings, and functions like SUM() will fail because they expect numeric types.

With Infer input table data types enabled, DuckDB automatically detects the correct types (e.g., INTEGER, FLOAT, DATE),
so aggregate functions and type-specific operations work as expected.

The output table then contains properly typed columns:

To create a simple DuckDB transformation, follow these steps:
sample (as expected by the DuckDB script).CREATE TABLE "output" AS
SELECT "order_date", SUM("order_amount") AS "sum_orders_amount"
FROM "sample"
GROUP BY "order_date";
You can organize the script into blocks.
Each SQL statement in a DuckDB transformation must be terminated with a semicolon (;). If you have multiple statements
in a single script, make sure they are properly separated:
-- Correct: each statement ends with a semicolon
CREATE TABLE "output_a" AS SELECT * FROM "input_a";
CREATE TABLE "output_b" AS SELECT * FROM "input_b";Missing semicolons will cause syntax errors.
DuckDB handles identifier case differently than Snowflake:
Table names:
SELECT * FROM MyTable references mytable).SELECT * FROM "MyTable" references exactly MyTable).Column names:
SELECT columnName and SELECT ColumnName refer to different columns).This is different from Snowflake, where unquoted identifiers become uppercase.
Best practices:
"TaBlE-stage".Filter and project early — apply WHERE clauses as close to the source table as possible and select only the columns you need.
This reduces the amount of data DuckDB needs to scan.
-- Good: filter and project at the source
SELECT id, name, price
FROM products
WHERE category = 'electronics' AND price > 100;Use EXPLAIN for performance analysis — prefix your query with EXPLAIN to see the execution plan and identify expensive operations.
EXPLAIN SELECT product_category, SUM(price) AS total_revenue
FROM sales
WHERE sale_date >= '2023-01-01'
GROUP BY product_category
ORDER BY total_revenue DESC;DuckDB provides several quality-of-life SQL extensions that simplify common patterns:
GROUP BY ALL — automatically groups by all non-aggregated columns:
SELECT product, category, SUM(sales)
FROM orders
GROUP BY ALL;EXCLUDE — select all columns except specific ones:
SELECT * EXCLUDE (password, ssn, credit_card)
FROM users;ASOF JOIN — useful for time-series data where timestamps do not match exactly:
SELECT
s.player_id,
s.score,
s.score_time,
w.temperature,
w.conditions
FROM scores s
ASOF JOIN weather w
ON s.score_time >= w.timestamp;SUMMARIZE — quick data profiling with min, max, null percentage, and unique counts:
SUMMARIZE SELECT * FROM my_table;Keboola Storage tables store data in character types by default. When Infer input table data types is disabled,
all columns are loaded as VARCHAR. You need to cast values explicitly:
CREATE TABLE "result" AS
SELECT
CAST("amount" AS DECIMAL) AS "amount",
CAST("created_at" AS TIMESTAMP) AS "created_at"
FROM "source";When Infer input table data types is enabled, DuckDB automatically infers the correct types and you can use them directly.
For datasets larger than 10 GB, configure DuckDB to use on-disk processing with PRAGMA settings:
PRAGMA memory_limit='8GB';
PRAGMA temp_directory='/tmp/duckdb_temp';
PRAGMA threads=4;
PRAGMA enable_object_cache;stg_customers, fact_orders, dim_products).DuckDB is an OLAP (Online Analytical Processing) database optimized for SELECT statements and analytical queries.
Avoid workflows with frequent INSERT and UPDATE operations. For transactional workloads, use a different backend such as Snowflake.
The following example shows a typical DuckDB transformation processing CRM data (e.g., from HubSpot). It demonstrates
common patterns: TRY_CAST for safe type conversion, NULLIF for handling empty strings, and :: for type casting.
/* companies */
CREATE TABLE "out_companies" AS
SELECT
"companyId",
"name",
"website",
TRY_CAST(NULLIF("createdate", '') AS DATE) AS "createdate",
"isDeleted"::BOOLEAN AS "isDeleted"
FROM "companies";
/* contacts */
CREATE TABLE "out_contacts" AS
SELECT
"canonical_vid",
"firstname",
"lastname",
"email",
TRY_CAST(NULLIF("createdate", '') AS DATE) AS "createdate",
"hs_analytics_source" AS "email_source",
"associatedcompanyid",
"lifecyclestage"
FROM "contacts";
/* deals */
CREATE TABLE "out_deals" AS
SELECT
"dealId",
"isDeleted"::BOOLEAN AS "isDeleted",
"dealname",
TRY_CAST(NULLIF("createdate", '') AS DATE) AS "createdate",
TRY_CAST(NULLIF("closedate", '') AS DATE) AS "closedate",
"dealtype",
TRY_CAST(NULLIF("amount", '') AS DOUBLE) AS "amount",
"pipeline",
"dealstage",
"hubspot_owner_id",
"hs_analytics_source"
FROM "deals";
/* pipeline stages */
CREATE TABLE "out_stages" AS
SELECT
"stageId",
"label",
TRY_CAST(NULLIF("displayOrder", '') AS INT) AS "displayOrder",
TRY_CAST(NULLIF("probability", '') AS DOUBLE) AS "probability",
"closedWon"::BOOLEAN AS "closedWon"
FROM "pipeline_stages";Key patterns used:
TRY_CAST(NULLIF("column", '') AS TYPE) — safely converts empty strings to NULL before casting. This avoids errors when the source data contains empty values."column"::BOOLEAN — shorthand type cast syntax.;) — required when multiple statements are in a single script.Choose DuckDB for:
Choose Snowflake for:
If you are migrating existing Snowflake transformations to DuckDB, see the detailed Snowflake to DuckDB Migration Guide.