Data Types

Some components, especially data source connectors, store metadata about the table columns. For example, when a DB data source connector loads a table from a source database, it also records the physical column types from that table. These are stored with each table column and can be used later when working with the table. For instance, transformation COPY mapping allows you to set data types for the tables inside the transformations. Also, some data destination connectors, e.g., the Snowflake data destination connector, use the table metadata to pre-fill the table columns configuration for you.

Even if a data type is available for a column, Storage internally creates all table columns as text and not null, and nullable values are converted to empty strings (except for Exasol, where everything is null). Keep this in mind especially in transformations, where the output is always cast to text; this behavior can be changed with the native data types feature. The non-text column type is used only during the execution of a component (a transformation or a data destination connector). The idea behind this is that a text type has the best interoperability, which averts many issues (e.g., some date values stored in a MySQL database might not be accepted by a Snowflake database and vice versa).
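
For example, in a Snowflake SQL transformation reading a non-typed table, a numeric column arrives as text and must be cast before doing arithmetic. A minimal sketch, assuming a hypothetical table orders with an amount column:

-- "orders" and "amount" are hypothetical; without native types the column arrives as VARCHAR,
-- and missing values are empty strings, so TRY_CAST turns them into NULLs instead of failing
SELECT SUM(TRY_CAST("amount" AS NUMBER(12,2))) AS "total_amount"
FROM "orders";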

Base Types

Source data types are mapped to a destination using a base type. The current base types are STRING, INTEGER, NUMERIC, FLOAT, BOOLEAN, DATE, and TIMESTAMP. This means that, for example, a MySQL data source connector may record BIGINT as a column type; that type maps to the INTEGER base type. When the Snowflake data destination connector consumes this value, it reads the base type INTEGER and chooses a corresponding Snowflake type, which happens to also be INTEGER. This ensures high interoperability between the components. See the conversion tables below.

View the extracted data types in the storage table detail:

Screenshot - View Column Data Type

You can also override the data type:

Screenshot - Set Column Data Type

When you use the table (e.g., in the Snowflake data destination connector), you’ll see the data type you have configured:

Screenshot - Set Column Data Type

Note that the column type setting is, in all cases, only a metadata setting. It does not affect the actual stored data. The data is converted only when writing or copying (e.g., to a transformation or a data destination connector). This means that you can extract an integer column, mark it as a timestamp in Storage, and write it as an integer into a target database (though you’ll be offered the option to write it as a timestamp).

You can access both the source and base type metadata through the corresponding API.

Data Type Conversions

As described above, the source data type is converted to a base data type stored in metadata storage. The base type is then converted to the target data type. The following tables show the mappings for each base type. The mapping can cause information loss (e.g., mapping SMALLINT to INTEGER). To minimize this, we also keep track of the data type size and transfer it where possible. For example, a SMALLINT column is stored as base type INTEGER with size 2. If the target database supports integer sizes, you will be offered INTEGER(2) as the type in the target database.

STRING

The STRING base type represents any textual type; both CHARACTER VARYING (or VARCHAR) and TEXT types are included. The STRING base type is also used for any other unrecognized type on input, so the source types listed below are not exhaustive — they are the common string types converted to a string, and all other unknown types are converted to a string as well.

Source types mapped to the STRING base type:
  • Generic: char, character varying, text, varchar
  • MySQL: CHAR, TEXT, VARCHAR
  • Redshift: BPCHAR, CHAR, CHARACTER, CHARACTER VARYING, NCHAR, NVARCHAR, TEXT, VARCHAR
  • Snowflake: BINARY, CHAR, CHARACTER, STRING, TEXT, VARBINARY, VARCHAR
  • Synapse: BINARY, CHAR, NCHAR, NVARCHAR, VARBINARY, VARCHAR

Target types for the STRING base type:
  • Generic: N/A
  • Hive: STRING
  • Impala: STRING
  • MS SQL Server: TEXT
  • MySQL: VARCHAR
  • Oracle: VARCHAR2
  • PostgreSQL: VARCHAR
  • Redshift: VARCHAR
  • SiSense: VARCHAR
  • Snowflake: VARCHAR
  • Synapse: NVARCHAR
  • Thoughtspot: VARCHAR

INTEGER

The INTEGER base type represents data types for whole numbers.

Source types mapped to the INTEGER base type:
  • Generic: bigint, bigserial, mediumint, smallint, int, int2, int4, int64, int8, integer, serial8, tinyint
  • MySQL: BIGINT, INT, INTEGER, MEDIUMINT, SMALLINT, TINYINT
  • Redshift: BIGINT, INT, INT2, INT4, INT8, INTEGER, SMALLINT
  • Snowflake: BIGINT, BYTEINT, INT, INTEGER, SMALLINT, TINYINT
  • Synapse: BIGINT, INT, SMALLINT, TINYINT

Target types for the INTEGER base type:
  • Hive: INT
  • Impala: INT
  • MS SQL Server: BIGINT
  • MySQL: INTEGER
  • Oracle: N/A
  • PostgreSQL: INTEGER
  • Redshift: INTEGER
  • SiSense: BIGINT
  • Snowflake: INTEGER
  • Synapse: INT
  • Thoughtspot: INT

NUMERIC

The NUMERIC base type represents fixed-point fractional numbers (real, numeric or decimal data types).

Source types mapped to the NUMERIC base type:
  • Generic: dec, decimal, fixed, money, number, numeric, smallmoney
  • MySQL: DEC, DECIMAL, FIXED, NUMERIC
  • Redshift: DECIMAL, NUMERIC
  • Snowflake: DECIMAL, NUMBER, NUMERIC
  • Synapse: NUMERIC, DECIMAL

Target types for the NUMERIC base type:
  • Hive: DECIMAL
  • Impala: DECIMAL
  • MS SQL Server: DECIMAL
  • MySQL: NUMERIC
  • Oracle: NUMBER
  • PostgreSQL: NUMERIC
  • Redshift: NUMERIC
  • SiSense: DECIMAL
  • Snowflake: NUMBER
  • Synapse: NUMERIC
  • Thoughtspot: N/A

FLOAT

The FLOAT base type represents floating-point fractional numbers (float or double data types).

Source types mapped to the FLOAT base type:
  • Generic: binary_double, binary_float, double, double precision, d_float, float, float4, float8, quad, real
  • MySQL: DOUBLE, DOUBLE PRECISION, FLOAT, REAL
  • Redshift: DOUBLE PRECISION, FLOAT, FLOAT4, FLOAT8, REAL
  • Snowflake: DOUBLE PRECISION, FLOAT, FLOAT4, FLOAT8, REAL
  • Synapse: FLOAT, REAL

Target types for the FLOAT base type:
  • Hive: FLOAT
  • Impala: FLOAT
  • MS SQL Server: FLOAT
  • MySQL: FLOAT
  • Oracle: N/A
  • PostgreSQL: REAL
  • Redshift: FLOAT
  • SiSense: FLOAT
  • Snowflake: DOUBLE
  • Synapse: FLOAT
  • Thoughtspot: FLOAT

BOOLEAN

The BOOLEAN base type represents a true or false value.

Source types mapped to the BOOLEAN base type:
  • Generic: bool, boolean
  • Redshift: BOOL, BOOLEAN
  • Snowflake: BOOLEAN
  • Synapse: BIT

Target types for the BOOLEAN base type:
  • Hive: BOOLEAN
  • Impala: BOOLEAN
  • MS SQL Server: BIT
  • MySQL: N/A
  • Oracle: N/A
  • PostgreSQL: BOOLEAN
  • Redshift: BOOLEAN
  • SiSense: BIT
  • Snowflake: BOOLEAN
  • Synapse: BIT
  • Thoughtspot: BOOL

DATE

The DATE base type represents a date value without a time portion.

Source types mapped to the DATE base type:
  • Generic: date
  • MySQL: DATE
  • Redshift: DATE
  • Snowflake: DATE
  • Synapse: DATE

Target types for the DATE base type:
  • Generic: DATE
  • Hive: N/A
  • Impala: N/A
  • MS SQL Server: DATE
  • MySQL: DATE
  • Oracle: DATE
  • PostgreSQL: DATE
  • Redshift: DATE
  • SiSense: DATE
  • Snowflake: DATE
  • Synapse: DATE
  • Thoughtspot: DATE

TIMESTAMP

The TIMESTAMP base type represents a date value with a time portion.

Source types mapped to the TIMESTAMP base type:
  • Generic: datetime, datetime2, datetimeoffset, smalldatetime, timestamp, timestamptz, timestamp_LTZ, timestamp_NTZ, TIMESTAMP_TZ, timestamp with local time zone, timestamp with time zone, timestamp without time zone
  • MySQL: DATETIME, TIMESTAMP
  • Redshift: TIMESTAMP, TIMESTAMPTZ, TIMESTAMP WITH TIME ZONE, TIMESTAMP WITHOUT TIME ZONE
  • Snowflake: DATETIME, TIMESTAMP, TIMESTAMP_NTZ, TIMESTAMP_LTZ, TIMESTAMP_TZ
  • Synapse: DATETIMEOFFSET, DATETIME, DATETIME2, SMALLDATETIME, TIME

Target types for the TIMESTAMP base type:
  • Hive: TIMESTAMP
  • Impala: TIMESTAMP
  • MS SQL Server: DATETIME2
  • MySQL: TIMESTAMP
  • Oracle: TIMESTAMP
  • PostgreSQL: TIMESTAMP
  • Redshift: TIMESTAMP
  • SiSense: N/A
  • Snowflake: TIMESTAMP
  • Synapse: DATETIMEOFFSET
  • Thoughtspot: TIMESTAMP

Native Data Types

Specific behavior depends on the backend of your project. We’ll be using the Snowflake backend as an example.

As mentioned above, Keboola stores data in Storage as text (VARCHAR NOT NULL) by default. With native types, data is stored in columns with an actual data type (DATETIME, BOOLEAN, DOUBLE, etc.) based on Keboola metadata.

Tables with native data types are labeled in the user interface with a badge:

Screenshot - Table with native datatypes

How to Create a Typed Table

Manually via an API

A table with a type definition is created using the tables-definition endpoint, and data is then loaded into it. The data types used in this endpoint have to correspond to the storage backend your project uses. Alternatively, you can use base types.
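
For illustration, a request body for this endpoint might look roughly like the sketch below. The table and column names are made up, the primaryKeysNames field is an assumption, and the column definition format is the same one shown later in this document; verify the exact request schema against the Storage API documentation.

{
  "name": "typed_table",
  "primaryKeysNames": ["id"],
  "columns": [
    {
      "name": "id",
      "definition": {
        "type": "NUMBER",
        "length": "38,0",
        "nullable": false
      }
    },
    {
      "name": "created_at",
      "definition": {
        "type": "TIMESTAMP_NTZ"
      }
    }
  ]
}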

Output mapping of a component

A component may provide information about column data types in its data manifest. Database data source connectors and transformations matching the storage backend (e.g., Snowflake SQL transformation on the Snowflake storage backend) will create storage tables with the same types. The database data source connectors and transformations that do NOT match the backend will create storage tables using base types.
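
For illustration only, a component might hint column types through column metadata in its output table manifest along these lines. This is a rough sketch with made-up bucket, table, and column names; the exact manifest format and metadata keys are defined in the Keboola developer documentation.

{
  "destination": "out.c-main.typed_table",
  "column_metadata": {
    "id": [
      { "key": "KBC.datatype.basetype", "value": "INTEGER" }
    ],
    "created_at": [
      { "key": "KBC.datatype.basetype", "value": "TIMESTAMP" }
    ]
  }
}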

Note: When a table is created from base types, it defaults to the lengths and precisions specific to each storage backend. For instance, in Snowflake, the NUMERIC base type is created as NUMBER(38,9), which may be unexpected if the source database column is NUMBER(10,2).

To avoid this limitation, you can manually create the table in advance using the Table Definition API with the correct precisions. When subsequent jobs write data to this table, they will respect your definition as long as it matches. Keep this in mind when dropping and recreating tables: if a job creates the table, it will again default to the type derived from the base type, which may not be what you want.
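
For instance, pre-creating the column with the exact precision (using the same column definition format shown below in How to Define Data Types; the column name is illustrative) could look like this:

{
  "name": "amount",
  "definition": {
    "type": "NUMBER",
    "length": "10,2",
    "nullable": false
  }
}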

For example, here’s how to create typed tables in a Snowflake SQL transformation, ensuring they are imported to Storage as typed tables:

-- create a table with datatypes
CREATE OR REPLACE TABLE "typed_table" (
    "id" NUMBER,
    "name" VARCHAR(255),
    "created_at" TIMESTAMP_NTZ
);

-- insert some data
INSERT INTO "typed_table"
VALUES
    (1, '75', '2020-01-01 00:00:00');

-- create another table with datatypes based on an existing table
CREATE OR REPLACE TABLE "typed_table_2" AS
SELECT
    "id"::varchar AS "string_id",
    "name"::number AS "numeric_name",
    "created_at"::date AS "typed_date"
FROM
    "typed_table";

Note: The data type hinting is the components’ responsibility, so components must be updated by their respective authors to support this. The database data source connectors that are maintained by Keboola already provide data types. There is no list of components that support this feature. You may check the component’s documentation to see if it supports native data types.

How to Define Data Types

Using actual data types of the storage backend

For example, in the case of Snowflake, you can create a column of type TIMESTAMP_NTZ or DECIMAL(20,2). This allows you to specify all the data type details, for instance, including precision and scale. But it’s tied to the specific storage backend, and thus it’s not portable.

An example of such a column definition in a tables-definition API endpoint call is as follows:

{
  "name": "id",
  "definition": {
    "type": "DECIMAL",
    "length": "20,2",
    "nullable": false,
    "default": "999"
  }
}

Using Keboola-provided base types

Specifying native types using base types is ideal for component-provided types as they are storage backend agnostic. However, they can be used with the tables-definition API endpoint as well. The definition is as follows:

{
  "name": "id",
  "basetype": "NUMERIC"
}

Changing Types of Existing Typed Columns

You can’t change the type of a column in a typed table once it has been created. There are multiple ways to work around this.

First, if the table is loaded using full load, you can drop the table, create a new table with the correct types, and load the data there.

If the table is loaded incrementally, you must create a new column and copy the data from the old one. For example:

  • You have a column date of type VARCHAR in a typed table, and you want to change it to TIMESTAMP.
  • You first add a new column date_timestamp of type TIMESTAMP to the table.
  • Then you change all the jobs filling the table to fill the new and old columns.
  • Then you run an ad-hoc transformation, which will copy data from date to date_timestamp in the existing rows (see the sketch after this list).
  • Then you can slowly change all the places where date is used to use date_timestamp instead.
  • Once only the new column is used, the old one can be removed.
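
The ad-hoc copy can be a simple Snowflake SQL transformation along these lines. This is a minimal sketch with hypothetical table and column names; the table is read via the input mapping and written back with an incremental output mapping.

-- backfill the new typed column from the old text column; names are illustrative
UPDATE "my_table"
SET "date_timestamp" = TRY_TO_TIMESTAMP("date")
WHERE "date_timestamp" IS NULL;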

In both cases, check all the other configurations using the table to avoid any schema mismatch. This is especially important for data destination connectors, where a table exists in the destination.

Incremental Loading

When you load data incrementally, there is a difference between typed and non-typed tables. Typed tables only compare the columns of the table’s primary key, while non-typed tables compare the whole row, updating rows where any value changes. This is described in detail in our documentation on incremental loading.

Handling NULLs

Columns without native types are always VARCHAR NOT NULL, so you don’t need to worry about specific NULL behavior. This changes with typed columns. In most databases, NULL does not equal NULL (NULL = NULL evaluates to NULL, not TRUE). This breaks the incremental loading flow, where columns are compared against each other.

For this reason, please make sure that your primary key columns are not nullable. This is most relevant in CTAS queries, where columns are nullable by default. To work around this, specify the columns as part of the CTAS expression. For example:

CREATE TABLE "ctas_table" (
    "id" NUMBER NOT NULL,
    "name" VARCHAR(255) NOT NULL,
    "created_at" TIMESTAMP_NTZ NOT NULL
) AS SELECT * FROM "typed_table";

Pros and Cons

  • Pros
    • Loading into a workspace is significantly faster than loading into a table without native data types. Casting data is not necessary when loading into a workspace.
    • A table accessed in a workspace via the read-only input mapping already has typed columns.
    • Data types are strictly enforced, ensuring that data in a specific column (like a number column) is consistent with its type.
  • Cons
    • Changing a column type is complicated; see Changing Types of Existing Typed Columns above.
    • Keboola does not perform any type conversion during loading. Your data must exactly match the column type in the table in Storage.
    • Any load of data with incompatible types will fail.
    • The filtering option in the input mapping section is unavailable for tables with defined data types. If filtering is crucial for your workflow, consider using SQL, Python, or even no-code transformations to filter the data and create a new filtered table (see the sketch below).
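
For instance, a Snowflake SQL transformation can produce the filtered table. A minimal sketch with hypothetical table and column names:

-- filter in SQL instead of the input mapping data filter; names are illustrative
CREATE TABLE "events_filtered" AS
SELECT *
FROM "events"
WHERE "status" = 'active';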

How to Create a Typed Table Based on a Non-Typed Table

For example, we have a non-typed table non_typed_table with the following definition:

Non-typed table schema

And with the following data:

Sample data

To create a typed table based on non_typed_table, create a new transformation, choose non_typed_table as the table input mapping (or rely on the read-only input mapping), and choose typed_table as the table output mapping. The output table must not already exist; otherwise, it will not be created as a typed table.

Create transformation

In the queries section, add a SQL query that casts the columns to the desired types. In this step, you should provide the proper casting for your data. In the following example, note the custom format used to parse the date column.

CREATE TABLE "typed_table" AS
    SELECT 
        CAST(ntt."id" AS VARCHAR(64)) AS "id",
        CAST(ntt."id_profile" AS INTEGER) AS "id_profile",
        TO_TIMESTAMP(ntt."date", 'DD.MM.YYYY"T"HH24:MI:SS') AS "date",
        CAST(ntt."amount" AS INTEGER) AS "amount"
    FROM "non_typed_table" AS ntt;

Run the transformation and wait until it finishes.

The schema of the newly created table typed_table should look like this:

Typed table schema

You can see the NATIVE TYPES label after the table name, which means that the table is typed. Table columns should have the same data types as in the transformation query.

If the destination table already exists, and you want to keep the same name, you must first rename the original table (e.g., non_typed_table_bkp). Then, create a new table using the transformation described above.

Note that incremental loading cannot be used in this case.