This data flow template provides you with a set of Snowflake SQL procedures for the testing and monitoring of data quality.
Shared codes for data quality testing represent a set of Snowflake SQL procedures
that can be used in a Snowflake SQL transformation to process various data tests and to log the test results into a unified table named DQ_RESULTS_LOG
.
This set of procedures includes one that can conditionally ABORT transformation processing. This is useful when you want the processing to stop following a failed test, avoiding the spending of unnecessary credits and preventing the loading of corrupted data into your transformation outputs.
The flow, in a nutshell:
First we will create a set of shared codes – Snowflake SQL procedures for data quality tests and the logging of test results.
We will then create dummy datasets using a Python transformation.
Snowflake SQL transformations will use the generated dummy datasets to demonstrate how you can use the SQL procedures in your Snowflake SQL transformations.
Finally, data quality tests will produce a result log output table to be used for continuous monitoring of the results.
These are the available tests:
ID | DESCRIPTION |
---|---|
TEST_COLUMN_NULL_OR_EMPTY | Tests for presence of Null (or empty string) values in a given column in a given table. Fails when there are more than 0 null values. |
TEST_COLUMN_NULL | Tests for presence of Null values in a given column in a given table. Fails when there are more than 0 null values. |
TEST_COLUMN_UNIQUE | Tests whether all records within a given set of columns in a given table are unique. Fails when there are more than 0 duplicate records. |
TEST_TIME_SERIES_COMPLETE | Tests whether there are more than X records in a table for each datepart (day, hour, …) in time series defined by date column. Fails when there are dates with fewer records than expected. |
TEST_TIME_SERIES_COMPLETE_RANGE | Tests whether there are more than X records in a table for each datepart (day, hour, …) in time series defined by date column in a provided range. Fails when there are dates with fewer records than expected. |
TEST_COLUMN_FOREIGN_REF | Tests whether each record’s reference key in a table matched a reference record in a given reference table. Fails when there are more than 0 records with no match. |
TEST_SUM_IN_GROUPS_EQUAL | Tests whether the sum of values in a given column in a group (optional) in a table is the same in two tables (groups in both tables can differ). Fails when the sum differs by more than 0. |
TEST_ANOMALY_NUMERIC | Tests anomalies in numeric columns based on a formula (mean() - value) > 2.5 * std()). In words - if a value in a column differs from the “average” value by more than a standard deviation, it’s an anomaly. |
TEST_COLUMN_DATA_TYPE | Tests whether a data type of a given column in a given table is of an expected data type |
TEST_COLUMN_VALUE_DATA_TYPE | Tests whether values in a given column in a given table are of an expected data type. Returns values that don’t match an expected data type. Useful for tables without defined data types. |
TEST_VALUE_CONTAIN | Tests whether a value in a given column in a given table contains a given string. Fails when a value doesn’t contain a given string. |
TEST_VALUE_NOT_CONTAIN | Opposite of test_value_contain. Fails when a value DOES contain a given string. |
TEST_VALUE_BEGIN | Tests whether a value in a given column in a given table begins with a given string. Fails when a value doesn’t begin with a given string. |
TEST_VALUE_NOT_BEGIN | Opposite of test_value_begin. Fails when a value DOES begin with a given string. |
TEST_VALUE_END | Tests whether a value in a given column in a given table ends with a given string. Fails when a value doesn’t end with a given string. |
TEST_VALUE_NOT_ENDS | Opposite of test_value_end. Fails when a value DOES end with a given string. |
TEST_VALUE_REGEXP | Tests whether a value in a given column in a given table matches a given regular expression. Fails when a value doesn’t match with a given regexp. |
TEST_VALUE_NOT_REGEXP | Opposite of test_value_regexp. Fails when a value DOES match with a given regexp. |
TEST_TABLE_COMPARE_STRUCTURE | Tests the structure of two tables. Compares a list of Column names (and data types) in both tables and fails if they differ. |
TEST_TABLE_COMPARE_RECORDS | Tests records in two tables. Compares all records and fails if they differ. Outputs the diff. |
TEST_TABLE_EMPTY | Test if a table is empty (contains zero records). If there are zero records, test fails. |
TEST_VALUE_EQUAL | Tests that each value in a given column in a given table equals a given value. Fails when a value is not equal to a given value. |
TEST_VALUE_NOT_EQUAL | Opposite of test_value_equal. Fails when a value IS equal to a given value. |
TEST_VALUE_GREATER_THAN | Tests that each value in a given column in a given table is greater than a given value. Fails when a value is NOT greater than a given value. |
TEST_VALUE_LOWER_THAN | Tests that each value in a given column in a given table is lower than a given value. Fails when a value is NOT lower than a given value. |
TEST_VALUE_IN_RANGE | Tests that each value in a given column in a given table is between two given values. Fails when a value is NOT in a given range. |
TEST_VALUE_IN_SET | Tests whether all values in a given column in a given table are in a given set of values. Fails when a value is not in a given set of values. |
Column Name | Description | Sample Value |
---|---|---|
ID | Unique identifier of a performed test | 1d8fb4e8c486829db3387e429e0cb8a0 |
EXECUTION_TIME | Timestamp of an executed test | Mon, 23 Jan 2023 13:58:34 +0100 |
COMPONENT_ID | Component identifier | keboola.snowflake-transformation |
CONFIGURATION_ID | Unique identifier of a component configuration | 943845068 |
RUN_ID | Run ID (from job log) | 943845595 |
JOB_URL | Job URL | https://connection.keboola.com/admin/projects/9389/queue/943845595 |
TEST_ID | Identifier of a test | TEST_COLUMN_FOREIGN_REF |
TEST_QUERY | Particular test query showing which attributes were used for a test | TEST_COLUMN_FOREIGN_REF(‘orders’, ‘customers’, ‘order_id’, ‘pk’);…. |
TEST_NAME | Name of a test | TEST_COLUMN_FOREIGN_REF |
TEST_RESULT_VALUE | Result value - present if a test fails. Shows what values caused the test to fail. | [{“OCCURENCES”:0,”OFFENDER”:”61”} |
TEST_PARAMETERS | Detail of TEST_QUERY parameters | {“TABLE_NAME_MAIN” :”orders”, “COLUMN_NAME”: “order_id”, “TABLE_NAME_REF”: “customers”, “COLUMN_NAME_REF”: “pk”} |
TEST_RESULT | Result of a test (success/error) | error |
TEST_LEVEL | Level of a test (WARNING/FAIL). Fail test level should be used for aborting a transformation/flow right after such error fails. | WARNING |
The process is simple. You don’t need to configure anything. Just add the template to your project and then run it. Shared codes will be added to your project, and all steps of the flow will be executed in order to generate a sample SQL transformation, using the data tests and generating the result log table.
Select the corresponding template from the Templates tab in your Keboola project. When you are done, click + Use Template.
This page contains information about the template. Click + Use Template again.
You’ll be asked to write a name for the template instance you are about to create. You can use the template as many times as you want and still keep everything organized.
After clicking Next Step, you will see the template builder. You do not need to edit any configuration.
When you are finished, click Save in the top right corner. The template builder will create your new configuration and when it is done you will see the newly created flow.
Click Run Template.
Go to Transformations. You will find the newly added shared codes under the Shared Codes tab. The key shared code is the Data Quality Core. You will also find a new transformation Data Quality Core - Full Example, which shows how the shared codes can be used.
The transformation creates a new table in your Storage named DQ_RESULTS_LOG.
Important: Add the DQ_RESULT_LOG table to output mappings of each transformation for which you use the Data Quality Core shared code.
You will also find a transformation named Data Quality Core - ABORT/FAIL Example that demonstrates how to use the FAIL level of tests to ABORT a transformation in case such a test fails. It also uses the write_always parameter of output mapping in order to write the output even on a transformation error.