This part of the tutorial shows how to work with arbitrary data in Python in a completely unrestricted way. Although our examples use the Python language, the very same can be achieved using the R language.
Before you start, you should have a basic understanding of the Python language.
Let’s say you want to experiment with the US unemployment data. It is provided by the U.S. Bureau of Labor Statistics (BLS), and the dataset A-10 contains unemployment rates by month. The easiest way to access the data is via Google Public Data, which contains a dataset called Bureau of Labor Statistics Data.
To work with Google BigQuery, create an account, and enable billing. Remember, querying public data is only free up to 1TB a month. Then create a Google Storage Bucket as a temporary storage for off-loading the data from BigQuery.
Note: If setting up the BigQuery extractor seems too complicated to you, export the query results to Google Sheets and load them from Google Drive. Or, export them to a CSV file and load them from local files.
To create a Google Storage Bucket, go to the Google Cloud Platform console and select Storage.
Create a new bucket:
Enter the bucket’s name and Storage Class (Regional is okay for our use):
Now you’re ready to load the data into KBC. Go to the extractor section, and click New Extractor:
Use the search and find the BigQuery extractor:
Create a new configuration and name it, e.g.,
Then authorize the account:
Name the authorization, and follow the on-screen instructions:
If the authorization is successful, configure the extraction:
Select your Google Project and the bucket you have created (
After that configure the actual extraction queries by clicking the New Query button:
Name the query, e.g.,
unemployment rates and paste the following in the
SQL query field:
LNS14000000 series will pick the unemployment rates only. Uncheck the
Use Legacy SQL setting, and
Save the query configuration:
Now run the configuration to bring the data to KBC:
Running the extractor creates a background job that
When a job is running, a small orange circle appears under Last runs, along with RunId and other info on the job. Green is for success, red for failure. Click on the indicator, or the info next to it, for more details. Once the job is finished, click on the names of the tables to inspect their contents.
To explore the data, go to Transformations, and click on Sandbox. Provided for each user and project automatically, it is an isolated environment in which you can experiment without interfering with any production code.
Click on New Sandbox next to Jupyter:
Select the unemployment rates table (
in.c-keboola-ex-google-bigquery.unemployment-rates), click on Create Sandbox. Wait for the process to finish:
When finished, connect to the web version of the Jupyter Notebook. It allows you to run arbitrary code by clicking the Connect button:
When prompted, enter the password from the Sandbox screen:
/data/in/tables/ is the location for
loaded tables; they
are loaded as simple CSV files. Once your table is loaded, you can play with it:
Now that you can experiment with the U.S. unemployment data extracted from Google BigQuery (or any other data extracted in any other way),
you can do the same with the EU unemployment data. Available at Eurostat, the unemployment
dataset is called
There are a number of ways how to get the data from Eurostat – e.g., you can download it in TSV or XLS format. To avoid downloading the (possibly) lengthy data set to your hard drive, Eurostat provides a REST API for downloading the data. This could be processed using the Generic Extractor. However, the data is provided in JSON-stat format, which contains tables encoded using the row-major method. Even though it is possible to import them to KBC, it would be necessary to do additional processing to obtain plain tables.
To save time, use a tool designed for that – pyjstat. It is a python library which can read JSON-stat data directly into a Pandas data frame. Although this library is not installed by default in the Jupyter Sandbox environment, nothing prevents you from installing it.
Use the following code to download the desired data from Eurostat:
The URL was built using the Eurostat Query Builder.
Also note that installing a library from within the python code must be done using the
method and not the usual
pip install. Now that you have the data, feel free to play with it:
You have just learnt to do a completely ad-hoc analysis of various data sets. If you need to run the above code regularly, simply copy&paste it into a Transformation.
The above tutorial is done in the Python language using the Jupyter Notebook. The same can be done in the R language using RStudio. For more information about sandboxes (including disk and memory limits), see the corresponding documentation.