Part 5 - Ad-hoc Data Analysis

After you have loaded your tables (either manually or using an extractor), manipulated the data in SQL, written it into Tableau BI or GoodData BI, and set everything to run automatically, let's take a look at some additional KBC features related to ad-hoc analysis.

This part of the tutorial shows how to work with arbitrary data in Python in a completely unrestricted way. Although our examples use the Python language, the very same can be achieved using the R language.

Before you start, you should have a basic understanding of the Python language.

Introduction

Let’s say you want to experiment with the US unemployment data. It is provided by the U.S. Bureau of Labor Statistics (BLS), and the dataset A-10 contains unemployment rates by month. The easiest way to access the data is via Google Public Data, which contains a dataset called Bureau of Labor Statistics Data.

Google Public Data can be queried using BigQuery and brought into Keboola Connection (KBC) with the help of our BigQuery extractor. Preview the table data in Google BigQuery.

Using BigQuery Extractor

To work with Google BigQuery, create an account and enable billing. Remember, querying public data is free only up to 1 TB a month. Then create a Google Storage bucket to serve as temporary storage for off-loading the data from BigQuery.

Note: If setting up the BigQuery extractor seems too complicated to you, export the query results to Google Sheets and load them from Google Drive. Or, export them to a CSV file and load them from local files.

Preparing

To create a Google Storage Bucket, go to the Google Cloud Platform console and select Storage.

Screenshot - Google Cloud Platform

Create a new bucket:

Screenshot - Google Cloud Storage

Enter the bucket’s name and Storage Class (Regional is okay for our use):

Screenshot - Create Bucket
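
If you prefer code over clicking through the console, the same bucket can be created with the google-cloud-storage client library. This is just a sketch, assuming the library is installed and your credentials are configured; remember that bucket names are globally unique:

from google.cloud import storage

# Sketch of the console steps above; 'bls-tutorial' must be replaced
# with a globally unique bucket name, and the location is an assumption.
client = storage.Client()
bucket = client.create_bucket('bls-tutorial', location='us-central1')
print('Created bucket', bucket.name)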

Extracting Data

Now you’re ready to load the data into KBC. Go to the extractor section, and click New Extractor:

Screenshot - Extractors

Use the search and find the BigQuery extractor:

Screenshot - BigQuery Extractor

Create a new configuration and name it, e.g., bls-unemployment:

Screenshot - New Configuration

Then authorize the account:

Screenshot - Big Query Configuration

Name the authorization, and follow the on-screen instructions:

Screenshot - Big Query Authorization

If the authorization is successful, configure the extraction:

Screenshot - Big Query Authorized

Select your Google Project and the bucket (bls-tutorial) you created above:

Screenshot - Big Query Configuration Detail

After that, configure the actual extraction queries by clicking the New Query button:

Screenshot - Big Query Bucket Configured

Name the query, e.g., unemployment rates, and paste the following into the SQL query field:

SELECT * FROM
  `bigquery-public-data.bls.unemployment_cps`
WHERE
  series_id = "LNS14000000"
ORDER BY date

The LNS14000000 series picks only the unemployment rates. Uncheck the Use Legacy SQL setting, and Save the query configuration:

Screenshot - Query Configuration
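
Before running the whole configuration, you can optionally sanity-check the query outside KBC. A minimal sketch using the google-cloud-bigquery client library (assuming it is installed and application default credentials are set up):

from google.cloud import bigquery

# Run the same query directly against BigQuery and peek at a few rows.
client = bigquery.Client()
sql = """
SELECT * FROM `bigquery-public-data.bls.unemployment_cps`
WHERE series_id = "LNS14000000"
ORDER BY date
LIMIT 5
"""
for row in client.query(sql).result():
    print(row['date'], row['value'])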

Now run the configuration to bring the data to KBC:

Screenshot - Finished Configuration

Running the extractor creates a background job that (see the sketch after this list)

  • executes the queries in Google BigQuery.
  • saves the results to Google Cloud Storage.
  • exports the results from Google Cloud Storage and stores them in specified tables in Keboola Connection Storage.
  • removes the results from Google Cloud Storage.
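
For the curious, this off-loading pattern can be sketched with the Google client libraries. It is a simplified, hypothetical illustration of what the extractor automates for you, not its actual implementation; names and locations are placeholders:

from google.cloud import bigquery, storage

bq = bigquery.Client()

# 1. Execute the query; BigQuery keeps the result in a temporary table.
query_job = bq.query(
    'SELECT * FROM `bigquery-public-data.bls.unemployment_cps` '
    'WHERE series_id = "LNS14000000" ORDER BY date'
)
query_job.result()  # wait for the query to finish

# 2. Export the result table to the Cloud Storage bucket as a CSV file.
uri = 'gs://bls-tutorial/unemployment-rates.csv'
bq.extract_table(query_job.destination, uri).result()

# 3. Download the CSV (this is where the data would enter KBC Storage) ...
blob = storage.Client().bucket('bls-tutorial').blob('unemployment-rates.csv')
blob.download_to_filename('unemployment-rates.csv')

# 4. ... and remove the temporary file from the bucket.
blob.delete()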

When a job is running, a small orange circle appears under Last runs, along with RunId and other info on the job. Green is for success, red for failure. Click on the indicator, or the info next to it, for more details. Once the job is finished, click on the names of the tables to inspect their contents.

Exploring Data

To explore the data, go to Transformations, and click Sandbox. Provided automatically for each user and project, a sandbox is an isolated environment in which you can experiment without interfering with any production code.

Screenshot - Transformations

Click on New Sandbox next to Jupyter:

Screenshot - Create Sandbox

Select the unemployment rates table (in.c-keboola-ex-google-bigquery.unemployment-rates) and click Create Sandbox. Wait for the process to finish:

Screenshot - Sandbox Configuration

When finished, click the Connect button to open the web version of the Jupyter Notebook, which allows you to run arbitrary code:

Screenshot - Sandbox Credentials

When prompted, enter the password from the Sandbox screen:

Screenshot - Sandbox Login

You can now run arbitrary Python code using common data science tools such as Pandas or Matplotlib. For instance, to load the table, use:

import pandas
df = pandas.read_csv("/data/in/tables/in.c-keboola-ex-google-bigquery.unemployment-rates.csv", sep=',')
df.head()
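
If you are not sure about the exact file name, you can list everything the sandbox received; a quick check using only the standard library:

import os
sorted(os.listdir('/data/in/tables'))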

The path /data/in/tables/ is the location for loaded tables; they are loaded as simple CSV files. Once your table is loaded, you can play with it:

import matplotlib.pyplot as plt
years = df.groupby(df['year'])['value'].mean()
years.plot(kind='line', color='orange')
plt.xlabel("Year")
plt.ylabel("Average %")
plt.suptitle('US Unemployment Rate', size=15)
plt.show()

Screenshot - Sandbox Result

Adding Libraries

Now that you can experiment with the U.S. unemployment data extracted from Google BigQuery (or any other data extracted in any other way), you can do the same with the EU unemployment data. Available at Eurostat, the unemployment dataset is called tgs00010.

There are a number of ways to get the data from Eurostat – for example, you can download it in TSV or XLS format. To avoid downloading a possibly large data set to your hard drive, Eurostat also provides a REST API, which could be processed using the Generic Extractor. However, the API returns data in the JSON-stat format, which encodes tables as a flat list of values in row-major order. Even though it is possible to import such data into KBC, additional processing would be needed to obtain plain tables.
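
To see what row-major means here, consider a toy sketch; the dimensions and values below are made up, but they mimic how JSON-stat stores a whole table as one flat list (the last dimension varies fastest):

from itertools import product

# Two made-up dimensions and six values in row-major order.
dims = {'geo': ['AT', 'BE'], 'time': ['2015', '2016', '2017']}
values = [5.7, 6.0, 5.5, 8.5, 7.8, 7.1]

# Expand the flat list back into one row per dimension combination.
rows = [dict(zip(dims, combo), value=v)
        for combo, v in zip(product(*dims.values()), values)]
for row in rows:
    print(row)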

To save time, use a tool designed for exactly that – pyjstat. It is a Python library that reads JSON-stat data directly into a Pandas data frame. Although this library is not installed by default in the Jupyter Sandbox environment, nothing prevents you from installing it.

Working with Custom Libraries

Use the following code to download the desired data from Eurostat:

import subprocess
import sys

# pyjstat is not part of the default sandbox environment, so install it first.
subprocess.call([sys.executable, '-m', 'pip', 'install', '--disable-pip-version-check', '-q', 'pyjstat'])

from pyjstat import pyjstat

# Read the JSON-stat response directly into a Pandas data frame.
dataset = pyjstat.Dataset.read('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/tgs00010?sex=T&precision=1&unit=PC&age=Y_GE15')
df = dataset.write('dataframe')
df.head()

The URL was built using the Eurostat Query Builder. Also note that installing a library from within Python code must be done via pip install (hence the subprocess call above).
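
If you would rather assemble such a URL in code than in the Query Builder, a minimal sketch using only the standard library (the parameters mirror the ones in the URL above):

from urllib.parse import urlencode

base = 'http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/tgs00010'
params = {'sex': 'T', 'precision': 1, 'unit': 'PC', 'age': 'Y_GE15'}
print(base + '?' + urlencode(params))

Now that you have the data, feel free to play with it: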

years = df.groupby(df['time'])['value'].mean()
years.plot(kind='line', color='orange')
plt.xlabel("Year")
plt.ylabel("Average %")
plt.suptitle('EU Unemployment Rate', size=15)
plt.show()

Screenshot - Sandbox Result

Wrap-up

You have just learned to do completely ad-hoc analysis of various data sets. If you need to run the above code regularly, simply copy and paste it into a Transformation.

The above tutorial used the Python language and the Jupyter Notebook. The same can be done in the R language using RStudio. For more information about sandboxes (including disk and memory limits), see the corresponding documentation.

Final Note

This is the end of our stroll around Keboola Connection. On our walk, we skipped quite a few things: Applications, Python and R transformations, and Redshift and Snowflake features, to name a few. However, teaching you everything was not really the point of this tutorial. We wanted to show you how Keboola Connection can help connect different systems together.

Return to the beginning or contact us.