OpenRefine Transformation

OpenRefine is an open-source project based on Google Refine. OpenRefine transformations support text and number operations on rows and cells, such as replacements and calculations, which makes them well suited to cleaning up a dataset.

Environment

OpenRefine transformation scripts run in an isolated Docker environment. The installed OpenRefine version is 2.6-rc.2.

Memory and Processing Constraints

The Docker container running the OpenRefine server and the transformation is allocated 8 GB of memory, of which 7 GB is allocated directly to OpenRefine.

In our experience, this memory limit is enough to process roughly 2 million rows or 0.5 GB of raw CSV data.
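
Before uploading a file, it can help to check it against these rule-of-thumb figures. The following is a minimal sketch, not part of the platform: the function name check_csv_size is hypothetical, 0.5 GB is assumed to mean 0.5 * 1024^3 bytes, and a file is treated as fitting only if it stays under both figures. It also assumes a UTF-8 CSV file with a header row.

```python
import csv
import os

# Rule-of-thumb figures taken from the text above. They are estimates
# based on experience, not hard limits enforced by the platform.
MAX_ROWS = 2_000_000
MAX_BYTES = 512 * 1024 * 1024  # assumption: 0.5 GB read as 0.5 * 1024**3 bytes


def check_csv_size(path):
    """Return (row_count, size_in_bytes, fits) for the CSV file at `path`.

    `fits` is True only when both the row count and the file size are
    within the rule-of-thumb figures above (an assumption of this sketch).
    """
    size_in_bytes = os.path.getsize(path)
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, if there is one
        row_count = sum(1 for _ in reader)
    fits = row_count <= MAX_ROWS and size_in_bytes <= MAX_BYTES
    return row_count, size_in_bytes, fits
```

Calling check_csv_size("input.csv") on a local file (the file name is a placeholder) returns the number of data rows, the size in bytes, and whether the file is likely to fit. Rows are counted with the csv module rather than by counting newlines, so quoted fields that contain line breaks are counted once.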

The maximum running time is 6 hours.

Inputs/Outputs

As OpenRefine uses only one dataset as its input, and produces one dataset as its result, the input and output mapping is limited to exactly one input/output.

Development Tutorial

To develop and debug OpenRefine transformations, you can replicate the execution environment on your local machine. To do so, install OpenRefine locally, preferably the same version that we use (2.6-rc.2).

Screenshot - OpenRefine Welcome Screen

To simulate the input and output mapping, all you need to do is create a project with the desired CSV file.

Screenshot - OpenRefine CSV Load

Then use the UI to modify the file according to your needs, and click the Extract button in the Undo/Redo tab.

Screenshot - OpenRefine Operation History

Finally, copy the Operation History JSON into the transformation script.

Screenshot - OpenRefine Extract Operation History
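
The extracted Operation History is a JSON array with one object per step. The following is a minimal sketch of how you might sanity-check the copied JSON before pasting it into the transformation script. The helper name list_operations, the column name "name" and the example expression are hypothetical, and the exact fields in each step depend on the operation and on your OpenRefine version.

```python
import json

# Hypothetical example of an extracted operation history with one step.
# The column name "name" and the expression are placeholders; real
# histories contain whatever steps you performed in the UI.
EXAMPLE_HISTORY = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column name using expression value.trim()",
    "engineConfig": {"mode": "row-based", "facets": []},
    "columnName": "name",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]
"""


def list_operations(history_json):
    """Parse an operation history and return the "op" identifier of each step.

    Raises ValueError if the text is not a JSON array of objects that
    each carry an "op" key.
    """
    try:
        steps = json.loads(history_json)
    except json.JSONDecodeError as error:
        raise ValueError(f"not valid JSON: {error}") from error
    if not isinstance(steps, list):
        raise ValueError("expected a JSON array of operations")
    operations = []
    for index, step in enumerate(steps):
        if not isinstance(step, dict) or "op" not in step:
            raise ValueError(f"step {index} is not an object with an 'op' key")
        operations.append(step["op"])
    return operations
```

Running list_operations(EXAMPLE_HISTORY) lists the operation identifiers in order, which is a quick way to confirm that the history was copied completely and is still well-formed JSON.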

OpenRefine Sandbox

We are working hard on preparing OpenRefine Sandboxes. Once they are available, we will launch an OpenRefine server for you and load the desired table into the environment. Stay tuned for more information and, in the meantime, use OpenRefine locally.

Public Beta Warning

OpenRefine transformations are currently in public beta. Some features may not work as expected. Please bear with us while we polish the remaining details. Feedback is welcome at support@keboola.com.