OpenRefine is an open-source project based on Google Refine. OpenRefine transformations support text and number operations on rows and cells, such as replacements and calculations, which makes them well suited to cleaning up a dataset.
The OpenRefine transformation scripts run in an isolated Docker environment.
The installed OpenRefine version is 2.6-rc.2.
The Docker container running the OpenRefine server and the transformation has 8 GB of allocated memory, of which 7 GB is allocated directly to OpenRefine. In our experience, this memory limit is enough to process roughly 2 million rows or 0.5 GB of raw CSV data. The maximum running time is 6 hours.
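As a rough pre-flight check, the following sketch counts the data rows and the file size of a local CSV and compares them with the approximate figures above. The function name `fits_openrefine_limits` is hypothetical, and reading 0.5 GB as 500,000,000 bytes is an assumption. The figures are experience-based guidelines rather than exact thresholds, so treat the result as a hint only.

```python
import csv
import os

# Approximate figures taken from the text above (guidelines, not exact thresholds).
MAX_ROWS = 2_000_000
MAX_BYTES = 500_000_000  # assumption: 0.5 GB read as 0.5 * 10**9 bytes


def fits_openrefine_limits(path):
    """Return (ok, rows, size_bytes) for the CSV file at `path`.

    `rows` counts data records, excluding the header row.
    """
    size_bytes = os.path.getsize(path)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if any
        rows = sum(1 for _ in reader)
    # Flag the file if it exceeds either guideline.
    ok = rows <= MAX_ROWS and size_bytes <= MAX_BYTES
    return ok, rows, size_bytes
```

Calling `fits_openrefine_limits("input.csv")` on your own file (the file name here is a placeholder) returns the verdict together with the measured row count and size, so you can see how close the file is to either figure.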
Because OpenRefine takes only one dataset as its input and produces one dataset as its result, the input and output mapping is limited to exactly one input and one output.
To develop and debug OpenRefine transformations, you can replicate the execution environment on your local machine. To do so, install OpenRefine, preferably the same version that we run.
To simulate the input and output mapping, all you need to do is create a project with the desired CSV file.
Then use the UI to modify the file according to your needs, and click the Extract button in the Undo/Redo tab.
Finally, copy the Operation History JSON into the transformation script.
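Before pasting the extracted history into the transformation script, it can help to confirm that it is a well-formed JSON array of operations. The sketch below is illustrative only: the sample history shows the general shape of an OpenRefine text-transform step, but the column name `name`, the GREL expression, and the helper `list_operations` are hypothetical and not part of OpenRefine or Keboola.

```python
import json

# Illustrative operation history, shaped like the JSON shown by the Extract dialog.
# The column name "name" and the GREL expression are hypothetical examples.
OPERATION_HISTORY = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column name using expression grel:value.trim()",
    "engineConfig": {"mode": "row-based", "facets": []},
    "columnName": "name",
    "expression": "grel:value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]
"""


def list_operations(history_json):
    """Parse the history and return the 'op' identifier of each step."""
    operations = json.loads(history_json)
    if not isinstance(operations, list):
        raise ValueError("Operation History must be a JSON array")
    # Collect the positions of steps that are not objects or lack an 'op' field.
    missing = [
        i for i, step in enumerate(operations)
        if not isinstance(step, dict) or "op" not in step
    ]
    if missing:
        raise ValueError(f"steps without an 'op' field: {missing}")
    return [step["op"] for step in operations]


print(list_operations(OPERATION_HISTORY))
```

A check like this only catches copy-and-paste damage such as a truncated array. It does not verify that the operations are valid for the installed OpenRefine version.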
We are working hard on preparing OpenRefine Sandboxes. A sandbox will launch an OpenRefine server for you and load the desired table into the environment. Stay tuned for more information and, in the meantime, use OpenRefine locally.
OpenRefine transformations are currently in public beta. Some features may not work as expected. Please bear with us and provide feedback at support@keboola.com.
© 2024 Keboola