Handling common CSV operations with Cure
Introduction
One of the downsides of writing a utility like Cure that no one really uses is you can’t rely on Google to remember how to do things. So, I’ve written this post to help me remember how to use it for both common and obscure tasks.
If you are unfamiliar with Cure, it is a utility that helps you manipulate, clean, and export CSV files. It is written in Ruby, and uses a DSL to define the operations you want to perform on your data.
Common use cases are to manipulate CSV files into other CSV files, or to ingest cleaned data into your application. Unlike a lot of other tools, Cure is designed with complex and oddly formatted CSV files (think invoices that companies might print out) in mind.
This post assumes you have installed Cure, if you haven’t, visit Cure for instructions/Dockerfile.
Full docs can be found here.
Common Operations
- Joining multiple CSVs into one
- Storing data on disk instead of in-memory
- Cataloging templates with metadata
Extraction
- White/Blacklisting columns
- Extracting two different parts of a CSV file
- Using headers not found in first row
- Ignoring parts of a CSV
Building
Cleaning
Exporting
Projects and automation
Advanced
- Leveraging
call
in your templates - Maintaining state during transforms
- Keeping transforms between runs
Common Operations
Joining multiple CSVs into one
Imagine you had two CSV files, user_spend.csv
and users.csv
, and you wanted to join them into one file.
users.csv
UserId | Name |
---|---|
1 | Johnny |
2 | Tim |
user_spend.csv
UserId | Spend |
---|---|
1 | 100 |
2 | 200 |
and you wanted output.csv
to look like:
UserId | Name | Spend |
---|---|---|
1 | Johnny | 100 |
2 | Tim | 200 |
Your query step can be as complex as you like, and can include any number of joins, unions, or subqueries.
Storing data on disk instead of in-memory
By default, Cure stores working data in an in-memory Sqlite3 database, but you can also store it on disk.
There are a few reasons you’d want to keep the database.
- If your file is very, very big (gigabytes or larger). Although Cure is pretty good at streaming large files, it can still run out of memory.
- If you want to keep the data between runs. By default, Cure will drop the table on initialisation, but you can change this behaviour to keep the translations between runs. Lets say you want to anonymise a group of invoices, but you want the same randomised account number between all files, you can do this by keeping the database.
Cataloging templates with metadata
Cure allows you to add metadata to your templates to help you remember what they do, or to help others understand what they do.
This metadata can be searched.
Extraction
White/Blacklisting columns
If you want to exclude or include columns you can use white/blacklisting.
Extracting two different parts of a CSV file
If you have a CSV file that has two different parts, you can extract them separately.
A | B | C | |
---|---|---|---|
1 | Employees | ||
2 | Name | Position | Salary |
3 | John Doe | Manager | $60,000 |
4 | Jane Smith | Developer | $50,000 |
5 | |||
6 | Projects | ||
7 | Project Name | Deadline | Budget |
8 | Website Redesign | 2022-12-31 | $20,000 |
9 | Mobile App | 2023-06-30 | $50,000 |
Note that the named range notation follows the same rules as Excel (A1:B2
to specify a range).
Using headers not found in first row
If you have headers in an obscure place, you can specify the row they are expected to be in.
A | B | C | |
1 | John Doe | Manager | $60,000 |
2 | Jane Smith | Developer | $50,000 |
3 | Name | Position | Salary |
Ignoring parts of a CSV
If you have rows you want to exclude, you can remove them with including
.
Input:
Name | Position | Salary |
---|---|---|
John Doe | Manager | $60,000 |
Jane Smith | Developer | $50,000 |
Bob Johnson | Analyst | $45,000 |
Output:
Name | Position | Salary |
---|---|---|
John Doe | Manager | $60,000 |
Jane Smith | Developer | $50,000 |
Bob Johnson | Analyst | $45,000 |
Building
Adding new columns and deriving data
As useless as a new empty column sounds, it can be used for a placeholder column to be used later. A common example of this may be if you want to add a variable to each row. For example, at the top of a spreadsheet, you may have a date, but you want to add that to each row.
In this example, we will use a translation with a proc to calculate the total cost of an item. Note proc’s all get handled the cell source (default in this instance is 0), and the context of the row, which includes the row itself.
Input:
Item | Item Count | Item Cost |
---|---|---|
Apples | 10 | 1 |
Bananas | 15 | 3 |
Cherries | 20 | 1.50 |
Output:
Item | Item Count | Item Cost | Total Cost |
---|---|---|---|
Apples | 10 | 1 | 10 |
Bananas | 15 | 3 | 45 |
Cherries | 20 | 1.50 | 30 |
Copying data between columns
Self-explanatory, but you can copy data between columns.
col_a |
---|
a |
changes to
col_a | col_a_copy |
---|---|
a | a |
Renaming or removing columns
Very self-explanatory, but you can rename or remove columns.
Cleaning
Anonymising data
If you have sensitive data, you can anonymise it.
It is important to note, once the vendor matches again from another row or column, it will be the same translated value.
Rot13
is not a strong anonymisation technique, but it is a good example of how you can use a proc to achieve an outcome.
Replacing values with variables/placeholders
You can replace values with variables extracted from the sheet, or from static placeholders.
Invoice date and total are extracted from the sheet, but the account number is a static placeholder.
This adds all of those columns to each row.
Exporting
Exporting data to a database instead of CSV
Depending on the amount of data you are working with, there are several ways you might want to approach loading data into a database. You might want to do batch inserts or just persist every single row, but importantly it doesn’t really change in principle of how to yield out that data.
The first part is to create a class (you could use a lambda/proc if you wanted to, it just needs to respond to #call
)
to hold the logic.
row_handler.rb
The RowHandler
class will store all the rows in memory, you could easily modify this to write to a file or a database,
either singlularly or batched. Note that the named_range
is the key to access the rows in the RowHandler
instance, we can store
all data in one instance, or we could hand them individual ones too if desired.
Since you create the RowHandler
instance, you have full control of what its instatiated with, you can hand it a database connection
and off you go.
Splitting a CSV into multiple smaller files
Lets say you have a CSV that is 200,000 rows long. To make it easier to work with, you might want to split it into 10 files of 20,000 rows each.
Projects and automation
Creating a new project
If you have Cure installed, creating a new project is as simple as running cure new [name]
.
Automating common tasks with a script
There are many ways to automate a Cure job, but the simplest creating a bash script.
Below is an example of a script that will run a simple translation template on all CSV files in a directory.