Clean, merge or filter CSVs with DuckDB
Chris Hexton
Nearly any task in marketing means using a CSV. Even with the best marketing stack in the world…it just seems to happen!
If you’re a B2C or B2B PLG business, CSVs get big, fast. Working with large CSVs in Microsoft Excel or Google Sheets has its limits.
DuckDB is an awesome tool that any technical marketer or marketing operations user can put to work when interacting with CSVs.
A real world use case
Imagine I have a CSV called websites_using_shopify.csv: a list of public websites using Shopify. Imagine this CSV contains the following columns:
domain,linkedin_url,last_seen,page_rank
Imagine I also have a CSV called websites_using_amplitude.csv. It has the same columns and lists out all of the public websites using Amplitude.
I want to filter the list of websites using Shopify by subtracting any that are also using Amplitude. I want to keep all the columns intact, i.e. not lose any data.
I could open both files in Excel. Then copy/paste websites_using_amplitude.csv into a new tab. Then use some sort of COUNTIF or VLOOKUP to see whether the domains in the websites_using_shopify.csv tab also appear in the websites_using_amplitude.csv tab. Then filter or sort and, finally, delete the records I don’t need.
But that is rather painful. And, at scale, Excel may well crash before I get my results.
Using DuckDB
For those familiar with SQL and relational databases, think of DuckDB as enabling you to interact with your CSV via SQL without copying your data anywhere or loading it into a database.
Once installed, here’s how we can handle our real world use case.
Firstly, we start duckdb in our terminal. Make sure your terminal is inside a directory containing both CSV files.
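As a quick sanity check (and to see the "no loading required" idea from above in action), you can query the CSV file directly before creating any tables. This is just a minimal preview; the limit 5 keeps the output small:
select * from 'websites_using_shopify.csv' limit 5;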
We "load" websites_using_shopify.csv into a table
like so:
create table shopify_websites as select * from 'websites_using_shopify.csv';
This "creates" a table in memory. It’s not
stored anywhere on your hard drive or in the cloud, it’s just in
your temporary memory and will be deleted once we close
duckdb. This is awesome as it’s perfect for playing
around with CSVs.
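If you want to double-check the load worked, a couple of optional sanity checks do the trick:
describe shopify_websites;
select count(*) from shopify_websites;
describe shows the column names and types DuckDB inferred from the CSV, and count(*) confirms the number of rows.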
Let’s create a similar table for websites using Amplitude:
create table amplitude_websites as select * from 'websites_using_amplitude.csv';
Now we’ve got our two temporary tables, we can do this:
select * from shopify_websites
where domain not in (select domain from amplitude_websites);
This will filter out any domains that appear in the Amplitude CSV, whilst preserving all the data in the Shopify CSV.
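One small caveat: if the domain column in the Amplitude table ever contains NULL values, not in can return zero rows. If you hit that, an equivalent not exists version of the same filter sidesteps the issue:
select * from shopify_websites s
where not exists (
  select 1 from amplitude_websites a
  where a.domain = s.domain
);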
We can easily write this to a new CSV file:
copy (
select * from shopify_websites
where domain not in (select domain from amplitude_websites)
) to 'filtered_websites_using_shopify.csv';
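The copy above relies on DuckDB’s defaults for CSV output. If you’d rather be explicit about the header and delimiter, the same statement accepts options:
copy (
select * from shopify_websites
where domain not in (select domain from amplitude_websites)
) to 'filtered_websites_using_shopify.csv' (header, delimiter ',');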
There you have it: a filtered CSV of websites using Shopify, preserving all of the original columns.
DuckDB is super handy and I encourage you to give it a try next time you’ve got to work with big CSVs.