Vero logo
  1. All Posts
  2. /
  3. Clean, merge or filter CSVs with DuckDB

Clean, merge or filter CSVs with DuckDB

Data Management

Nearly any task in marketing means using a CSV. Even with the best marketing stack in the world…it just seems to happen!

If you’re a B2C or B2B PLG business, CSVs get big, fast. Working with large CSVs in Micsoroft Excel or Google Sheets has it’s limits.

DuckDB is an awesome tool that any technical marketer or marketing operations user can put to work when interacting with CSVs.

A real world use case

Imagine I have a CSV called websites_using_shopify.csv. A list of public websites using Shopify. Imagine this CSV contains the following columns:

domain,linkedin_url,last_seen,page_rank

Imagine I also have a CSV called websites_using_amplitude.csv. It has the same columns and lists out all of the public websites using Amplitude.

I want to filter the list of websites using Shopify and subtract any that are using Amplitude. I want to keep all the columns intact. I.e. not lose any data.

I could open both files in Excel. Then copy/paste websites_using_amplitude.csv into a new tab. Then use some sort of COUNTIF or VLOOKUP to see if the domain in the websites_using_shopify.csv tab are also in the websites_using_amplitude.csv tab. Then filter or sort and, finally, delete the records I don’t need.

But that is rather painful. And, at scale, Excel may well crash before I get my results.

Using DuckDB

For those familiar with SQL and relational databases, think of DuckDB as enabling you to interact with your CSV via SQL without copying your data anywhere or loading it into a database.

Once installed, here’s how we can handle our real world use case.

Firstly, we start duckdb in our terminal. Make sure your terminal is inside a directory containing both CSV files.

We "load" websites_using_shopify.csv into a table like so:

create table shopify_websites from (select * from websites_using_shopify.csv);

This "creates" a table in memory. It’s not stored anywhere on your hard drive or in the cloud, it’s just in your temporary memory and will be deleted once we close duckdb. This is awesome as it’s perfect for playing around with CSVs.

Let’s create a similar table for websites using Amplitude:

create table amplitude_websites from (select * from websites_using_amplitude.csv);

Now we’ve got our two temporary tables, we can do this:

select * from shopify_websites
where domain not in (select domain from amplitude_websites);

This will filter out any domains that appear in the Amplitude CSV, whilst preserving all the data in the Shopify CSV.

We can easily write this to a new CSV file:

copy (
    select * from shopify_websites
    where domain not in (select domain from amplitude_websites)
) to 'filtered_websites_using_shopify.csv';

There you have it: a filtered CSV of websites using Shopify, preserving all of the original columns.

DuckDB is super handy and I encourage you to give it a try next time you’ve got to work with big CSVs.

Want to send more personalized mobile and email messages to your users?

Check out Vero, customer engagement software designed for product marketers. Message your users based on what they do (or don't do).

Get started

Consider signing up for a free trial. No credit card required.

Vero Cloud Workflows