An introduction to analyzing your email data using SQL and a Redshift datastore.
Using data sent to a Redshift database from Vero, we’ll put together a chart that will help you start to conceptualize how you could build out a detailed and accurate Email Dashboard for your organization.
This will also further the skills you need to analyze your email marketing data when joining it up with the rest of your customer data, using your own data warehouse and SQL charting tools.
A few weeks ago, we talked about how you could graph your Net Churn using SQL. This was to introduce you to the world of data warehousing and detailed analysis using your own data.
The post generated huge interest so this week we’re focusing on email, an area in which we aim to set the benchmark for what’s possible.
Charting open rates for your latest newsletter campaigns
Here’s an example of the chart we’ll be building.
What you’ll need to do this yourself
Here’s the setup we use to get accurate and complete data and to chart it beautifully:
- We use Vero to send all of Vero’s emails. Vero’s campaign management platform helps us create and manage our customer journey so we know who we’re sending what, and when. We send both automated and newsletter email campaigns with Vero.
- Segment’s integration automatically plugs into Vero and allows Vero to send back data on all email deliveries, opens, clicks, unsubscribes and so on into their platform.
- RJ Metrics’ Pipeline product syncs all of our Segment behavioral data to a Redshift database.
- Our data warehouse is hosted on a basic Amazon Web Services (AWS) Redshift cluster.
- We chart our reports using Periscope Data.
- We also use our own database with data about our users (in this case, we won’t be using this data, but we will in the future). Our application uses a mixture of datastores, but a PostgreSQL database contains the most relevant data.
This might sound complex, but all of these steps are a one to five-click process to get running. Once running, you have a really robust pipeline of user interaction data that you can use to build out sophisticated analyses any time you want.
Building your ‘deliveries’ and ‘opens’ data models
When you’re getting started with your SQL email analysis, the hardest part is getting your data in order.
Message your users without syncing data
Connect CSVs, Google Sheets, Airtable, databases and more, define an audience and send a newsletter — all without syncing your data again.
The first step to creating the chart above is to create two SQL Views, one for your deliveries, and one for your opens. These views will represent a nice, clean view of all opens and all deliveries in your datastore, transformed and filtered to remove any junk. We’ll then use these tables to build out our charts.
Let’s dive into deliveries first.
Using the setup described above, your email data sends up in a Redshift database table via Segment. This is part of a large table called
track, representing every user action that you have tracked via Segment (note: this includes any on-site activity, not just email deliveries).
For this analysis, create an SQL view called
vero_analysis_deliveries_base. As you can see, this will view returning all of the raw fields you need for your analysis.
create or replace view vero_analysis_deliveries_base as ( select veroproduction.track.event as event, veroproduction.track.user_id as user_id, veroproduction.track.context__traits__email as context__traits_email, veroproduction.track.properties__campaign_name as properties__campaign_name, veroproduction.track.properties__email_subject as properties__email_subject, veroproduction.track.properties__email_type as properties__email_type, veroproduction.track.timestamp as timestamp from veroproduction.track );
This view can now be queried anytime and has only the data you need to create your email analysis charts.
The next step is to create an SQL view called
vero_analysis_deliveries_filtered. In this view, select everything from the previous view, and filter out anything you don’t need. In this case we don’t want to filter anything, but it’s worth highlighting this step to help you learn to structure your data better.
create or replace view vero_analysis_deliveries_filtered as ( select * from vero_analysis_deliveries_base where vero_analysis_sent_base.event = 'Email Delivered' and vero_analysis_deliveries_base.timestamp is not null );
At this point, these views include all the data we need to create our tables, so it’s time to transform and normalize the formats for various columns (such as date columsn).
In this case we only want to see
Email Delivered events (not the other Segment events) so you’ll want to filter out any other data in the
To do this, create a view called
vero_analysis_deliveries_transformed. This table will select everything from the
vero_analysis_deliveries_filtered table and transform various columns, like the
received_at column, into a consistent format.
create or replace view vero_analysis_deliveries_transformed as ( select vero_analysis_deliveries_filtered.user_id as user_id, vero_analysis_deliveries_filtered.context__traits_email as user_email, vero_analysis_deliveries_filtered.properties__campaign_name as campaign_name, vero_analysis_deliveries_filtered.properties__email_subject as campaign_subject, vero_analysis_deliveries_filtered.properties__email_type as campaign_type, vero_analysis_deliveries_filtered.timestamp::timestamp as date from [vero_analysis_deliveries_filtered] order by date desc );
That’s it! This is the final view you need to do the analysis. At this point the data is accurate, clean and it has it’s columns formatted in a way that makes it easy to query any time.
Creating your chart
Now that you’ve got everything you need, you can begin charting.
To create the chart mentioned in this post you need to query a table that has three columns:
deliveries (the number of emails that were delivered),
opens (the number of emails that were opened) and
open_rate (the percentage open rate).
Here’s the SQL to do this.
with latest_newsletters as ( select campaign_name, max(date) as date from [vero_analysis_deliveries_transformed] where campaign_type = 'newsletter' and campaign_name not like '%CLONE%' group by campaign_name order by max(date) desc limit 10 ), opens as ( select campaign_name, count(user_email) as opens from [vero_analysis_opens_transformed] group by campaign_name ), deliveries as ( select campaign_name, count(user_email) as deliveries from [vero_analysis_deliveries_transformed] group by campaign_name ) select opens.campaign_name, deliveries.deliveries, opens.opens, (opens.opens::decimal / deliveries.deliveries::decimal) as open_rate from opens join deliveries on opens.campaign_name = deliveries.campaign_name where opens.campaign_name in ( select campaign_name from latest_newsletters )
What is presented here comprises of three key steps.
Firstly, this SQL creates a temporary table that returns only the top ten most recent newsletters that you’ve sent. It does this by looking at the deliveries table and, for each campaign (grouped by the campaign name), returns the time of the last (
max) email that was delivered.
You can then order by this field and return just ten campaigns. It’s also important, for this example, to filter down just the newsletters and not other automated campaigns (since you’ll have data on all of your emails from Vero).
Secondly, this SQL creates temporary tables that count how many deliveries and opens there were per
campaign_name. Each of these temporary tables is separate. They return just the
campaign_name and the count for the type of interaction we’re looking at.
Finally, we combine the campaign_name and opens temporary tables to return the
campaign_name, the count of emails delivered and the count of emails opened. As part of this step we can do some maths to return a decimal number representing the percentage of emails that were opened.
That’s it! The final output is a table that you can use to create the graph above.
I charted my example using a bar chart to help compare absolute opens with absolute deliveries. I then used a line chart to highlight large peaks in open rate on a second Y axis. This gives a clearer perspective of each metric.
This post was just an appetizer. This is a very basic analysis of your email data and forms the basis for much more complex, useful and interesting analysis.
If this looks useful to you, check out Vero.
If you are a Vero user and want to learn more about how you can do this drop us an email and we’d love to talk about it.
Share your thoughts and ideas in the comments below! I’d love to see what sorts of analyses you’ve built!