Combine Product analytics with Subscription data - Part 1
When working in product analytics, I had one painful experience.
Occasionally, I chatted with the people from BI, and they told me about this magic thing called Data Warehouse. It sounded like a paradise—the vast amount of information and data from all these systems that were out of reach for me.
Thanks for reading timo's substack! Subscribe for free to receive new posts and support my work.
"Well, we do a lot of reporting with the subscription data, you know. The management loves these MRR dashboards."
To be honest, I was deeply jealous about this. I also wanted to create something that the management loved. They only asked me to tell them what features are helpful.
That was one of the reasons why I was always on the lookout for ways to enhance product analytics data. I tried them all - server-side SDKs and proxing webhooks. It got me enriched data, but it was tedious and not really a controllable way to move forward.
My goal was clear: I wanted the Data Warehouse experience for product analytics. So, I went for extreme measures and even switched to the BI side, building modern data stacks. Here, I tried to find ways to do event data modeling (which was already complicated) and finally found a way to do efficient sequence analysis and discovery (something nastily slow with SQL).
It took four years until the light was at the end of the tunnel. I had a formalized way to handle event data modeling (activity schema), and finally, the first product analytics tools in the data warehouse arrived on the scene. And the big solutions like Mixpanel followed with their data warehouse offering. Finally, I could do the cool things the BI people did before.
Obviously, I wanted to start with subscription data (I wanted some love).
What do we want to achieve?
In this project, we combine behavioral product data of a whiteboard application with the subscription data we have already pulled in our data warehouse using Fivetran (I just picked it because I assume this is the most common integration - but it has issues; more on this later).
The behavioral data is already in Mixpanel. The subscription data is stored as snapshots in our data warehouse. Our task is to bring both of them together in Mixpanel. Why in Mixpanel? I explain in a second.
Finally, we want to create our core metrics and investigate the impact of behavior on subscription-related activities like subscriptions created or canceled.
All data in the Data Warehouse or Mixpanel?
This topic alone could be a substack post. But we keep it shorter here for the sake of readability.
First of all, both ways work.
The natural first instinct would be to pull everything into the data warehouse. This is the classic bi-driven modern data stack approach. All data sources come together in one place to be transformed and modeled.
If we would only be interested in reporting metrics and some dimensions, this would be no problem. We would use the subscription data to calculate all subscription-relevant metrics and the mix panel data to calculate all product-related metrics.
But I want to go a step further. I want to analyze the sequences that include behavioral and subscription events because first, I want to understand better what activities signal a subscription activity (like created or canceled). Second, based on that, I can support the customer success and growth teams with the right audiences for specific communication tests.
And yes, I could do this with SQL as well - but to be honest, I can spend my time far better - this is a usual SQL query for just a simple funnel analysis:
https://medium.com/cube-dev/sql-queries-for-funnel-analysis-35d5e456371d
That's why we combine everything here in Mixpanel since it unlocks more possibilities for analysis. In this way, we can show the MRR metrics and investigate how product behavior influences subscription revenue. In Mixpanel, we have the opportunity to explore funnel, retention, and cohorts to identify the group of users that have a significant impact on MRR growth.
And before we can do this, we need the data to be in event shape. However, not all data comes that way, so we need to eventify our Stripe data first.
Eventify Stripe data
To be fair, there is a way to get Stripe event data. Stripe offers an event API endpoint where you can get an extensive log of all activities within Stripe:
https://docs.stripe.com/api/events
But if you use Fivetran to load Stripe data in your data warehouse, you won't get the event data (if you use Airbyte, you can get it). So, we picked the Fivetran example here since it is a good exercise for learning how to transform snapshot data into event data. A process that I call Eventify and described here:
In a nutshell, we look for timestamps and identifiers. Everything that has a timestamp is a candidate for an event. So, this can be a created_at column in the production database table combined with a user and/or customer ID.
So what do we get? The starting point for all of this is naturally the Fivetran docs about the Stripe integration:
https://fivetran.com/docs/connectors/applications/stripe
From there, I want to check the schema:
This helps me to look for potential event candidates.
Fivetran, for example, uses history tables for some integrations. History tables are a great start to finding event data since they materialize the activities' results. Here we have a subscription history table:
So, we keep this in mind for our event sourcing.
Invoices are another good candidate since they are usually created when a subscription is renewed, which leads us to another event.
In the end, we look for occurrences of timestamps and identifiers.
Let's start eventifing
First "subscription created" - the essential event we can use to calculate the New MRR metric.
We have two places to get it: the subscription history and the invoices table. Both should hold the information about when a subscription has been created. I pick subscription history since it is closer to the object.
Looking at the subscription history table, I can find the `start_date` and `created` fields. That looks good. Now, we need to understand the difference. Luckily, Stripe has one of the best-documented APIs around:
It comes down to how we have handled subscriptions so far. We could not use created if we had backfilled subscriptions when we migrated to Stripe. In general, start_date looks like the better pick.
Now we need an identifier - the best candidate is naturally `customer.` But this is Stripe's customer ID, which is likely not what you use for your product. So, you might need to join/map this with your application account/user ID.
To create this event, our query would look like this:
Here, we get the information from the history table that is relevant for us, and we build a row number rank because we are just interested in the first instances of all subscriptions.
In the next step, we get more context information by joining subscription_item and plan tables. We can use this context later to calculate the MRR and break it down according to the different plans. Here, we also make sure to pick only the first item.
In this final step, we bring it all into the schema of the ActivitySchema. This helps us model all events similarly and easily combine them in one table if needed.
`Subscription renewed` - should be triggered whenever the subscription cycle ends and is renewed (automatically).
It might also be possible to get this information from the subscription history table since the current period, start and end, is noted here, and a new entry will be created with each change. But here I went for the invoice table, assuming every subscription renewal creates a new invoice. This needs to be checked with your data - I check different samples to get a good idea about the idea and then decide about the table I use.
The model looks like this:
Again, we build a row_number to exclude the first invoice generated when the subscription was created.
Here, we assume that all invoices are related to the subscription. This can be different in your setup.
The final step is again to bring everything into ActivitySchema.
You can find the entire repository with the data model here: https://github.com/deepskydata/demo_stripe_activities
With our first events created, it's time to combine the data with our product data in Mixpanel.
We will continue this in Part 2 (I know, brutal cliffhanger).
This post is sponsored by Mixpanel. For me, the simple reason is that I use Mixpanel's Data Warehouse sync in most of my projects. As described above, I want to combine data warehouse data with product data (and I even now sync the product events from the data warehouse), but I like my usual analyst experience. This is the reason why I use this combination so often.
You can learn more here (https://link.timodechau.com/mixpanel-s)
Thanks for reading timo's substack! Subscribe for free to receive new posts and support my work.