There are different ways to write about the modern BI stack. a16z wrote a great piece at the architectural level, showing how the BI stack has changed and how the new world has a place for different (and not so different) approaches: Emerging Architectures for Modern Data Infrastructure
Justin Gage recently did an excellent summary of the different toolsets of a modern BI stack: Technically: the analytics stack. And Alexandra Abbas impressed me with her roadmap to becoming a data engineer, covering the vast landscape of a BI stack: GitHub - datastacktv/data-engineer-roadmap: Roadmap to becoming a data engineer in 2021
So what’s my take? I like opinionated products. They don’t try to come up with an approach that works for most people; they have their own distinctive style. If you love it, great: you will love the product. If you don’t, stay away. Basecamp is still the best example of this (even though I don’t use the product, simply because I don’t work like that).
How I work with data is the topic of this series. These guidelines are simply based on the way I like to work with data. If you don’t want to work like that, no problem, just skip the posts; they will not resonate. If you like a bit of how I work with data, give it a try.
Here are my principles of how I work with data:
Centralize all transformation, mapping, and business logic in one place where you have full control of the data. In modern BI stacks, this place is usually the data warehouse, and the data model is applied there. Using tools like dbt or Dataform, or coming up with home-brew modeling (I am still dreaming of a dbt based on Pandas), lets you manage all complexity in one place.
This approach frees up space to keep the integration simple (extract, load, transform: ELT) and to let visualization (“BI”) tools do what they do best: visualize, not model data.
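To make the idea concrete, here is a minimal sketch of that “dbt based on Pandas” dream: every transformation is a registered, named model function, so all business logic lives in one place instead of inside the BI tool. All names here (the registry, the models, the columns) are hypothetical illustrations, not an existing library.

```python
import pandas as pd

# Hypothetical mini "dbt on Pandas": each model is a function that
# takes upstream data and returns a transformed DataFrame.
MODELS = {}

def model(name):
    """Register a transformation under a name, like a dbt model file."""
    def register(fn):
        MODELS[name] = fn
        return fn
    return register

@model("stg_orders")
def stg_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Staging layer: rename columns and fix types in one central place.
    return raw_orders.rename(columns={"amt": "amount"}).assign(
        order_date=lambda df: pd.to_datetime(df["order_date"])
    )

@model("revenue_by_month")
def revenue_by_month(staged: pd.DataFrame) -> pd.DataFrame:
    # Business logic lives here, not in the visualization tool.
    return (
        staged.groupby(staged["order_date"].dt.to_period("M"))["amount"]
        .sum()
        .reset_index(name="revenue")
    )

# Usage: raw data in, modeled tables out.
raw = pd.DataFrame({
    "amt": [10, 20, 5],
    "order_date": ["2021-01-03", "2021-01-20", "2021-02-01"],
})
staged = MODELS["stg_orders"](raw)
report = MODELS["revenue_by_month"](staged)
print(report)
```

The point of the sketch is the structure, not the code: downstream tools only ever see `revenue_by_month`, so the mapping logic has exactly one home.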
Everyone is talking about self-service in BI. Yes, I can understand that BI teams don’t want to become the dashboard factory. But calling for self-service BI is no solution.
Why? Consider the software world: it takes a full-fledged DevOps setup to enable software engineers to spin up proper servers to test their current development branch.
Likewise, it takes a full-fledged data engineering, ops, and governance setup to enable self-service. We are talking about a job for at least 5-6 people. You don’t have that? Forget about self-service BI.
Maybe we can do Kafka; what about Flink? Apache Beam sounds excellent; perhaps a data lake is still a cool thing.
Each week, a new service appears in the BI stack, and honestly, it looks pretty shiny. Try me out, it calls. We love this call since it goes straight to our nerd hearts. We live to try out new tech.
But this is something for the sandbox. For Friday afternoons, the next weekend project. Nothing for production.
Start simple, stay simple.
The best BI stack is useless if it does not provide insights that bring the business more revenue or lower costs. And it’s a long way from a data table to more revenue.
If you can’t help other teams to make better decisions with data, you are doing art and not a data-driven business.
So, every BI stack should be oriented around outcomes. Listen to the product, marketing, dev, HR, operations, and sales teams. What kind of data lets them make progress? Provide that data, measure the progress, and move on to the next item.
Only when you achieve wins, invest in scalability. Not before.
For me, data engineering should apply the good stuff we know from software development: version control, code review, automated testing, and CI/CD.
But (see the pragmatic principle above) it takes time to establish these practices, since they often require at least a small team.
These are the concepts I try to apply to each BI stack project I approach. Sometimes it works well; often, it needs some refinement. But this is my opinionated approach to BI stacks.
Next time, I will write about why the Google Cloud Platform is the best choice when you have no time to benchmark.