The Data Journey: Unlocking data for the right now
We're all in the middle of a multi-step process to consolidate, model, and operationalize our data. Here's where we're getting stuck, and what to do about it.
Over the last couple of years, I've seen the same pattern emerge again and again at eCommerce retailers and other big companies that handle large volumes of data.
These companies are in the middle of a three-phase journey that usually takes at least 5 years, and which most haven’t completed. These are the phases:
- Consolidate
- Model
- Operationalize
Allow me to explain in more detail.
Phase 1: Consolidate
When you move to a microservices architecture, you generate tons of information silos, because every service has its own private repository. This is a pain when you want to start using the data in each service to better understand the business. It's especially hard when each service has a different database technology for its repository. You want to be able to merge information from multiple services into a unified repository so you can query all that data with less effort and less risk.
That's why the first step you take on this journey is to consolidate the data from all of these private data stores into a data lake, ideally in the cloud for scalability and flexibility.
When you bring all this data into the cloud, it theoretically enables other members of the business, especially at the top levels, to run analytics on all of the business's data without touching the private data stores. They get their analytics and they leave production alone.
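To make Phase 1 concrete, here's a minimal sketch of one common landing pattern, assuming Snowflake and Parquet exports to S3; the bucket, stage, and table names are all hypothetical, and your stack will differ:

```sql
-- Each service exports snapshots (or CDC batches) as Parquet files to
-- shared object storage; the warehouse ingests them from there without
-- ever touching the service's production database.
-- (In practice the stage would also need a STORAGE_INTEGRATION for
-- credentials; omitted here.)
CREATE STAGE IF NOT EXISTS raw.orders_stage
  URL = 's3://acme-data-lake/orders/'       -- hypothetical bucket
  FILE_FORMAT = (TYPE = PARQUET);

COPY INTO raw.orders
  FROM @raw.orders_stage
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;  -- map Parquet columns by name
```

Repeat per service, put it on a schedule, and you've got a consolidated (if still raw) picture of the business in one queryable place.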
Data lakes consolidate disparate repositories in the cloud, but lack the data models that help explain the business.
But that’s just a starting point, and most non-technical or even semi-technical people can’t make use of data in a data lake unless it’s well-modeled. “Data swamp” is a well-worn term because of this. So, the next phase of the journey is to take your raw, consolidated data and massage it so it actually represents the business. That’s called data modeling.
Phase 2: Model
Once you’ve got all your data in a unified cloud data lake or warehouse, you hire some data engineers to model and normalize it so that your analysts can run queries and build dashboards and have confidence that the data they are querying actually represents the reality of the business. You want to give the analysts the data products they need, and not let them burn through your monthly Snowflake credits in a day with spaghetti queries.
You usually want to implement this with a domain/entity organization, where every vertical domain is accountable for modeling the entities they own for the rest of the company. This fits well with the Data Mesh Principles that so many people are excited about these days.
The goal here is to guarantee a single source of truth for definitions and logic, and to add some level of automation and observability using tools like dbt. Given those goals, it's pretty easy to see why dbt is so popular right now.
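To make that concrete, here's a minimal sketch of a dbt model, with hypothetical model and column names, where the orders domain encodes the company-wide definition of an entity exactly once:

```sql
-- models/marts/orders.sql (hypothetical)
-- The orders domain owns this model; everyone downstream builds on it,
-- so "order total" is defined in exactly one place.
with orders as (
    select * from {{ ref('stg_orders') }}
),

payments as (
    select * from {{ ref('stg_payments') }}
)

select
    orders.order_id,
    orders.customer_id,
    orders.ordered_at,
    coalesce(sum(payments.amount), 0) as order_total
from orders
left join payments
    on payments.order_id = orders.order_id
group by 1, 2, 3
```

Because everything downstream references this model through `{{ ref('orders') }}` rather than re-deriving the logic, dbt can build the dependency graph, test it, and document it: exactly the automation and observability mentioned above.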
The goal of data modeling is to provide a single source of truth for the business to cost-effectively build reliable ML models and run accurate BI reports.
Now, let me be clear: I don’t know of any company that has finished this phase of the journey. Data modeling never stops, because data creation never stops. But there are a growing number of companies with really smart data teams that have mostly mastered this phase.
And when you get to that point of mastery, it's a huge leap forward.
You enable data scientists and ML teams to build models that can predict the future. And you give analysts and their executive stakeholders a single source of truth by which they can analyze the past.
But what about the present? What about the right now? This is Phase 3.
Phase 3: Operationalize
As I mentioned, many companies are still focused on Phase 2. But a growing number have matured their data practices to the point where they are moving on to this next phase:
"Ok, now I have our data modeled in BigQuery, and I feel good about it; but how do I build an application on top of that? How do I consume these metrics or insights in near real-time? How do I activate this data to change the business right now?"
If your company is like most companies that have invested heavily in the cloud data warehouse, you're probably having a hard time with this one.
Your first attempt might be to build a REST data service on top of your data warehouse. I've seen this so many times, and the results usually aren’t what you hope for.
The reason is simple: Data warehouses aren’t built for Phase 3. They’re great for long-running analytical queries. They're perfect for BI. And they’re not too shabby for AI/ML either.
But they're just not built for interactive use cases where speed and scale matter (a lot).
Data warehouses weren't designed to support fast user-experiences built on data.
User-facing and operational analytics that rely on fast-moving data have latency, concurrency, and data freshness requirements that simply cannot be satisfied by cloud data warehouses. That’s just not what they’re for.
So, there’s a gap in the “modern data stack” that’s keeping you from unlocking Phase 3.
What should you do?
You’ve made a tremendous investment in your data infrastructure. Phases 1 and 2 didn’t come easy. They took a lot of work, and they are accomplishments in and of themselves.
So nobody’s here to try to convince you to scrap Snowflake, or BigQuery, or Redshift, or whatever cloud data warehouse you’ve put your blood, sweat, and tears into.
And I definitely don’t think you should ditch your battle-tested data warehouse for some newfangled cloud data warehouse that claims to give you all the speed you need. What you gain in speed, you’ll likely lose in other important areas. Plus, I feel for anyone trying to make that migration.
But I also don’t think you should try to gut it out with the warehouse you have and make do with an architecture that, by design, will never satisfy the requirements that you’re pursuing in Phase 3.
Here’s what you should do: Branch out. Find new pathways to solve these new problems in a way that works apart from or in parallel with the data warehouse you’ve put so much into. Keep using the data warehouse for what it’s good for, but don’t let it hold you back.
This is where Tinybird fits. It’s made for Phase 3 of the Data Journey.
Tinybird exists so that data teams and developers can build applications that serve millisecond responses to complex analytical queries and deliver a fast user experience on the freshest possible data.
In pursuit of this, it does two things:
- It gives you a blazing-fast analytical database (ClickHouse) that can support complex analytical queries on streaming data, with low latency, at scale.
- It gives you a set of tools to publish and operationalize those transformations in a way that aligns with how developers build products: SQL, APIs, version control, testing, observability, serverless scale…
Tinybird exists so that data teams and developers can build low-latency, high-concurrency applications at any scale.
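To give you a feel for what that looks like, here's a minimal sketch of a Tinybird pipe, with hypothetical names and simplified syntax: essentially ClickHouse SQL over a streaming data source, published as an HTTP endpoint.

```sql
-- product_stats.pipe (hypothetical): a query over streaming page-view
-- events, published as a low-latency HTTP API.
NODE product_stats
SQL >
    SELECT
        product_id,
        count() AS views,
        uniq(user_id) AS unique_viewers
    FROM page_views
    WHERE timestamp >= now() - INTERVAL 1 HOUR
    GROUP BY product_id
    ORDER BY views DESC

TYPE endpoint
```

Your application then calls something like `GET /v0/pipes/product_stats.json` and gets fresh results back in milliseconds, with no warehouse in the request path.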
In the coming weeks and months, my coworkers and I are going to dig deeper into a few architectural approaches: Tinybird before, after, and in parallel to the data warehouse. We’ll explain the objectives, pros, and cons of each. The goal is to help you better understand where Tinybird fits in your stack, and even more broadly, to signal that it’s possible to build a modern (that is to say, fast) user experience without jury-rigging the a16z “modern data stack” or abandoning the warehouse entirely.
Here’s to unlocking Phase 3 of the Data Journey, and going fast in the process.