Synapse Delta Lakehouse and Power BI Enterprise Big Data Worked Example. Part 1: Start With the End in Mind
This series of blog posts covers Synapse Delta Lakehouse and an Enterprise Power BI architecture, working with 2.4 billion rows of NYC Taxi Data!
It covers many of the topics in the Microsoft DP-500 exam which you can take to become certified as an Azure Enterprise Data Analyst.
The series is split up into 4 blog posts:
- Part 1 (This post): Start with the end in Mind
- Part 2: Get Azure to do the heavy lifting
- Part 3: Leverage Enterprise Tooling in Power BI
- Part 4: Curate the user journeys
I also delivered a session on this for the UK Power BI User Group in July 2022 available on YouTube:
If you’re one of the 10k+ community users here at LDI, or even if you’ve just stumbled across this blog post, you’ll likely appreciate that big data really is big: we generated 64 zettabytes (1 zettabyte = 1 trillion gigabytes!) in 2020 alone, with Microsoft as the largest vendor in the global Big Data and analytics market (Source: BI & analytics software market value worldwide 2019-2025 | Statista).
But how you actually work effectively with billions of rows of data using tools such as Azure Synapse and Power BI is far from obvious. There are lots of features, all of which are changing at a rate where it is nearly impossible to keep up. There’s lots of new terminology (like Delta lakehouses and Datamarts!) and seemingly many different ways to achieve a similar outcome depending on your specific scenario and requirements.
In this series of blog posts, I want to highlight some techniques for taking 2.4 billion rows of NYC Taxi trip data, processing it through Azure and delivering valuable insights through Power BI with decent levels of performance.
This data volume – and the corresponding challenge of delivering insights with performance – is not atypical of what you would find in a large corporate. In our case, the data source is an Azure SQL Data Warehouse with a taxi trip fact view (vTrip) which links to multiple dimension tables:
So the data is already quite clean and well structured, and is also very detailed: for every trip it contains details such as the exact pickup/drop-off location (longitude/latitude), the weather conditions at each location at the time and fare breakdowns including tips, surcharges and tolls.
The natural inclination is just to bring all this data into Power BI and start to visualise the facts/dimensions to identify patterns and insights which you then show to your stakeholders. That’s what I used to do when I started as a Power BI analyst: I’d ask users what they wanted to see in a report – they usually either pointed me to existing reports to replicate or said they wanted to see everything, or I just got a completely different answer from everyone I asked!
I’ve since learnt to adjust my approach. I now focus on identifying specific business questions that the report needs to answer and work out the granularity (level of detail) of data, and corresponding data model, that I need before I start building anything in Power BI!
There are four key steps to my overall approach to working with this large data volume, each of which is covered in its own blog post:
1: Start with the end in mind (This post) – this means that you really need to consider your use cases and what kind of insights you need; you’re not going to be able to visualize billions of rows of data in a single visual (Power BI visuals are limited to 30k data points by default!), so you’ll need to identify an aggregated level that it makes sense to display the data at. The output of this step is going to be a conceptual view of your data model, which is the blueprint for any data shaping activities you need to do in Azure or Power Query. Note that nothing in this step is specific to any technology or tool – it’s all general data model design!
2: Get Azure to do the heavy lifting – our source system is SQL DW and whilst we could connect to it directly from Power BI and do our transformations there, that would be very slow and not that scalable with these data volumes, so we’re going to use Azure Synapse to do it instead! Synapse has features such as notebooks, mapping data flows, and copy activities in Azure Synapse pipelines (the Synapse version of ADF) which we can leverage to get our data into the exact format we need for Power BI.
3: Leverage Enterprise tooling in Power BI – this refers primarily to utilizing User-defined Aggregations in Power BI, where you can mix Import and Direct Query mode within the same model (Import mode is the default, where data is brought into Power BI; Direct Query is where data is left at source and queried on the fly), or use hybrid tables, which let you mix the two modes in the same table! This is very useful for working with large data. We also look at Power BI Datamarts, Incremental Refresh and features such as Auto Aggregations for Direct Query.
4: Curate the user journeys: This is now taking your data model as a shared dataset and building the reports on top of it, answering the business questions you identified in step one. As we have two modes of analysis here – one high-level, and one where we allow people to drill into the granular trip-level data (using the Direct Query table for drill-through) – we need to make sure this is reflected in our user interactions such as drill-through pages and bookmarks. For this particular report, I also had a scenario modelling tool with an embedded Power App (using what-if parameters in Power BI, with parameter values passed into Power Apps and written back to a data source that Power BI is connected to) so that potential taxi drivers could start to model how much money they would make under certain scenarios (location, number of trips per day/average duration, weather conditions etc.).
Let’s now consider the starting point, and the first question is not a technical one but rather a very practical one – who is the audience for this analysis? Only by answering this, by identifying specific people and understanding their problem statements and user stories (user stories are a common way to describe features in software development!), can we start to design an effective data model/report.
There could be a broad audience of people potentially interested in taxi trips taken in New York but, given how much it costs, the people with the most at stake are probably those considering getting a taxi medallion. I did a quick Google search on NYC taxis: it used to cost $1.3m for a medallion; now (with the arrival of Uber/Lyft etc.) it’s closer to $160k, but that’s still a huge amount of money! If you were thinking about buying a medallion, you’d want to know things like how long it would take to get your money back, how much money you can make per day under what conditions, and how you can maximise those fares.
This specific problem statement gives us a clear analytical focus. We know we need to analyse historical data on trips in order to establish which type of trips (e.g. short vs long) and which areas/conditions are the most profitable. This will be challenging given the volume of data and the need to distil this huge amount of data into insights!
So, hypothetically, let’s speak to some potential taxi drivers and find out what kind of insights they’re looking to get. I like to capture these in a simple scoping template (a Word template):
You can download an example of the Scoping template here
From our hypothetical interviews, we have discovered that these potential drivers have questions such as:
- Is it better to do a few long trips or lots of short trips per day?
- What time of day is it best to take trips?
- How does the weather affect the trip amount – e.g. do taxi drivers earn more on days when it is raining?
All of these things will affect the amount of money you can earn as a taxi driver!
Once we have identified the key business questions we need to answer, we can then look at the kind of measures that would help us answer those questions, and what we would need to view and filter those measures by in order to answer them. Those of you familiar with Data Warehousing concepts will recognise this as identifying facts and dimensions (fact tables are those tables with numerical values and keys/links to dimension tables, which contain attributes), but there’s no need for technical jargon here – we want to do this exercise with non-technical users and so we stick to using accessible language.
Next, you need to document your data. For each field, detail an example value and a description (this can often come from source system documentation, if it exists!).
It is important to have this documentation early in the project. It can be used as part of the metadata for your data model which makes it far more usable as a shared dataset. It also helps with your own process of getting familiar with the data and validating your understanding of the data with stakeholders.
Now consider what numerical data points you have by what grain of attributes (the grain being the level of detail/the attributes that we captured our numerical data by). We can use the traditional method of a bus matrix for this, where we plot our numerical fields against the attributes, and the level of detail of those attributes, that we have those numerical data points by.
For example, here we have the trips by date and time (the lowest granularity), but perhaps in a different dataset we might have only captured them at a higher grain (e.g., number of trips per day, week, month or year). If we have something at a lower grain, we automatically also have it at a higher grain – e.g., if we have the date of trips, we also have each trip by the corresponding week, month, quarter, and year.
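As a quick illustration of that roll-up (and a preview of the kind of data shaping we will do in a Synapse notebook in Part 2), here is a minimal PySpark sketch that derives the higher grains from a trip-level datetime column; the DataFrame, path and column names are just assumptions for illustration:

```python
from pyspark.sql import functions as F

# Sketch only: `trips` is an assumed trip-level DataFrame with a hypothetical
# `pickup_datetime` column; in a Synapse notebook the `spark` session already exists.
trips = spark.read.parquet("/placeholder/path/to/trips")

# If we have the exact pickup date/time, every higher grain comes for free.
trips_with_grains = (
    trips
    .withColumn("trip_date", F.to_date("pickup_datetime"))
    .withColumn("trip_week", F.weekofyear("pickup_datetime"))
    .withColumn("trip_month", F.month("pickup_datetime"))
    .withColumn("trip_quarter", F.quarter("pickup_datetime"))
    .withColumn("trip_year", F.year("pickup_datetime"))
)
```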
We can use the Bus Matrix (a key tool in the traditional Data Warehousing design approach known as the Kimball method), or my inverted version of it below, to plot numerical values against attributes/the grain of attributes:
This shows the various measures/numerical values along the top, and the various dimensions/grain of dimensions along the side; here we can see for each trip all the details we have – e.g. the date and time (which can be aggregated into month/year) and the longitude/latitude of pickup and drop off location (which can be aggregated up to a higher grain of city/state for example).
This type of analysis is particularly useful where we have different datasets with numerical values and we want to consider whether we want to combine them into a single fact table or keep them as separate fact tables. Having more than one fact table in our data model is an entirely valid implementation of a star schema but it makes our model slightly more complicated/harder to understand so we only want to do it if we want to keep the fact tables at different grains in order to allow for different levels/types of analysis.
For example, we might have both actual sales data as well as budgets/forecasts, though it is quite likely that our budget/forecast sales are at a higher level than we capture our actuals – e.g. we might forecast our sales by month rather than going into the individual detail of how many sales we expect on each individual day. When we want to bring both our actuals and our budget/forecast data into the same data model, we could either aggregate our actuals data by month to bring it to the same grain as our forecasts (but doing so would lose expressivity – we would be unable to then see our sales data by day!), or we can keep actuals and forecasts as two separate fact tables. By doing this, when we are viewing by month as a high-level view then we can compare both actuals and forecasts in the same visuals; when we drill down into days we can either apportion our budgets into a day level using DAX or only show actuals when we’re at the day level.
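To make that design choice concrete, here is a hedged PySpark sketch of the two options, with hypothetical table and column names such as sales_actuals_daily and sales_amount: rolling daily actuals up to month so they match the forecast grain, versus keeping two separate fact tables at their natural grains (leaving any day-level apportioning of budgets to DAX in the report layer).

```python
from pyspark.sql import functions as F

# Hypothetical fact tables: daily sales actuals and monthly forecasts.
actuals = spark.table("sales_actuals_daily")      # grain: one row per day
forecast = spark.table("sales_forecast_monthly")  # grain: one row per month

# Option 1: roll actuals up to month so both facts share the same grain
# (easy to compare, but we can no longer analyse actuals by day).
actuals_by_month = (
    actuals
    .groupBy(F.trunc("sales_date", "month").alias("sales_month"))
    .agg(F.sum("sales_amount").alias("actual_amount"))
)

# Option 2: keep two separate fact tables at their natural grains and relate
# them through shared dimensions (e.g. a Date/Month dimension) in the model.
actuals.write.mode("overwrite").saveAsTable("fact_sales_actuals")
forecast.write.mode("overwrite").saveAsTable("fact_sales_forecast")
```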
Applying this to our NYC taxi data, we only have one dataset (at a very granular level of detail!) but we can still have two fact tables at different grains in our data model. The advantage of doing this would be that we could have one fact table in Import mode at a higher grain (less detailed) which could serve most of our reporting needs with high performance (as it is leveraging the Vertipaq engine, the Analysis Services engine behind Power BI where data is held compressed in a format optimised for querying), and one detailed trip-level fact table in Direct Query (where Power BI fires off a query on the fly back to the data source to bring back the specific trip records we need) that we only use when we drill into the details of a small number of individual trips. This is the pattern of User-defined Aggregations that we will explore in a subsequent blog post, though for this exercise we just want to plan what level of detail we want in our fact tables so that we know what data preparation steps we require.
To determine the level of detail we need in our fact tables, we need to critically review the data based on what users actually want to see. If a user is looking at 2.4 billion trips, it likely doesn’t make sense to try and plot them all on a map as a visual can’t display this many points! So for Geography an aggregation level of state is probably all we need for our overview analysis.
Similarly, when looking at the weather, are the amount of rain in inches and the exact temperature for each trip really relevant, or do we just care about whether, say, it is raining or not, or whether it is hot or cold? That level of categorisation would actually make the data more useful, as it would allow us to understand the relationship between the weather and fares/trips. The fact that those categories don’t currently exist in our dataset doesn’t matter, as we can create them as calculated (conditional) columns. Remember that whether we need calculated fields, and which calculated fields we need in our dataset, is something that we want to identify at this stage so that we can create them as part of our Extract, Transform & Load (ETL) process.
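As a sketch of how those conditional columns might be created as part of our ETL in a Synapse notebook (the thresholds and source column names such as precipitation_inches and temperature_f are assumptions for illustration, building on the earlier sketch):

```python
from pyspark.sql import functions as F

# Sketch only: derive weather categories as conditional (calculated) columns.
# Source column names and thresholds are assumptions for illustration.
trips_categorised = (
    trips_with_grains
    .withColumn(
        "rain_category",
        F.when(F.col("precipitation_inches") > 0, "Raining").otherwise("Not raining"),
    )
    .withColumn(
        "temperature_category",
        F.when(F.col("temperature_f") >= 75, "Hot")
         .when(F.col("temperature_f") <= 45, "Cold")
         .otherwise("Mild"),
    )
)
```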
An updated version of the Bus Matrix with these aggregations looks something like this:
This takes our data down from 2.4 billion rows to only 1.8 million rows! By aggregating our data by categories such as month, by zip code/state (rather than precise pick-up location), and by time buckets (rather than individual times), we have significantly reduced data volumes without really losing any level of expressivity that would be useful for our overview trips analysis. When we’re looking at billions of trips over 13 years, viewing them by month and the categories that we have defined is likely granular enough. And if our calculations are being done against a data model that is only 1.8 million rows, it will be much faster than trying to aggregate on the fly against the detailed trip data. Note that our analytical users are likely to still want to be able to see the individual transactions/trips, and we will give them the ability to drill down into a subset of trips where they can see all of the trip detail in a table. This way, we can keep all of our users happy and ensure good performance even with large data volumes!
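For illustration, a simple group-by along these lines is all that is needed to produce the aggregated fact table. This is sketched here in PySpark for a Synapse notebook; the grouping columns such as pickup_state, time_bucket and duration_bucket are assumed to have been derived in the same way as the earlier category columns.

```python
from pyspark.sql import functions as F

# Sketch only: aggregate the trip-level data up to the grain identified for the
# overview analysis. Column names carry over from the earlier sketches.
agg_trips = (
    trips_categorised
    .groupBy(
        "trip_year",
        "trip_month",          # month rather than exact date/time
        "pickup_state",        # state rather than longitude/latitude
        "time_bucket",         # e.g. Morning / Afternoon / Evening
        "duration_bucket",     # e.g. 5-15 mins, 15-30 mins
        "rain_category",
        "temperature_category",
    )
    .agg(
        F.count("*").alias("number_of_trips"),
        F.sum("fare_amount").alias("total_fare_amount"),
        F.sum("tip_amount").alias("total_tip_amount"),
    )
)

# Write the much smaller aggregated fact table out (Delta format assumed).
agg_trips.write.mode("overwrite").format("delta").saveAsTable("fact_trips_aggregated")
```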
Now that we know we need two fact tables, both at different grains (but with several common dimensions), we can design what our data model will look like. One way to do this would be to take/mock up a small sample of the data (e.g. in Excel) and bring it into Power BI, creating the relationships as required and then using the model diagram as our reference Entity Relationship Diagram (ERD).
However, as data model design is a generic process that can be applied in any BI application, let’s keep this as a non-technical exercise which we can do on a whiteboard or mock up in Word/PowerPoint!
Starnets are diagram views of our data model that allow us to see which are the key dimensions/attributes that we have linking to our fact tables.
We will have a separate diagram for each of the “stars” in our star schema – i.e. for each combination of dimensions and fact table. Let’s start with our aggregated fact table (the table with our number of trips and total fares by the key attributes/dimensions needed for high-level analysis). First, we just draw a circle and put our key measures/numerical values in it:
For our aggregated table, we don’t have the specific fare/tip amounts for each trip but rather we just have the number of trips, the total fare and the total tip amount calculated by all of our categories/dimensions. There were also other numerical values in our underlying data (e.g. surcharges and tolls) which we’re not bringing in because we’ve not identified them as specifically relevant to our analysis of how much money taxi drivers can make. (Note that in reality this might actually be relevant – that’s why we need to do the up-front exercise of working out what is and isn’t relevant to the analysis we need to do!).
Next we draw out branches for each of our dimensions. For example, one of the fields in our aggregated dataset is Duration Bucket (a category of how long the trip was, e.g. 5-15 minutes, 15-30 minutes):
On the line we can now add the key attributes of this dimension that we need in our table. In this case it is only one – the duration bucket itself:
We then repeat this for all our other dimensions and key attributes. Where our attributes represent a hierarchy (e.g. dates) we denote these with a circle rather than a line. Our completed starnet diagram now looks like this:
We can then do another starnet diagram for the detailed fact table – here in the circle in the middle we put the numerical values that we have against the individual trips (e.g. trip amount, fare amount).
There will be a number of tables, such as Weather and Time, which are only relevant dimensions at this level of granularity (remember that granularity refers to the level of detail that we have captured our numerical data by – defining the grain is the first step in the Kimball methodology!) – i.e. they won’t appear in our aggregated starnet because they represent a level of detail that we’re not capturing in our aggregated view. For all of our new category dimensions (e.g. Rain/Temperature Category) we also want these on our detailed starnet – our detailed fact table will still contain these categories as calculated columns so that we can use them as a common dimension for analysis across both our detailed and aggregated views.
This might seem like duplication – e.g. to have both a Time dimension table that has all the individual times of the day in it, linking to our detailed fact table, as well as a Time Category dimension table that has just the time categories (e.g. mid-morning, late afternoon) and links to both our detailed and aggregated tables. Indeed, Time Category is just an attribute of Time, so we could have it as a field there, but that would create a many-to-many relationship between our Time dimension and our aggregated fact table, which would likely have a negative performance impact with this volume of data. It is better to have both dimensions, one linking only to the detailed fact table, and one linking to both.
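As a small sketch of that design choice (again with assumed column names carried over from the earlier examples), both time-related dimensions can be derived from the detailed data, with the full Time dimension relating only to the detailed fact table and the Time Category dimension relating to both:

```python
# Sketch only: build the two time-related dimensions from the detailed data.
# Column names (pickup_time, time_bucket) are assumptions for illustration.
dim_time = (
    trips_categorised
    .select("pickup_time", "time_bucket")  # full Time dimension: detailed fact only
    .distinct()
)

dim_time_category = (
    trips_categorised
    .select("time_bucket")                 # Time Category: links to both fact tables
    .distinct()
)
```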
These starnets now tell us what level of aggregations we need to apply (which we could do with e.g. a Group By in SQL or Power Query), what calculated fields we need, and what the relationships will be between the resulting tables. It is much better to design this up-front rather than trying to work it out once our data is all loaded into Power BI.
We can also use these Starnets to check that our resulting data model will allow us to answer the business questions that we have identified. Going through the questions from our scoping template, we can check that we have captured all the right descriptive attributes and numerical values to be able to answer those questions, e.g. “Is it better to do lots of short trips or fewer long ones?” – we can answer this because we have the total fare/tip amount by trip duration. To really get to the bottom of that specific question, we would need to factor in how many short vs long trips a potential taxi driver could do in a day – we will do this as part of our scenario modelling as the last topic, but for now, given that we’ve designed our data model, let’s get to the actual process of ingesting and shaping our data with Azure Synapse Analytics!