Design a Data Model
Overview
When data is imported into Power BI, the resulting data model sits at the heart of the report. It determines both the kind of analysis that can be carried out and how well that analysis performs.
One of the key advantages of Power BI is that it is a model-based tool: each report contains the functionality to build a cube-like model which, if designed correctly, is capable of performing lightning-fast queries on billions of rows of data.
The industry standard for designing a BI data model is based on the Kimball Group methodology of dimensional modelling. This topic walks through how to design this kind of data model based on business requirements.
Designing a data model starts not with the data but with the business questions that the reports need to answer.
Reports need to be specific to an audience, ideally to named individuals, who have specific challenges (described as problem statements) and questions they would like the report to answer.
Identify these questions and, for each area of analysis, think about (and document as part of stakeholder workshops/meetings):
- The KPIs or measures that would answer the questions.
- What those measures would need to be sliced and diced by in order to answer the questions.
This is intended to be quite a high-level analysis and, whilst it needs to be done with a view to the type of data available, a 'blue-sky thinking' approach should be taken where possible. Try not to be overly restricted or influenced by existing reports or analysis.
Below is an example of a scoping template I designed that allows you to capture this information. This particular example relates to the 360 Giving Challenge though similar examples can be found in the Example Report Topic.
Analyse the Data Fields
After completing the initial scoping exercise, look in more detail at the data and consider the real-life activities which it represents – e.g. a sales transaction, a trade, the posting of a journal or a stock-take.
Identify the numerical data points that are captured as part of that process. For a sales transaction this might be the sales price and quantity purchased; for a trade it might be the amount bought/sold and the price at which the transaction occurred. Note that these numerical data points (called ‘facts’) should be directly relevant to the business process. There may be other data which is interesting though not directly related (e.g. the salary of the sales person who processed the transaction); these can be noted but not considered as part of this ‘fact’ capturing activity.
At least one of these facts should be additive (i.e. it is meaningful to add values together – for example, sales amount). Semi-additive facts (e.g. balances) and non-additive facts (e.g. unit price) are also useful to identify, though they will require different aggregations and potentially different data modelling techniques to make them work correctly.
The business process should initially be identified at the most transactional and granular level possible. For example, in a retail sales scenario, the attributes around each individual sales transaction could include the date and time at which the sale occurred, the store/location in which the sale took place, the salesperson, the customer, the product that was purchased (including its price at the time of sale) and the quantity purchased.
Consider the timeframe that you want to show data over, and then consider whether data needs to be analysed at this level of granularity or not. For example, it is unlikely that you will need to drill into the individual date and time of each sale when you are looking at ten years' worth of sales data; however, it might be appropriate if the report is focused only on recent sales.
If you wanted to bring in all ten years of sales data, consider the possible aggregations you could utilise to reduce data volumes (which will make the report faster). For example, you could aggregate total sales by day/week/month, by customer segment, by product category or by any combination of these.
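As a minimal sketch of this in the query editor (Power Query M), assuming a hypothetical Sales query with Date, ProductCategory and SalesAmount columns, a grouped aggregate might look like this:

```
// Aggregate a detailed Sales table to one row per date and product
// category, reducing the row count before it is loaded into the model.
let
    Source = Sales,  // assumed existing query with Date, ProductCategory, SalesAmount
    Grouped = Table.Group(
        Source,
        {"Date", "ProductCategory"},
        {{"TotalSales", each List.Sum([SalesAmount]), type number}}
    )
in
    Grouped
```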
The detailed transactions (i.e. by date/time, for each individual customer and for each individual product) could then be brought into a separate report, which only considers a more recent period, or potentially brought in using Direct Query using Aggregations.
There are several technical features in Power BI to help you choose the right architecture for your data model. However, this decision should be based on the needs of the end users of the report rather than on the technology choices available.
Data such as dates, customers and products are business entities in their own right; they have attributes of their own – e.g. a date has an attribute of day of week, month, year, as well as maybe fiscal year and a flag indicating whether or not it is a working day. Customers also have attributes such as name, address, or date of birth.
As such, these data blocks are described as business dimensions – they are the data columns that you group and filter by and, as explained above, can be at different levels of granularity. They represent what the users will be able to ‘slice and dice’ the numbers in the report by.
In contrast, the columns of data with numerical values, such as sales amount, which you want to aggregate in your report are known as facts.
Conceptual Modelling
The various data points can then be represented by drawing a simple diagram showing the dimensions as branches coming out of a circle, with the facts (the measures identified as part of the scoping exercise) in the middle, as in the diagram below.
Each branch represents a dimension, and within each branch the individual points represent attributes of that dimension. Date is almost always one of the dimensions, along with some form of customer/user and product/service.
For the product branch, the attributes might include description, product category and colour. Having more attributes (columns) here will allow for richer slice-and-dice analysis within Power BI, though attributes shouldn't be included if they haven't been identified in the scoping exercise.
Even with very detailed/granular data, the attributes should be carefully chosen according to what you want to aggregate by in your report. For example, even though the data might be at a date level, you might want to show total sales by month (perhaps with the functionality for users to drill down into each date within a month), so it is important that the month attribute is captured as part of this model design.
Whilst this exercise is in no way technical (and should be done with end users of the report regardless of their technical level), the resulting diagram – known as a starnet – represents a conceptual design of the data model.
The validity of the conceptual model should be judged in reference to whether it can answer the questions previously identified: would the dimensions/attributes allow you to plot graphs/tables (whatever they may be) that would answer those questions?
By specifically calling out which columns of data you need, either as a fact column to be aggregated or as a dimension attribute, you are – just as importantly – defining what you don't need. Any column which is not included on this diagram is by definition not required in the data model and can be removed in the query editor. Removing unnecessary columns will significantly improve the performance of the resulting model.
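In the query editor this is a single step. A minimal sketch in Power Query M, assuming a hypothetical SalesSource query and illustrative column names:

```
// Keep only the columns identified on the starnet diagram;
// everything else is dropped before the data is loaded into the model.
let
    Source = SalesSource,  // assumed existing query
    KeptColumns = Table.SelectColumns(
        Source,
        {"Date", "CustomerID", "ProductID", "SalesAmount", "Quantity"}
    )
in
    KeptColumns
```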
This diagram defines the blueprint for how your data is to be shaped in Power BI – i.e. which tables you should have and which columns should be in each table.
Star Schema
For any given starnet, there will be one fact table – the central table that contains the numeric fact columns and links to dimensions – and one table per dimension/branch containing columns for each attribute indicated on the diagram.
This type of model is known as a star schema – one (or more) fact tables and multiple dimension tables. For an excellent description of what a star schema is see the Microsoft Docs article.
It is important that the dimension tables contain a unique identifier for each row, at the chosen level of granularity. For example, if the grain of the data model is at an individual customer level, then each row of the customer dimension table should represent one customer, with columns for their various attributes. The unique customer ID should be included as a column in this table.
These IDs typically already exist when your data source is a relational database (e.g. SQL Server); where they don't exist, you can create your own key either by concatenating the data columns which together make the row unique or by adding an index column in the query editor.
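A minimal sketch of both approaches in Power Query M, assuming a hypothetical StoreDimension query where StoreName and Region together make a row unique:

```
// Option 1: build a key by concatenating the columns that together
// make each row unique.
// Option 2: add a sequential index column to act as the key.
let
    Source = StoreDimension,  // assumed existing query
    WithConcatKey = Table.AddColumn(
        Source,
        "StoreKey",
        each [StoreName] & "|" & [Region],
        type text
    ),
    WithIndexKey = Table.AddIndexColumn(WithConcatKey, "StoreIndex", 1, 1, Int64.Type)
in
    WithIndexKey
```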
The fact table also contains these keys, and it only needs to hold the dimension key value, not all the attribute columns. So, for example, the fact table would contain the customer ID for each sales transaction, and the customer dimension table would contain one row per customer/customer ID along with all of its attributes.
In this way, the fact table is related to the dimension tables in a 1:many relationship – the dimension values or keys (e.g. customer ID, product ID) are repeated many times in the fact table (e.g. for each order on each day) but only appear once in the corresponding dimension table.
A common dimension table in most models is the date dimension, also known as a calendar table, which has one row for each date, continuously covering the period represented in the data model.
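A calendar table can be generated entirely in the query editor. A minimal sketch in Power Query M, assuming illustrative start and end dates:

```
// Generate one row per continuous date, then add the attribute
// columns (year, month, day of week) used for slicing.
let
    StartDate = #date(2015, 1, 1),   // assumed earliest date in the model
    EndDate   = #date(2024, 12, 31), // assumed latest date in the model
    DayCount  = Duration.Days(EndDate - StartDate) + 1,
    Dates     = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
    AsTable   = Table.FromList(Dates, Splitter.SplitByNothing(), {"Date"}),
    Typed     = Table.TransformColumnTypes(AsTable, {{"Date", type date}}),
    WithYear  = Table.AddColumn(Typed, "Year", each Date.Year([Date]), Int64.Type),
    WithMonth = Table.AddColumn(WithYear, "Month", each Date.MonthName([Date]), type text),
    WithDay   = Table.AddColumn(WithMonth, "Day of Week", each Date.DayOfWeekName([Date]), type text)
in
    WithDay
```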
Where there are multiple attributes to the dimension value (e.g. day of week, month and year are all attributes of the date dimension), these columns will naturally contain repeated values; each year value will be repeated 365 (or 366) times in the same column. Whilst this can be avoided by promoting, say, the year attribute to its own dimension (putting years in a separate table and joining it back onto the calendar table as a 1:many), this is generally unnecessary: repeated values are not an issue in Power BI due to the way that the data model engine works.
Note that if your data source is a relational database then it is likely that each attribute is split into its own table – e.g. you might have a separate table for customer addresses, and perhaps even a separate table for the locations which are referenced in customer addresses. This type of model works well for a transactional system but typically slows down a BI data model. For a BI data model, it is better to combine the attribute tables into a single dimension table.
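A minimal sketch of this flattening in Power Query M, assuming hypothetical Customers and Addresses queries related on an AddressID column:

```
// Merge the separate address table into the customer table so that
// the model has a single, denormalised customer dimension.
let
    Source = Customers,  // assumed existing query with an AddressID column
    Merged = Table.NestedJoin(
        Source, {"AddressID"},
        Addresses, {"AddressID"},
        "Address",
        JoinKind.LeftOuter
    ),
    Expanded = Table.ExpandTableColumn(
        Merged, "Address", {"City", "Country"}, {"City", "Country"}
    )
in
    Expanded
```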
Data Source as a single table
When your data source is a single wide table, whilst it is possible to bring the table as-is into Power BI and use it to create your visuals, it is still better to use it to create a star schema. This is achieved by identifying those data columns which can be grouped together and/or promoted to a dimension.
How do you decide which columns should be a dimension and which should be kept in a fact table? The fact table(s) should be as slim as possible, so any descriptive attributes should generally be promoted to their own dimension table where possible. Only dimension keys and numerical values to be aggregated should remain in the fact table.
You can identify the dimensions by looking at which fields relate to business entities – e.g. customers, products or dates, with each entity having its own dimension table. Even if your flat source table doesn’t have any attributes of these entities, you might be able to find or create data sources for these dimensions. For example, from the HR system you may be able to get a list of all employees along with associated relevant attributes of each employee (such as location and business unit). For customers, perhaps you can get a list of all unique customers along with attributes from a CRM system. For dates, you can create a calendar table (or re-use one that you have standard code for).
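Even without an external source, a dimension can often be derived from the flat table itself. A minimal sketch in Power Query M, assuming a hypothetical wide SalesFlat query with illustrative product columns:

```
// Derive a product dimension from the wide source table by keeping
// only the product columns and removing duplicate rows.
let
    Source = SalesFlat,  // assumed existing wide query
    ProductColumns = Table.SelectColumns(
        Source, {"ProductID", "ProductName", "ProductCategory"}
    ),
    DimProduct = Table.Distinct(ProductColumns)
in
    DimProduct
```

The fact query would then keep only ProductID (along with the other keys and the numeric columns), giving the 1:many relationship described above.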
Even for attributes which clearly don't relate to a business entity (e.g. sort orders), it might be worth splitting them out into a 'junk dimension' table, as sketched below.
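A junk dimension is built the same way: take the distinct combinations of the miscellaneous flag columns and add a key. A minimal sketch, again assuming the hypothetical SalesFlat query with illustrative flag columns:

```
// Build a 'junk dimension' from the distinct combinations of
// low-cardinality flag columns, keyed by an index column.
let
    Source = SalesFlat,  // assumed existing wide query
    Flags = Table.SelectColumns(Source, {"OrderChannel", "IsPromotion"}),
    DistinctFlags = Table.Distinct(Flags),
    JunkDim = Table.AddIndexColumn(DistinctFlags, "JunkKey", 1, 1, Int64.Type)
in
    JunkDim
```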
Data Source with multiple fact tables
Where your numerical data is spread across different tables, you need to consider whether to combine them or keep them separate. Where the tables have the same attributes (at least for all the fields that you want to analyse by), they can be combined into a single table.
If the fields are the same but at a different level of detail (e.g. if one table has data by date and the other at a month level), it may still be possible to combine them by grouping one of the tables, as sketched below.
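A minimal sketch in Power Query M, assuming hypothetical DailySales and MonthlySales queries which share ProductID and SalesAmount columns but differ in grain:

```
// Roll the daily table up to month level so its grain matches the
// monthly table, then append the two into a single fact table.
// (Assumes MonthlySales already has Month, ProductID and SalesAmount columns.)
let
    DailyToMonthly = Table.Group(
        Table.AddColumn(DailySales, "Month", each Date.StartOfMonth([Date]), type date),
        {"Month", "ProductID"},
        {{"SalesAmount", each List.Sum([SalesAmount]), type number}}
    ),
    Combined = Table.Combine({DailyToMonthly, MonthlySales})
in
    Combined
```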
Where the fields are different across the tables, and combining them would lose a level of detail that you want to analyse by, the tables can be kept as separate fact tables.
In this scenario, any attribute fields that are common across both the tables (and which you want to analyse both sets of numerical data by) will need to be brought into a unified dimension table containing all unique dimension values. This dimension table can then join onto both/all fact tables.
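A minimal sketch of building such a conformed dimension in Power Query M, assuming hypothetical FactSales and FactReturns queries which each contain a ProductID column:

```
// Build a single conformed dimension containing every unique
// ProductID that appears in either fact table.
let
    ProductsA = Table.SelectColumns(FactSales, {"ProductID"}),
    ProductsB = Table.SelectColumns(FactReturns, {"ProductID"}),
    AllProducts = Table.Combine({ProductsA, ProductsB}),
    DimProduct = Table.Distinct(AllProducts)
in
    DimProduct
```

Any descriptive product attributes can then be merged onto this table, and each fact table relates to it in a 1:many relationship.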
The Topic Combine or select from multiple files runs through the techniques for working with these scenarios in the query editor.
In order to understand which attribute/dimension fields each table has, you can use the measures and dimensions granularity matrix (a variation of the Bus Matrix from the Kimball methodology).
This allows you to analyse the dimensions across the different data sources. The measures (the numerical values) from the various data sources are shown along the top, and the various dimensions (and grains of dimension – e.g. showing month and year as separate rows) are shown along the side. You then put a tick or cross against each cell to indicate whether that data source has the measure at that dimension/attribute.
This example is one I did for the Hawaii Tourism Competition:
In this example there were lots of spreadsheets, each showing some of the same metrics but by different dimensions or at different grains. For example, some data sources had visitor numbers by island and month, others by island and year, and others not by island but just by month/year. Each data source was assigned a code (e.g. B6), which is shown along the top.
This exercise allows you to see where you can combine data sources which have measures by the same dimensions/attributes, and which dimensions are common to tables which you instead need to keep separate.
In this scenario, whilst some of the tables could be combined, we ended up with multiple fact tables (each with slightly different granularity) and a handful of common dimension tables. The resulting data model looked like this:
Having multiple fact tables within your Power BI data model adds a level of complexity which should be avoided if possible. However, needing to keep multiple fact tables at different levels of granularity is a fairly common scenario, and it makes sticking to the data modelling principles described in Managing Table Relationships even more important. In this scenario, each combination of a fact table with its corresponding dimension tables is its own star, and should be represented as a separate tab in the data modelling relationships view, as shown below.