Combine or Select from multiple files
Overview
One of the key benefits of using Power BI is that you can mash together different data sources. For example, you can bring in sales transactions and combine them with customer data from your CRM system in order to analyse which customer segments are buying the most and where you should focus your marketing efforts.
With a single source for facts (numerical data) and sources for each dimension (Customer, Employee etc) which relate easily to the fact table, building a star schema is quite straightforward. However, in the real world it is seldom as simple as this. Often the numerical data is scattered across multiple files, perhaps from completely different areas, with different columns, at different levels of granularity, and with discrepancies in naming conventions that make linking the data together more difficult.
This topic looks at how to work with combining numerical data across multiple sources, in particular using the query editor to create a single fact table, or conforming dimensions across multiple fact tables when it is more appropriate to keep them separate.
We will start by looking at creating a star schema from a single flat source, as many of the concepts here are the foundation for working with multiple files. Next we will look at scenarios where we have multiple files with transactional data – either at the same or a different granularity. We can analyse the dimensions, and the grain of those dimensions, by using a Bus Matrix.
To create a star schema from these files we need to consider whether we can combine the numerical data into a single fact table (if they are by the same dimensions at – or which can be brought to – the same grain), or whether we need to keep them as separate fact tables (where we have different dimensions or dimensions at a different grain to be retained).
The topic Work with data at different grains looks at combining normalized dimension tables (attributes split out into their own tables) into a single dimension table better suited to BI reporting.
Creating a star schema from a single flat source
This section looks at the worked example Creating a star schema from a single flat file. The source is a very simple Excel worksheet with sales on particular dates – the columns are Product ID, Product, Product Number, Date, Sales Quantity and Product Price.
We could load this data straight into Power BI and create a report from it – we could create Measures to calculate the total sales and analyse it over time or by product.
How we model this depends on the business questions we need to answer; let’s assume what we want to understand from this is the amount/proportion of sales which are driven by expensive vs cheap products, and whether the type of product people buy is seasonal.
Thinking about this from the perspective of business entities/dimensions, we have two clear candidates here: Product and Date. Each one of these will have multiple attributes – description and product number are attributes of the product dimension and we can create a calendar table with multiple attributes of each date such as day, month and year. Therefore we want to create a date dimension table and a product dimension table.
With regards to the facts, we would be interested in sales quantity and sales amount (a calculated field of sales quantity multiplied by price) so these are the fields we should look to have in our fact table.
We start by creating a parameter called FolderPath which contains the folder where we have stored the Excel file we are using as a source, and connect to this folder path using the Folder connector (renaming the query and disabling load on it), as shown below:
We then reference this query, filter on file name for the file we are interested in (filtering down to a single row), remove other columns and then bring in the data table as a query called 1_SalesData (also disabling load on this):
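In M, these staging steps can be sketched roughly as follows. The file name, sheet name and step names here are assumptions for illustration – your generated steps will differ:

```m
let
    // FolderPath is the parameter created earlier; "Sales.xlsx" is a placeholder name
    Source = Folder.Files(FolderPath),
    // Filter down to the single file we are interested in
    FilteredFile = Table.SelectRows(Source, each [Name] = "Sales.xlsx"),
    // Read the binary content as an Excel workbook and pick the sheet
    Workbook = Excel.Workbook(FilteredFile{0}[Content]),
    SalesSheet = Workbook{[Item = "Sales", Kind = "Sheet"]}[Data],
    PromotedHeaders = Table.PromoteHeaders(SalesSheet, [PromoteAllScalars = true])
in
    PromotedHeaders
```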
Next we create two references to these queries – calling one fctSales for our Sales fact table, and one called DimProduct (load should remain enabled for these two). With our fctSales Query, we choose just the product ID (the unique identifier for each product), date and the numerical columns:
In this sales fact table, the exact price is not a relevant attribute in itself (whether the product is expensive or cheap is an attribute of the product dimension); however, we do need the price here to calculate total sales, after which we can remove the Price column from our fact table:
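As a sketch, the fact-table steps might look like this in M (the column names follow the source worksheet and are assumptions for illustration):

```m
let
    Source = #"1_SalesData",
    // Keep only the keys and the numerical columns
    SelectedColumns = Table.SelectColumns(Source,
        {"Product ID", "Date", "Sales Quantity", "Product Price"}),
    // Sales Amount = quantity x price
    AddedSalesAmount = Table.AddColumn(SelectedColumns, "Sales Amount",
        each [Sales Quantity] * [Product Price], type number),
    // Price belongs to the product dimension, so drop it once it has been used
    RemovedPrice = Table.RemoveColumns(AddedSalesAmount, {"Product Price"})
in
    RemovedPrice
```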
Coming now to our DimProduct table, we want this to be the unique list of products – one row per Product ID. We choose the product columns (including Price) and remove duplicates from the table, checking that Product ID is unique by temporarily applying a Keep Duplicates step with the Product ID column selected:
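A minimal sketch of the DimProduct query, assuming the column names above:

```m
let
    Source = #"1_SalesData",
    // Keep only product attributes (including Price for now)
    ProductColumns = Table.SelectColumns(Source,
        {"Product ID", "Product", "Product Number", "Product Price"}),
    // One row per product - de-duplicate on the key
    DistinctProducts = Table.Distinct(ProductColumns, {"Product ID"})
in
    DistinctProducts
```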
Having the product price as a dimension attribute isn’t particularly helpful however as it is difficult to slice and dice by specific numerical values. Instead we can add a conditional column to determine the product category (an attribute we can slice and dice by) and then remove the price column:
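The conditional column might be sketched as below; the threshold of 100 and the band names are purely illustrative assumptions:

```m
let
    Source = DimProduct,  // the distinct product list built above
    // Hypothetical price bands - the threshold is an assumption for illustration
    AddedCategory = Table.AddColumn(Source, "Price Category",
        each if [Product Price] >= 100 then "Expensive" else "Cheap", type text),
    // The raw price is no longer needed as a dimension attribute
    RemovedPrice = Table.RemoveColumns(AddedCategory, {"Product Price"})
in
    RemovedPrice
```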
To create our date dimension, we can use a ready-made M script from Chris Webb which will give us a table with one row for each continuous date between the start/end date provided along with relevant attributes for this date:
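If you prefer not to use a ready-made script, a minimal date dimension can be built in M along these lines. The date range and the handful of attributes here are illustrative only – Chris Webb's script produces a much richer table:

```m
let
    StartDate = #date(2020, 1, 1),   // assumed range - should cover all fact data
    EndDate   = #date(2021, 12, 31),
    DayCount  = Duration.Days(EndDate - StartDate) + 1,
    Dates     = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
    AsTable   = Table.FromList(Dates, Splitter.SplitByNothing(), {"Date"}),
    Typed     = Table.TransformColumnTypes(AsTable, {{"Date", type date}}),
    AddYear   = Table.AddColumn(Typed, "Year", each Date.Year([Date]), Int64.Type),
    AddMonth  = Table.AddColumn(AddYear, "Month Name", each Date.MonthName([Date]), type text),
    AddDay    = Table.AddColumn(AddMonth, "Day", each Date.Day([Date]), Int64.Type)
in
    AddDay
```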
Finally we need to ensure all the data types are correct across all our tables (it is often best to do this as a last step yourself rather than accepting the automatic change type step that is often created), and close and load to our data model.
We have loaded three tables to our data model which reflect a star schema:
Note that building out dimensions from a transaction table still comes with limitations. Our products dimension, for example, will only contain products that sold in the period that we are loading data for. This means that we wouldn’t be able to perform an analysis on the products that didn’t sell. For this, we would instead need to source our product dimension data from a reference master data table of all possible products. If such a source is available, it is likely to be richer than the product information contained in the transaction table and it is preferable to use this.
Multiple files with the same Dimensions/granularity
Perhaps the most common scenario is one where there are multiple files in the same format to be combined – e.g. one file per period, or data split across another dimension (e.g. a separate file for each product).
Combining these files can be achieved using the append functionality in the Query editor and then, once the files are combined, the dimension tables can be created as separate tables by following the steps in the section above.
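As a sketch, appending in M uses Table.Combine; the query names here are placeholders for your individual file queries:

```m
let
    // Each query holds one period's file in the same column layout
    Combined = Table.Combine({#"Sales Jan", #"Sales Feb", #"Sales Mar"})
in
    Combined
```

In practice, when all the files sit in one folder, the Folder connector's Combine Files button will generate equivalent steps for you.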
Files which have the same Dimensions but at a different grain
Consider the Actual vs Budget worked example which has actuals data at a more detailed grain than budget data. The Actuals data is at the granularity of Segment, Country, Date, Product and discount band; the budget data however is just at the granularity of Segment, Country and Month.
Actuals:
Budget:
How we deal with this scenario also depends on the business questions we want the report to answer; if we want to be able to analyse actuals by individual date and product (which don’t exist in the budget table) then we will likely need to keep them as separate fact tables (See below).
If however we want to combine these into a single fact table then we will need to only have dimensions which are common between them (Segment, Country) and bring them to the same grain (i.e. to the month level rather than individual date).
To bring the actuals data up to a monthly granularity, we first need to add a column with the month attribute of the date. To match the budget table, it is actually the first date of the month that we need. This can be achieved using Add Column -> Date -> Start of Month:
And we can call the column Month so that it is the same as the budget table.
Next we need to group by this column (and others that are common with the budget table – Country and Segment) so that we can get the total units sold by Month, Country and Segment. This is achieved by selecting these columns, and then choosing ‘Group By’ in the Home ribbon. We want to group by those columns and add a column for the sum of total units:
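The two steps above can be sketched in M as follows (column names such as Units Sold are assumptions based on the example):

```m
let
    Source = Actuals,  // the actuals query at daily grain
    // First day of the month, matching the grain of the budget table
    AddedMonth = Table.AddColumn(Source, "Month",
        each Date.StartOfMonth([Date]), type date),
    // Aggregate up to Month / Country / Segment
    Grouped = Table.Group(AddedMonth, {"Month", "Country", "Segment"},
        {{"Total Units", each List.Sum([Units Sold]), type number}})
in
    Grouped
```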
This gives us the same columns as the budget table, so our new actuals grouped table looks like:
And our budget table looks like:
Now that these tables have the same columns (at the same grain), they can be appended into a single fact table as we did above. We want to be able to differentiate the actual units sold from the budget units, so keeping the column names different will ensure that they appear as two separate columns (with nulls in the budget units column for actuals rows and in the total units column for budget rows):
Alternatively, if we wanted the values in a single ‘Total Units’ column, we could rename ‘Budget Units’ in the budget table to ‘Total Units’. We would, however, need a way of differentiating between budget and actual figures, so before appending we could add a custom column to each table with the same name (e.g. ‘Scenario’) and the value Actuals or Budget:
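A sketch of tagging and appending the two tables (the query names are assumptions for illustration):

```m
let
    // The Scenario column name must match exactly in both tables
    ActualsTagged = Table.AddColumn(ActualsGrouped, "Scenario", each "Actuals", type text),
    BudgetTagged  = Table.AddColumn(Budget, "Scenario", each "Budget", type text),
    Combined      = Table.Combine({ActualsTagged, BudgetTagged})
in
    Combined
```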
And once we’ve done the same thing with the budget table (ensuring the custom column name is exactly the same) the resulting data table looks like:
We can then consider creating our dimension tables from this to create a star schema – we may consider separating scenario as a separate dimension given that this will be a necessary filter to apply in any calculation (as budget and actual values cannot be combined).
Files have some different dimensions/are at a different grain
In the scenario above, whilst we have achieved a simple data model with a single fact table, we have lost expressivity in doing so. The actuals data can no longer be analysed by product or individual date. In this example it is actually likely to be preferable to keep the actuals and budget tables as separate fact tables, and have the common dimension tables linking to both fact tables, e.g:
This also probably makes more sense from a dimensional modelling perspective: sales and budget data represent different processes/events so should be represented as separate fact tables.
In the scenario where we have multiple fact tables, splitting out common attribute columns into separate dimension tables becomes necessary in order to have a common basis for analysis between them. In the example above, if we kept the actual and budget tables as separate instead of combining them, we could split out the common attributes (e.g. segment, country) each into their own dimension table and use these in slicers/filters/axis on the report in order to compare budget and actual numbers.
In order to generate the data model shown above, we need to create dimension tables for Segment, Country and Month which connect to both fact tables. It is important, however, that each dimension table contains the unique list of values across all fact tables. There may be Segments or Countries that exist in either the Actuals or the Budget table but not the other (and the two tables cover different date ranges), so in order for the dimension tables to link to both fact tables, the dimension values must be combined/conformed across them.
Taking the segment dimension first, we start with creating two intermediate queries each with the unique dimension values from the Actuals table and the budget tables. This is achieved by referencing the actual/budget queries, removing other columns and then removing duplicates from the Segment column as shown:
We can then append the two new intermediate queries into a query with combined Segment dimension values removing duplicates and removing the null value:
Whilst not strictly necessary, the data model will be more performant if it joins on a Segment ID rather than on the name, so we can add an index column to our new dimension table and then merge the fact tables back onto it to return the corresponding ID, replacing the Segment name:
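Sketched in M, conforming the Segment dimension and adding a surrogate key might look like this (the query names are placeholders):

```m
let
    // Unique segment values from each fact table
    ActualSegments = Table.Distinct(Table.SelectColumns(Actuals, {"Segment"})),
    BudgetSegments = Table.Distinct(Table.SelectColumns(Budget, {"Segment"})),
    // Combine, de-duplicate and drop the null value
    AllSegments = Table.SelectRows(
        Table.Distinct(Table.Combine({ActualSegments, BudgetSegments})),
        each [Segment] <> null),
    // Surrogate key to join on instead of the name
    DimSegment = Table.AddIndexColumn(AllSegments, "Segment ID", 1, 1, Int64.Type)
in
    DimSegment
```

The fact tables can then be merged back onto this dimension on the Segment name, expanding only the Segment ID column and removing the original name column.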
We then repeat the same steps for the Country dimension. Our list of queries then looks like:
Our Actuals table looks like:
And the budget table:
For the date dimension, we need to have a table of unique month values which joins onto the date table (which in turn joins onto actuals), as well as directly onto the budget table on month. It needs to contain all dates across both the budget and actuals table.
Rather than using the fact tables, for this we can use the date table as we have already ensured it covers the full range of dates across both Budget and Actuals.
The date table contains a month number and year number but not a Year Month key/attribute. To create this, we can reference the date table as a new query, use Column from Example to create our month attributes and then remove duplicates to ensure the table has one row per Year Month. (N.B. if the month key is not identified correctly through Column from Example, it can be added as a custom column with the formula Date.ToText([Month], "yyyyMM").)
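A sketch of the month dimension query, assuming the date table is called DimDate and already contains Year and Month Name columns:

```m
let
    Source = DimDate,
    // yyyyMM key, e.g. 202101 for January 2021
    AddedMonthKey = Table.AddColumn(Source, "MonthKey",
        each Date.ToText([Date], "yyyyMM"), type text),
    MonthColumns = Table.SelectColumns(AddedMonthKey, {"MonthKey", "Year", "Month Name"}),
    // One row per year-month
    DistinctMonths = Table.Distinct(MonthColumns, {"MonthKey"})
in
    DistinctMonths
```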
Once we’ve added the same monthkey to the Budget table, we can then load the tables to the Power BI data model and create the relationships as shown in the relationship view diagram above.
Dealing with inconsistent data
This scenario becomes slightly more complicated when there are data inconsistencies between the original data tables. In the Hawaii Tourism example, the different datasets refer to the same country in different ways – e.g. Switzerland is called SWITZER-LAND* in one table and SWITZ in another. To deal with this, we can create a manual mapping table (in Excel or using the ‘Enter Data’ function) with the various source values and a corresponding common ‘conformed’ name for each country which we want to show in our data model, as follows:
In each table (one was called B7 and one B8), we then merge against this mapping table to return the corresponding conformed name (or index number). The dimension table then has the conformed name and the index number, along with any other attributes of the country (e.g. longitude/latitude) which we want to include. This is shown in the video below.
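The merge against the mapping table might be sketched as follows (the table and column names are assumptions for illustration):

```m
let
    Source = B7,  // one of the source tables with raw country names
    // Left outer join against the manual mapping table
    Merged = Table.NestedJoin(Source, {"Country"},
        CountryMapping, {"Raw Name"}, "Mapping", JoinKind.LeftOuter),
    // Bring through only the conformed key and name
    Expanded = Table.ExpandTableColumn(Merged, "Mapping",
        {"Conformed Name", "Country ID"}),
    RemovedRaw = Table.RemoveColumns(Expanded, {"Country"})
in
    RemovedRaw
```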
Selecting the latest file in a folder
Rather than appending all the files in a folder, sometimes you want to just select the file for the latest date. In the Selecting the latest file in a folder worked example, there are a series of files in a folder which have the date (in YYYYMMDD format) as a suffix in the filename.
The goal of the exercise is to select only the file with the latest date and bring in the contents of this file as a new query. Whilst we could use the metadata of the file (date created, date modified etc.), this is generally a less reliable way of picking the latest file, as the metadata can change – for example if the file is moved to a new location. Where the date is already part of the filename it is best to take it from there.
The first step to do this is to pick out the date from the end of the filename. You could write your own custom column to do this, although in the first instance it is worth trying Column from example to see if it will pick it out reliably:
Next we want to format the column as a date, sort descending and then keep only the top row:
We can then click on ‘Binary’ in the Content column in order to read the contents of this file. Note that the code generated by this is = #"Kept First Rows"{0}[Content]. This indicates that it is taking the content of the first row of data in the query. This ensures that even when new files with different names appear in the folder, the query will pick up whichever file is the latest rather than a specific filename.
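The whole pattern can be sketched as a single query; the filename pattern "Sales_20210131.xlsx" and the .xlsx extension are assumptions for illustration:

```m
let
    Source = Folder.Files(FolderPath),
    // Extract the YYYYMMDD suffix from names such as "Sales_20210131.xlsx"
    // (8 date characters sit immediately before the 5-character ".xlsx")
    AddedDateText = Table.AddColumn(Source, "FileDate",
        each Text.Middle([Name], Text.Length([Name]) - 13, 8), type text),
    // Parse the text as a proper date so it sorts reliably
    AddedDate = Table.TransformColumns(AddedDateText,
        {{"FileDate", each Date.FromText(_, [Format = "yyyyMMdd"]), type date}}),
    Sorted = Table.Sort(AddedDate, {{"FileDate", Order.Descending}}),
    #"Kept First Rows" = Table.FirstN(Sorted, 1),
    // Take the binary content of the latest file
    LatestContent = #"Kept First Rows"{0}[Content]
in
    LatestContent
```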