Work with queries in the query editor


Rishi Sapra

Technical Community leader, speaker, trainer and evangelist specialising in Power BI and Azure. Formally recognised by Microsoft as a Most Valuable Professional (MVP), Fast Track Recognised Solution Architect (FTRSA) and Microsoft Certified Trainer (MCT).


Overview

The query editor is the area where you connect to data and shape/transform it before it is loaded into the data model. A query within the editor is a table of data (from a source, created with code, or created by reference to another query) with zero or more transformation steps applied.

Each query is defined by code in a language called M (the Power Query language). For the most part you don’t need to write this code, as it is automatically generated when you connect to data and apply transformations through the query editor UI.

However, it is useful to look at the generated code by enabling the formula bar (an option on the View ribbon of the query editor) and/or by selecting ‘Advanced Editor’ from the Home ribbon:

Code can be copied from the Advanced Editor and pasted into another query, either an existing one or a new one (you can create a new query by going to Get Data -> Blank Query). It can also be pasted into a dataflow to be run separately from the report in the Power BI service.


A look at the Advanced Editor

Let’s explain the Advanced Editor with an example; see the screenshot below. The code starts with a “let” statement (in blue), followed by a series of “steps”. Each “step” is assigned a “variable name” (e.g. Source). Each variable/step returns a value, list or table of data, and each step after Source generally refers to the previous step’s variable. Together, these steps apply transformations to your dataset.

The “step” variable that represents the end result of the query is the one named after ‘in’ (in blue): #"Added Custom Column".
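As the original screenshot is not reproduced here, the following is a minimal sketch of what such generated code typically looks like (the file path, sheet name and column names are hypothetical):

    let
        Source = Excel.Workbook(File.Contents("C:\Data\Sales.xlsx"), null, true),
        Sales_Sheet = Source{[Item="Sales",Kind="Sheet"]}[Data],
        #"Promoted Headers" = Table.PromoteHeaders(Sales_Sheet, [PromoteAllScalars=true]),
        #"Added Custom Column" = Table.AddColumn(#"Promoted Headers", "Margin", each [Revenue] - [Cost])
    in
        #"Added Custom Column"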

Whilst the default is for each step to refer to the previous step, this is not strictly necessary. When a step references an earlier step, however, the ‘applied steps’ no longer appear as a simple sequence in the Query Settings pane.

The step variable names are prefixed with a hash, with the name in speech marks, in the example above (and in almost all recorded steps) because the step names contain spaces; if instead a step is renamed to something without spaces then the hash and speech marks are not required, and the resulting code looks considerably neater and easier to read:
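For illustration, here is how the same hypothetical query might look with the steps renamed, using the letter-prefix convention described below:

    let
        a_Source = Excel.Workbook(File.Contents("C:\Data\Sales.xlsx"), null, true),
        b_SalesSheet = a_Source{[Item="Sales",Kind="Sheet"]}[Data],
        c_PromoteHeaders = Table.PromoteHeaders(b_SalesSheet, [PromoteAllScalars=true]),
        d_AddMarginColumn = Table.AddColumn(c_PromoteHeaders, "Margin", each [Revenue] - [Cost])
    in
        d_AddMarginColumn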

Steps can be renamed directly in the Advanced Editor (taking care to manually update the reference to the renamed step in the following step as well) or by right-clicking on the step in the Query Settings pane and selecting Rename:

I like to name steps with a sequential letter prefix (a step name can’t start with a number) followed by an underscore and a name which describes what the step actually does, as in the example above.

A common scenario when working in the query editor is to use the UI to apply transformation steps and then edit the code in the Advanced Editor/formula bar to tweak an existing step. For example, if you had a step to “Unpivot Other Columns” and subsequently changed your source such that there was another column to be excluded from the unpivot, it is likely to be quicker to tweak the generated formula than to delete and re-apply the step.
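As a hypothetical sketch (assuming a query with a Country column and a new Region column arriving in the source), the tweak is a small edit to the generated formula:

    // Generated step: unpivot everything except Country
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Changed Type", {"Country"}, "Attribute", "Value")

    // Tweaked by hand: also exclude the new Region column from the unpivot
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Changed Type", {"Country", "Region"}, "Attribute", "Value")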

If you want to write your own steps, you can simply start typing a new variable name, use an appropriate M function referencing the previous step, and then pass the column names and other required values into the function.

Comments can be added to the code using the syntax shown here:
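M supports both single-line and multi-line comments. A minimal sketch (the folder path and filter step are hypothetical):

    let
        // Single-line comments start with two forward slashes
        Source = Folder.Files("C:\Data"),

        /* Multi-line comments are enclosed like this
           and can span several lines of explanation */
        KeepCsvOnly = Table.SelectRows(Source, each [Extension] = ".csv")
    in
        KeepCsvOnly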

Organising Queries

When a query is created it appears in the Queries pane on the left-hand side of the editor. If this is not visible, select the option to show it from the View tab. Queries can be selected (multi-selected using Ctrl/Shift), copied, and pasted into another PBIX file or into a text editor/Word document (in which case all the M code is copied over, which can be useful for documenting or version-controlling the queries):

Queries are organised into folders called groups. By default there are no groups (and once groups exist, new queries are initially placed in the ‘Other Queries’ group), though new groups can be created by right-clicking in a blank space in the Queries pane and selecting ‘New Group’. Existing queries can be moved into a group by selecting them, right-clicking and choosing ‘Move to Group’:

The names of the groups should reflect the types of queries they contain, e.g. staging queries (see below) in one group, the final fact table queries loaded to the model in another and dimension tables in a third. Parameters and functions should also potentially have their own groups.

Combined with naming conventions for queries and query steps, organising queries into groups is one of the most important ways to make the transformation steps easier to understand.


Intermediate/Staging Queries

Reference vs Duplicate

Intermediate queries are ones which are not loaded to the data model (the ‘Enable load’ checkbox is unticked for them); rather, their purpose is to serve as a starting point for one or more other queries in the query editor.

Disabling load for an intermediate query is achieved by right-clicking on it and de-selecting the ‘Enable load’ checkbox. To then use the output of this intermediate query as a starting point for another, right-click the intermediate query and select ‘Reference’:
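The referencing query’s code is then just a single step pointing at the intermediate query; for example, a query referencing an intermediate query named 0_FilesInFolder would contain something like:

    let
        Source = #"0_FilesInFolder"
    in
        Source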

Intermediate queries can be used to break up a series of logic steps across multiple queries. Rather than having dozens of logic steps in a single query, you can have queries in a chain each with only a handful of steps. This makes the transformation logic more readable overall, particularly when each query relates to a set of activities and is named as such.

It also allows for the query to be a starting point for one or more other queries, increasing the robustness of the overall logic and allowing for logic branching. Staging queries are a type of intermediate query and are a good example of this. Staging queries contain only connections to the data (e.g. folder locations or servers) with no additional transformations applied. The staging query is then referenced to create a new query for each file/table that is required.

In the example video above, a parameter called FolderPath is created, together with a staging query (0_FilesInFolder) which just connects to this folder. There may be multiple files in this folder (or its subfolders) and we want to bring several of them in as separate queries. Rather than each query connecting directly to a hardcoded folder source, the queries for each file can all reference the staging query: the step of connecting to the folder is common to all of them anyway; only the navigation to the specific file (and any subsequent transformations) is unique to individual queries.
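A sketch of this pattern, using the FolderPath parameter and 0_FilesInFolder query from the example (the file name and CSV settings are assumptions for illustration):

    // 0_FilesInFolder: the staging query, which just connects to the folder
    let
        Source = Folder.Files(FolderPath)
    in
        Source

    // A file-specific query that references the staging query and navigates
    // to one file; any further transformations are applied from here
    let
        Source = #"0_FilesInFolder",
        FileContent = Source{[Name = "Sales H1 2019.csv"]}[Content],
        Imported = Csv.Document(FileContent, [Delimiter = ",", Encoding = 65001]),
        Promoted = Table.PromoteHeaders(Imported, [PromoteAllScalars = true])
    in
        Promoted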

This pattern reduces the number of times the same steps are repeated across multiple queries, which helps to drive consistency, efficiency and readability. This is in contrast to the other option available when right-clicking on a query: Duplicate. Duplicating a query creates another query with identical steps, equivalent to copying the code from the Advanced Editor and pasting it into a new blank query.

Query Branching

Both referencing and duplicating queries are ways of creating new branches of logic. This can be seen in the query dependencies window (accessible via the View ribbon):

The query 0_FilesInFolder is the staging query which connects directly to the folder source via the parameter FolderPath. The query 1_FilesWithDate references 0_FilesInFolder, and 2_LatestDateFile references 1_FilesWithDate, as can be seen in the top branch. The query 2_DuplicateOfQuery1 forms a new branch as it duplicates 1_FilesWithDate. Creating another reference to 0_FilesInFolder (as the query 1_AnotherReferenceToQuery0 does) also creates another branch.

For reasons of reducing redundancy and keeping things simple, referencing a query is in general preferable to duplicating it.

Branching logic is necessary for the simple reason that a query cannot refer to a later query in the same branch. For example, you cannot merge 1_FilesWithDate with 2_LatestDateFile because the latter query is a reference to the former. However, you can merge 1_FilesWithDate with either 2_DuplicateOfQuery1 or 1_AnotherReferenceToQuery0 because they are in a different branch. Avoiding this circular reference is a scenario where you need to either duplicate a query in order to isolate it into a separate branch, or create a new branch by referencing an earlier query.

You would potentially choose to duplicate if the steps applied in the new branch query are very similar to the corresponding ones in the other branch, though this is just a matter of practicality.

There are cases where creating a new logic branch is necessary, e.g. if the new branch has a different set of filter criteria or transformation steps.

In theory, referencing also improves performance by having some steps (e.g. connections to source) performed only once. Given the way the engine works, however, this isn’t necessarily the case. Whilst improving readability, the default options in Power BI can make using intermediate queries actually less performant, because a data preview is loaded for each query. Performance when working with referenced queries in the query editor can be helped by de-selecting the option to allow data previews to download in the background in Power BI Desktop options:

Appending Queries

Consider the Actuals vs Budget data where the actuals data is split across two files: one containing sales for H1 2019 and the other sales for H2 2019. The structure of both files is identical; they both contain exactly the same columns.

Using the same method as described above to bring in both files as separate queries means that our query editor now looks like this:

Note that both queries have load disabled, as we don’t want to load the H1 and H2 files separately but rather the combined sales data.

To combine the data in these queries, we click on the dropdown next to Append Queries on the Home ribbon, select ‘Append Queries as New’, select the option to append two tables and choose the tables:

This creates a new query called Append1, which we can rename to something more meaningful. The append is clever enough not to bring across the header rows multiple times, so the new query contains exactly the rows of data we need. Note that if one of the files had columns that didn’t exist in the other, these would be brought across as columns in the appended dataset, with nulls against the rows from the dataset that didn’t contain the column.
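Behind the scenes, ‘Append Queries as New’ generates a single Table.Combine step; assuming the two source queries are named Sales H1 2019 and Sales H2 2019, the new query’s code looks something like:

    let
        Source = Table.Combine({#"Sales H1 2019", #"Sales H2 2019"})
    in
        Source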

Whilst this works well as a one-off exercise, if you need to periodically pick up and append new files (e.g. if new extracts are placed into a folder each day/week/month), then instead of this method you can use the append-from-folder functionality within the query editor.

This will automatically append all the files in a folder, provided they have the same structure. Once this is set up, each time the report refreshes (up to 8 times per day on a Pro licence) it will take all the files in the folder and append them, applying any other transformations automatically with no manual intervention required.

To do this, select Folder as the data source, point it to the folder that contains all the files to append (in CSV, JSON or Excel format) and select ‘Edit’. This will list all the files in the folder along with their file properties:

At this point it is possible to filter the list down to only the specific files to append (e.g. ensuring only Excel files are included by filtering on the Extension column) or to create a calculated column, e.g. to extract text from the filename (see the section on selecting the latest file in the Combine or Select From Multiple Files topic).

To combine the data you only need the Content column, so unless you want to retain any other columns, select the Content column, right-click and choose ‘Remove Other Columns’. (If you do want to keep any other columns as additional metadata, e.g. the filename, then hold down Ctrl and select all the columns you want to keep before removing the others.) Then click on the double down arrow in the top right-hand corner of the Content column heading to combine the files:

Where you are combining Excel (rather than CSV) files, a dialog box will appear asking which file you want to use as a template for combining them; assuming they all have the same columns (which they need to for this to work anyway), it doesn’t matter which one you pick. Simply selecting OK in this dialog box will automatically create the sample file and function required to combine the files.

It is also possible to create your own function to combine the files. This does not involve writing M code from scratch, but rather ingesting (and applying any transformations to) one of the files, creating a function from it and then invoking this function on the files in the folder:
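A sketch of this approach, assuming CSV files and a hypothetical function name fnTransformFile:

    // fnTransformFile: a custom function that takes one file's binary content
    // and returns a cleaned table (any per-file transformations go here)
    (fileContent as binary) as table =>
    let
        Imported = Csv.Document(fileContent, [Delimiter = ",", Encoding = 65001]),
        Promoted = Table.PromoteHeaders(Imported, [PromoteAllScalars = true])
    in
        Promoted

    // Invoking the function against every file in the folder and combining
    let
        Source = Folder.Files(FolderPath),
        CsvOnly = Table.SelectRows(Source, each [Extension] = ".csv"),
        WithData = Table.AddColumn(CsvOnly, "Data", each fnTransformFile([Content])),
        Combined = Table.Combine(WithData[Data])
    in
        Combined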

Merging Queries

When you have identical files split only across a different dimension value (e.g. a date), appending them is the obvious choice. The other option for combining queries is to merge them: bring in one or more columns from Query 2 where values in one or more columns match between Query 1 and Query 2.

The Merge Queries functionality works like an advanced lookup in Excel, with the difference that in Power BI you can join on multiple columns simultaneously, return multiple values from the lookup and specify a join type. In this way it is far closer to a SQL join; indeed the types of SQL join (e.g. left/right outer join, inner join) are replicated in Power BI.

The different types of join are represented in the diagram below:

For example, the left outer join means that all records from the table identified as the first query in the merge window (the one at the top) are automatically brought in, and wherever there is a matching row in the second query, the corresponding column values from the second query can be brought in. Where there is a value in the first query with no corresponding value in the second, null is returned for any column values from the second query. Where there is more than one match, all corresponding records are brought in.

The most intuitive and easily understandable use of the merge functionality is when it is used as a lookup – to return additional attributes for dimension values.

To see this in action, let’s consider an example where we have a lookup table containing the region for various countries.

Actuals (Left Query):

Region lookup with a single value for Canada and France:

To bring in the corresponding region we can then use the merge functionality as shown:
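In code terms, the merge adds a nested-join step followed by a column expansion; a sketch assuming the queries are named Actuals and RegionLookup, joined on a Country column:

    let
        Source = Table.NestedJoin(Actuals, {"Country"}, RegionLookup, {"Country"}, "RegionLookup", JoinKind.LeftOuter),
        Expanded = Table.ExpandTableColumn(Source, "RegionLookup", {"Region"}, {"Region"})
    in
        Expanded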

We use a left join in this example because it is the safest: since there is one and only one row for each country in the lookup table, our resulting Actuals table has the same number of rows (9), with exactly the same values as before the merge operation, but with an additional column for region:

If we instead used a version of the region lookup table which didn’t have a record for Canada:

Then our resulting left join would return:

i.e. there are null region values against Canada because it doesn’t exist in the lookup table. This is expected, and we are warned that not all our rows match when we perform the merge operation (despite the tick, we can see that only 1 of the 4 rows matches):

Using this lookup table with only one country would be potentially problematic if we had instead used an inner join, i.e. returning only rows where the match across the selected column(s) exists in both tables. Our resulting Actuals table would then look like this:

i.e. we have lost the 3 rows of our actuals table which didn’t map to a region.

We face the opposite problem when there is more than one row in the mapping table for the same value in the column(s) we are merging on. Consider this mapping table:

A left join against this table then returns:

i.e. we have gained an additional row in our Actuals table as the record for France is duplicated: once where the corresponding region is Europe and once where it is Western Europe. Without the right DAX or filters in our report this will give us the wrong results; the total number of units sold in the period is 5,453, not 7,178.

In this case the problem occurs regardless of the join type. The only way to deal with it is to make the join unique, either by removing duplicates/grouping in the lookup table or by joining on multiple columns such that the combination of columns selected has one and only one resulting row in the lookup table.
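One way to enforce that uniqueness before the merge is to de-duplicate the lookup table on the join column(s); a sketch assuming the lookup query is named RegionLookup:

    let
        Source = RegionLookup,
        // Keep only one row per country so the merge cannot fan out rows
        Deduped = Table.Distinct(Source, {"Country"})
    in
        Deduped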

Performing merges therefore requires careful consideration of the granularity of the data. There is certainly a use case for using merges to create exploded versions of dimension tables with repeated values, namely denormalisation, which is discussed in the topic Working with Data at Different Grains. This, however, applies to dimension tables which do not contain numerical values to be aggregated. Denormalisation can also be used on fact tables to deal with complex table relationships (e.g. as an alternative to a genuine many-to-many relationship), though the calculations then need to make use of DISTINCTCOUNT or iterate over unique values in order to ensure correct results.
