PMsquare

View Original

Join Query Optimization in Data Modules

Improve Performance

Optimization of joined queries in Cognos Analytics is a useful way to improve performance by reducing the data returned by the database. Joins with one-to-many or many-to-one cardinality use key values from the table on the one side of the join to filter rows in the table on the many sides of the join. Like a developer who uses IN, BETWEEN, or a subquery in their custom SQL query, so can the Cognos query engine.

Join optimization in models and reports was introduced in 10.2.1. In Report Studio, it was available in the Query Explorer in the properties of a join:

In Framework Manager, you could set it on the join in Properties > (DQM) Filter Types:

It is also a Filter Type property in a Cognos Analytics report:

Join optimization in Data Modules is defined in the relationship between two tables, and a few more options are included:

What's the difference?

The three types of filters are (in FM, Report Studio, and report authoring in CA):

  • IN: the generated filter in the SQL is an IN predicate comprised of a set of values of the join keys from the one side.

  • BETWEEN: the generated filter is a BETWEEN predicate comprised of the minimum and maximum of the join key values from the one side.

  • TABLE: the generated filter is an IN predicate containing a list of join keys from the one side, but the list is compiled by a subquery.

In data modules, the names have changed but they work the same way.

  • No filtering is the same as (Default) or None.

  • The Unique values option is the same as IN.

  • The Range of values option is the same as BETWEEN.

  • Unique values in a subquery setting are the same as TABLE.

To illustrate this I built a data module that contains two tables, Product and Order Details. They have a relationship defined that looks like this:

The join keys for these two tables are Product.Product Number and Order Details.Product Number.

No filtering

If I do not set the join optimization filter, this is the SQL I get when I pull in Product.Product Type Code and Order Details.Quantity into a list.

SELECT
    "PRODUCT0"."PRODUCT_TYPE_CODE" AS "Product_Type_Code", 
    SUM("ORDER_DETAILS0"."QUANTITY") AS "Quantity"
FROM
    "GOSALES"."gosales"."PRODUCT" "PRODUCT0"
INNER JOIN "GOSALES"."gosales"."ORDER_DETAILS" "ORDER_DETAILS0"
    ON "PRODUCT0"."PRODUCT_NUMBER" = "ORDER_DETAILS0"."PRODUCT_NUMBER" 
GROUP BY
    "PRODUCT0"."PRODUCT_TYPE_CODE"

Unique values (IN)

With the Unique values setting, a list of unique join key values from the query the one side is used to filter the query on the many side. (In the case of a one-to-one relationship the filter is applied to the second table.)

Using the example above, if I change the filter optimization in the join in the data module to Unique values, the SQL is very different. We are filtering the many side of the join (Order Details) by every unique value for Product Number (the join key) in the Products table.

WITH
"ORDER_DETAILS0" AS
    (
    SELECT
 "ORDER_DETAILS"."PRODUCT_NUMBER" AS "PRODUCT_NUMBER",
 "ORDER_DETAILS"."QUANTITY" AS "QUANTITY",
 "ORDER_DETAILS"."ORDER_DETAIL_CODE" AS "ORDER_DETAIL_CODE", 
 "ORDER_DETAILS"."ORDER_NUMBER" AS "ORDER_NUMBER",
 "ORDER_DETAILS"."SHIP_DATE" AS "SHIP_DATE",
 "ORDER_DETAILS"."PROMOTION_CODE" AS "PROMOTION_CODE",
 "ORDER_DETAILS"."UNIT_COST" AS "UNIT_COST",
 "ORDER_DETAILS"."UNIT_PRICE" AS "UNIT_PRICE",
 "ORDER_DETAILS"."UNIT_SALE_PRICE" AS "UNIT_SALE_PRICE"
    FROM
 "GOSALES"."gosales"."ORDER_DETAILS" "ORDER_DETAILS"
    WHERE
 "ORDER_DETAILS"."PRODUCT_NUMBER" IN (
     1110, 2110, 3110, 4110, 5110, 6110, 7110, 8110, 9110, 10110, 11110, 12110,  13110, 14110, 15110, 16110, 17110, 18110, 19110, 20110, 21110, 22110, 23110, 24110, 25110, 26110, 27110, 28110, 29110, 30110, 31110, 32110, 33110, 34110, 35110, 36110, 37110, 38110, 39110, 40110, 41110, 42110, 43110, 44110, 45110, 46110, 47110, 48110, 49110, 50110, 51110, 52110, 53110, 54110, 55110, 56110, 57110, 58110, 59110, 60110, 61110, 62110, 63110, 63130, 63140, 64110, 65110, 65120, 66110, 67110, 68110, 68120, 68190, 68200, 68220, 68230, 68240, 68250, 69110, 69120, 70110, 70120, 70140, 70160, 70200, 70240, 70250, 71110, 72110, 73110, 74110, 75110, 76110, 77110, 78110, 79110, 80110, 81110, 82110, 83110, 84110, 85110, 86110, 87110, 88110, 89110, 90110, 91110, 92110, 93110, 94110, 95110, 96110, 97110, 98110, 99110, 100110, 101110, 102110, 103110, 104110, 105110, 106110, 107110, 108110, 109110, 110110, 111110, 112110, 113110, 114110, 115110, 122110, 122120, 122130, 122140, 122150, 123110, 123120, 123130, 123140, 123150, 123160, 124110, 124120, 124130, 124140, 124150, 124160, 124170, 124180, 124190, 125110, 125120, 125130, 125140, 125150, 125160, 125170, 126110, 126120, 126130, 126140, 126150, 127110, 127120, 127130, 127140, 127150, 128110, 128120, 128130, 128140, 128150, 128160, 128170, 128180, 128190, 128200, 128210, 129110, 129120, 129130, 129140, 129150, 129160, 129170, 129180, 130110, 130120, 130130, 130140, 130150, 131110, 131120, 131130, 132110, 132120, 132130, 132140, 132150, 132160, 132170, 133110, 134110, 134120, 134130, 134140, 135110, 135120, 135130, 136110, 136120, 136130, 136140, 136150, 140110, 141110, 142110, 143110, 143120, 143130, 143140, 143150, 144110, 144120, 144130, 144140, 144150, 144160, 144170, 144180, 144190, 144200, 145110, 145120, 145130, 145140, 145150, 145160, 145170, 145180, 146110, 146120, 146130, 146140, 147110, 149150, 149160, 151110, 151120, 151130, 152110, 153110,  154110, 154120, 154130, 154140, 154150 )
    ),
"ORDER_DETAILS_1" AS
    (
    SELECT
 "ORDER_DETAILS0"."PRODUCT_NUMBER" AS "PRODUCT_NUMBER",
 "ORDER_DETAILS0"."QUANTITY" AS "QUANTITY"
    FROM
 "ORDER_DETAILS0"
    )
SELECT
    "PRODUCT_1"."PRODUCT_TYPE_CODE" AS "Product_Type_Code",
    SUM("ORDER_DETAILS_1"."QUANTITY") AS "Quantity"
FROM
    "GOSALES"."gosales"."PRODUCT" "PRODUCT_1"
 INNER JOIN "ORDER_DETAILS_1"
 ON "PRODUCT_1"."PRODUCT_NUMBER" = "ORDER_DETAILS_1"."PRODUCT_NUMBER" 
GROUP BY
    "PRODUCT_1"."PRODUCT_TYPE_CODE"

In this example, because we are pulling from the same source system, we are probably not going to see much of a difference in run time for the query. As this is a sample source of data, all the products are represented in the Order Details table. The filter will not make a big difference (and could actually slow it down).

When would I use Unique values?

Where this type of filter optimization will really help is if we are using different sources of data. Let's say for example we have an Excel sheet where we have selected our highest revenue-generating products for each product line. We bring it into a data module and join it to our billion-row Order Details table in the data warehouse. Because the sources are different, when querying for a report that joins the two sources, Cognos will run two separate queries – one for the spreadsheet and one for the enormous orders table. If we use a Unique values filter, it will take the unique product numbers from the product spreadsheet and use them to filter the orders warehouse table by those values.

Excel query:

SELECT
    'Page1_10'.'Product_number', 
    'Page1_10'.'Product_line', 
    'Page1_10'.'Product_type', 
    'Page1_10'.'Product', 
    'Page1_10'.'Revenue'
FROM
      'GO_sales_for_Joins_-_top_5_products_for_each_line_xlsx_Page1_1' 'Page1_10'

Order Details table query:

SELECT
    "ORDER_DETAILS"."PRODUCT_NUMBER" AS "PRODUCT_NUMBER", 
    "ORDER_DETAILS"."QUANTITY" AS "QUANTITY", 
    "ORDER_DETAILS"."UNIT_SALE_PRICE" AS "UNIT_SALE_PRICE"
FROM
    "GOSALES"."gosales"."ORDER_DETAILS" "ORDER_DETAILS" 
WHERE 
    "ORDER_DETAILS"."PRODUCT_NUMBER" IN ( 
 13110, 26110, 106110, 44110, 90110, 86110, 97110, 144180, 85110, 11110, 12110, 105110, 102110, 107110, 62110, 48110, 45110, 94110, 73110, 145170 )

In this situation, we should see a faster query return from the Order Details query since we are only focusing on a few products.

Range of values (BETWEEN)

The Range of values filters the many side with a BETWEEN operator using the minimum value and maximum value applied to the join keys from the one side.

So, in our first example of Product and Order Details from the same source, when the SQL is generated, we should see this (remember – Product is on the one side and Order Details is on the many side):

WITH
"ORDER_DETAILS0" AS
    (
    SELECT
        "ORDER_DETAILS"."PRODUCT_NUMBER" AS "PRODUCT_NUMBER",
        "ORDER_DETAILS"."QUANTITY" AS "QUANTITY",
        "ORDER_DETAILS"."ORDER_DETAIL_CODE" AS "ORDER_DETAIL_CODE",
        "ORDER_DETAILS"."ORDER_NUMBER" AS "ORDER_NUMBER",
        "ORDER_DETAILS"."SHIP_DATE" AS "SHIP_DATE",
        "ORDER_DETAILS"."PROMOTION_CODE" AS "PROMOTION_CODE",
        "ORDER_DETAILS"."UNIT_COST" AS "UNIT_COST",
        "ORDER_DETAILS"."UNIT_PRICE" AS "UNIT_PRICE",
        "ORDER_DETAILS"."UNIT_SALE_PRICE" AS "UNIT_SALE_PRICE"
    FROM
        "GOSALES"."gosales"."ORDER_DETAILS" "ORDER_DETAILS"
    WHERE
        "ORDER_DETAILS"."PRODUCT_NUMBER" BETWEEN 1110 AND 154150
    ),
"ORDER_DETAILS_2" AS 
    (
    SELECT
        "ORDER_DETAILS0"."PRODUCT_NUMBER" AS "PRODUCT_NUMBER", 
        "ORDER_DETAILS0"."QUANTITY" AS "QUANTITY"
    FROM
        "ORDER_DETAILS0"
    )
SELECT
    "PRODUCT_2"."PRODUCT_TYPE_CODE" AS "Product_Type_Code",
    SUM("ORDER_DETAILS_2"."QUANTITY") AS "Quantity"
FROM
    "GOSALES"."gosales"."PRODUCT" "PRODUCT_2"
        INNER JOIN "ORDER_DETAILS_2"
        ON "PRODUCT_2"."PRODUCT_NUMBER" = "ORDER_DETAILS_2"."PRODUCT_NUMBER" 
GROUP BY
    "PRODUCT_2"."PRODUCT_TYPE_CODE"

In this case, we are filtering on the range of product numbers between 1110 and 154150.

Again, with the same source, this is not particularly important to do and will likely not make a huge difference in performance.

If we look at the Excel example, we will see this:

SELECT
    'Page1_10'.'Product_number', 
    'Page1_10'.'Product_line', 
    'Page1_10'.'Product_type', 
    'Page1_10'.'Product', 
    'Page1_10'.'Revenue'
FROM
    'ft2656444632_GO_sales_for_Joins_-_top_5_products_for_each_line_xlsx_Page1_1' 'Page1_10'
 
SELECT
    "ORDER_DETAILS"."PRODUCT_NUMBER" AS "PRODUCT_NUMBER", 
    "ORDER_DETAILS"."QUANTITY" AS "QUANTITY", 
    "ORDER_DETAILS"."UNIT_SALE_PRICE" AS "UNIT_SALE_PRICE"
FROM
    "GOSALES"."gosales"."ORDER_DETAILS" "ORDER_DETAILS" 
WHERE 
    "ORDER_DETAILS"."PRODUCT_NUMBER" BETWEEN 11110 AND 145170

This one isn't going to help us with performance since the range is so broad and we are really only focusing on a few products.

When would I use Range of values?

We would probably use Range of values when there are a lot of unique values for a key column and the sources are different and where there are a lot of key values in the many-side query that are outside of the range of values in the one-side table.

It doesn't make any real sense to use Range of values with our previous two examples but if we were joining on time, then it makes sense that we may want to use Range of values.

In this example, we have a relationship between the Go Time dimension (from the Great outdoors data warehouse) and Order Details from the transactional data source.

In this data module, there is a filter on the GO Time Dim that filters on just two years (2012 and 2013).

If we use a Range of values filter, we will see the following SQL:

SELECT
    "GO_TIME_DIM0"."DAY_DATE" AS "DAY_DATE",
    "GO_TIME_DIM0"."MONTH_EN" AS "MONTH_EN"
FROM
    "GOSALESDW"."gosalesdw"."GO_TIME_DIM" "GO_TIME_DIM0"
WHERE
    "GO_TIME_DIM0"."CURRENT_YEAR" IN (
        2012,
        2013 )
SELECT
    "ORDER_DETAILS"."SHIP_DATE" AS "SHIP_DATE", 
    "ORDER_DETAILS"."QUANTITY" AS "QUANTITY"
FROM
    "GOSALES"."gosales"."ORDER_DETAILS" "ORDER_DETAILS"
WHERE
    "ORDER_DETAILS"."SHIP_DATE" BETWEEN CONVERT(DATETIME2, '2012-01-01 00:00:00') AND CONVERT(DATETIME2, '2013-12-31 00:00:00')

We are limiting the Order Details query to just the two years we are filtering on in the GO Time dimension.

We should see an optimization in the query using this filter.

Other options:

The other two options for filtering, Unique values in a subquery (Table) and Unique or range of values, work the same way. Unique values in a subquery works like Unique values but uses a subquery to define the values for the IN predicate. You may not see this in the native SQL but if you look at IBM Cognos SQL you will see that same IN statement.

Unique or range of values option is unusual:  while it ignores cardinality it can use either the either a list of unique values (IN) or a range of values (BETWEEN) from the one side to filter values from the many side.

Conclusion

I hope this clears up (even slightly) a fairly complicated topic. I will admit that I have been skimming over that part of the Relationship dialog when I'm teaching Data Modules. I'm so glad I took the time to get a deeper idea of how this all works. Please let me know in the comments if you have any questions or comments.

Next Steps

We hope you found this article informative. Be sure to subscribe to our newsletter for data and analytics news, updates, and insights that are delivered directly to your inbox.

If you have any questions or would like PMsquare to provide guidance and support for your analytics solution, contact us today.

See this gallery in the original post