There are two things to remember about sets: the more rows we work with, the slower our query will be; and the more columns we return, the more data we have to transfer to the client.
I am going to use the Northgale database. For those unfamiliar with it, Northgale is created by a script developed by Erland Sommarskog that builds a much larger database from Microsoft's sample Northwind database. To give an idea of the size, Northgale has about 6,000 products, 344,000 orders and 2.3 million rows in Order Details. I intend to use relatively big sets in order to better illustrate the differences in performance.
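If you want to verify those numbers on your own copy, a quick row count will do (this is only a sanity check, not part of the tuning itself):

-- assumes you are connected to the Northgale database
SELECT COUNT(*) AS ProductCount FROM Products;
SELECT COUNT(*) AS OrderCount FROM Orders;
SELECT COUNT(*) AS OrderDetailCount FROM [Order Details];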
Consider the following query:
SELECT *
FROM Products p
INNER JOIN [Order Details] od ON p.ProductId = od.ProductId
INNER JOIN Orders o ON o.OrderID = od.OrderID
INNER JOIN Customers c ON c.CustomerId = o.CustomerId
WHERE p.Discontinued = 0
AND o.RequiredDate > '05/04/1997'
AND od.Quantity * od.UnitPrice > 100
The query above returns about 1.5 million rows and takes about 1 minute and 36 seconds to complete on my machine.
Clearly not fast enough.
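In case you want to reproduce the timings, SQL Server can report them for you. A minimal way to do it, assuming you run the query from Management Studio (the numbers will naturally differ on your hardware):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- run the query under test here; elapsed and CPU times, plus the
-- logical reads per table, are printed to the Messages tab

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;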
The first thing to always remember is that, logically, the WHERE clause is applied after the joins. This means that before the WHERE conditions are evaluated in this example, every row in the tables takes part in the joins. The first join matches all products with all their order details, and since every order detail row has a product id, we end up with the entire Order Details table (2.3 million rows) to join with Orders, and then with Customers. No wonder the query takes 96 seconds!
So, what can we do about this?
The smaller the number of rows you work with, the better, so in the example above we should be looking to restrict the set as early as possible.
SELECT *
FROM Products p
INNER JOIN [Order Details] od ON p.ProductId = od.ProductId
AND p.Discontinued = 0
AND od.Quantity * od.UnitPrice > 100
INNER JOIN Orders o ON o.OrderID = od.OrderID
AND o.RequiredDate > '05/04/1997'
INNER JOIN Customers c ON c.CustomerId = o.CustomerId
What we have done here is move all the conditions from the WHERE clause into the joins, trying to apply each one as early as possible, i.e. the first time we touch the table it applies to.
This reduces the execution time by 30 seconds, to 1 minute and 6 seconds.
In Northgale the Products table has 10 columns with a total size of 87 bytes per row. Orders adds 200 bytes per row, Order Details 22 and Customers 268, which makes a total of 577 bytes per row. If the query returns 1.5 million rows, that gives a massive return set of about 825 megabytes (577 bytes × 1.5 million rows ≈ 865,500,000 bytes). Let's see what happens if we return only a small number of columns.
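By the way, if you want a rough per-row size for any table, the catalog views can give you one. This is only an approximation, since variable-length columns are counted at their declared maximum:

-- max_length is the declared maximum (-1 for MAX types), so read the
-- result as an upper bound rather than the actual average row size
SELECT t.name AS TableName, SUM(c.max_length) AS MaxBytesPerRow
FROM sys.tables t
INNER JOIN sys.columns c ON c.object_id = t.object_id
WHERE t.name IN ('Products', 'Orders', 'Order Details', 'Customers')
GROUP BY t.name;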
SELECT p.ProductName, c.CompanyName, AmountPaid = od.Quantity * od.UnitPrice
FROM Products p
INNER JOIN [Order Details] od ON p.ProductId = od.ProductId
AND p.Discontinued = 0
AND od.Quantity * od.UnitPrice > 100
INNER JOIN Orders o ON o.OrderID = od.OrderID
AND o.RequiredDate > '05/04/1997'
INNER JOIN Customers c ON c.CustomerId = o.CustomerId
This one takes only 20 seconds, almost five times faster than our initial query.
As mentioned earlier, I chose a huge set to make the differences in execution speed easier to see. In reality it is very unlikely that you will need to return 1.5 million rows. That isn't practical even for one of those giant "end of year" bank reports we used to generate years ago so that bank officers could waste 20,000 pages printing them. And even if you do have to select more than a million rows, you will most likely want to page through them, as in the sketch below.
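A minimal sketch of the ROW_NUMBER() approach to paging (SQL Server 2005 and later; on 2012+ the OFFSET ... FETCH syntax is simpler). The page size and the ordering are arbitrary choices here:

;WITH NumberedRows AS
(
    SELECT ProductName = p.ProductName,
           CompanyName = c.CompanyName,
           AmountPaid = od.Quantity * od.UnitPrice,
           RowNum = ROW_NUMBER() OVER (ORDER BY od.OrderID, od.ProductID)
    FROM Products p
    INNER JOIN [Order Details] od ON p.ProductId = od.ProductId
        AND p.Discontinued = 0
        AND od.Quantity * od.UnitPrice > 100
    INNER JOIN Orders o ON o.OrderID = od.OrderID
        AND o.RequiredDate > '05/04/1997'
    INNER JOIN Customers c ON c.CustomerId = o.CustomerId
)
SELECT ProductName, CompanyName, AmountPaid
FROM NumberedRows
WHERE RowNum BETWEEN 1 AND 50; -- first page of 50 rows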
Of course, with smaller sets the performance gains will be smaller, but the changes we made to this query are valid for any other query, even when the speed improvement is less obvious. If your set is really small, the optimizer will notice and may not use indexes at all, since scanning such a small set is faster than going through an index. You can check what it decided by looking at the execution plan.
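One simple way to see what the optimizer decided, without actually executing the query, is to ask for the plan as text (graphical plans in Management Studio show the same information):

SET SHOWPLAN_TEXT ON;
GO

-- the query below is compiled but not run; the plan comes back instead,
-- showing whether each table is accessed by a scan or an index seek
SELECT p.ProductName, c.CompanyName, AmountPaid = od.Quantity * od.UnitPrice
FROM Products p
INNER JOIN [Order Details] od ON p.ProductId = od.ProductId
INNER JOIN Orders o ON o.OrderID = od.OrderID
INNER JOIN Customers c ON c.CustomerId = o.CustomerId;
GO

SET SHOWPLAN_TEXT OFF;
GO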
Hopefully you will find this practice useful and will remember the two pieces of advice: restrict sets as early as possible and return only the columns you need.