Friday, 15 August 2008

SQL Server Best Practices 1. Big sets, small sets


There are two things to remember about sets: the more rows we work with, the slower our query will be; and the more columns we return, the less efficient the transfer of data will be.
I am going to use the Northgale database. For those unfamiliar with it, Northgale is created by a script developed by Erland Sommarskog that builds a much larger database based on Microsoft's Northwind sample. Just to illustrate the size, Northgale has about 6,000 products, 344,000 orders and 2.3 million rows in Order Details. I intend to use relatively big sets in order to illustrate the differences in performance more clearly.
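If you want to check what your copy of Northgale looks like before running the queries below, a quick row count is enough. A minimal sketch; the exact numbers depend on the version of the generation script you ran:


-- Rough sanity check of the table sizes quoted above
SELECT COUNT(*) AS ProductCount FROM Products
SELECT COUNT(*) AS OrderCount FROM Orders
SELECT COUNT(*) AS OrderDetailCount FROM [Order Details]
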
Consider the following query:


SELECT *
FROM Products p
INNER JOIN [Order Details] od ON p.ProductId = od.ProductId
INNER JOIN Orders o ON o.OrderID = od.OrderID
INNER JOIN Customers c ON c.CustomerId = o.CustomerId
WHERE Discontinued = 0
AND o.RequiredDate > '05/04/1997'
AND od.Quantity * od.UnitPrice > 100



The returned set from the query above contains about 1.5 million rows and takes about 1 minute and 36 seconds to complete on my machine.

Clearly not fast enough.

The first thing to always remember is that the WHERE clause is applied at the end. This means that before the WHERE conditions are applied in this example, all rows in the tables have to be used to perform the joins. The first join matches all products against all order details, and assuming every order detail has a product id we end up with the entire Order Details table (2.3 million rows) to join with Orders, and then again with Customers. No wonder the query takes 96 seconds!
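If you want to see where the work goes rather than just timing the query by hand, SQL Server can report the CPU time and the I/O per table for each statement. A minimal sketch; run it in the same session and check the Messages tab (your figures will differ from mine):


SET STATISTICS IO ON
SET STATISTICS TIME ON

-- run the query from above here; the Messages tab will show the logical reads
-- per table and the CPU/elapsed time for the statement

SET STATISTICS IO OFF
SET STATISTICS TIME OFF
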

So, what can we do about this?

The smaller the number of rows you work with, the better. In the example above we should be looking at restricting the set as early as possible:


SELECT *
FROM Products p
INNER JOIN [Order Details] od ON p.ProductId = od.ProductId
AND Discontinued = 0
AND od.Quantity * od.UnitPrice > 100
INNER JOIN Orders o ON o.OrderID = od.OrderID
AND o.RequiredDate > '05/04/1997'
INNER JOIN Customers c ON c.CustomerId = o.CustomerId


What we have done here is move all the conditions from the WHERE clause into the joins, and we apply each one as early as possible – that is, the first time we use the table it refers to.

This reduces the execution time by 30 seconds, to 1 minute and 6 seconds.


In Northgale the Products table has 10 columns with a total size of 87 bytes per row. Orders adds another 200 bytes per row, Order Details 22 and Customers 268, which makes a total of 577 bytes per row. With about 1.5 million rows returned, that is a massive 825 megabytes of data to send back. Let's see what happens if we return only a small number of columns.


SELECT ProductName, CompanyName, AmountPaid = od.Quantity * od.UnitPrice
FROM Products p
INNER JOIN [Order Details] od ON p.ProductId = od.ProductId
AND Discontinued = 0
AND od.Quantity * od.UnitPrice > 100
INNER JOIN Orders o ON o.OrderID = od.OrderID
AND o.RequiredDate > '05/04/1997'
INNER JOIN Customers c ON c.CustomerId = o.CustomerId


This one takes only 20 seconds, which is roughly a 400% improvement over our initial query – almost five times faster.
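
As a side note, the per-row sizes quoted above can be estimated straight from the catalog. A rough sketch using sys.columns (SQL Server 2005 onwards); it sums the declared maximum width of each column, so variable-length columns are counted at their maximum and the result is an upper bound rather than the exact on-disk size:


SELECT t.name AS TableName, SUM(c.max_length) AS MaxRowBytes
FROM sys.tables t
INNER JOIN sys.columns c ON c.object_id = t.object_id
WHERE t.name IN ('Products', 'Orders', 'Order Details', 'Customers')
GROUP BY t.name
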
As mentioned earlier, I have chosen a huge set to illustrate the differences in execution speed. In reality it is very unlikely that you will need to return 1.5 million rows. It isn't exactly practical even for one of those large "end of year" bank reports we used to generate years ago so that bank officers could waste 20,000 pages printing them. And I am sure that even if you do have to select more than a million rows, you will most likely need to use paging.

Of course, with smaller sets the performance gains will be smaller, but the changes we made to this query apply to any other query, even if the speed improvements are less obvious. If your set is really small the optimizer will notice that and will not use indexes, as scanning such a small set is faster.

Hopefully you will find this practice useful and will remember the two pieces of advice: restrict sets as early as possible and return only the columns you need.

Friday, 8 August 2008

Optimize me!

Finally, yesterday I decided to actually read one of the newsletters I get flooded with by SQLServerCentral, and I discovered a really good T-SQL script to help with identifying slow queries.
The original post is on SQLServerCentral but, as with everything I like, I have made a few small changes and would like to post it here so I can find it quickly when I need it. Oh, and I might read these newsletters a bit more often from now on ;)

Please note: the code below will only work on SQL Server 2005 or above.


USE master
GO

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[dba_QueryTimeDelta]') AND type in (N'P', N'PC'))
EXEC sys.sp_executesql N'CREATE PROCEDURE [dbo].[dba_QueryTimeDelta] AS SELECT ''this is just temp. please replace with actual code.'''
GO


ALTER PROC [dbo].[dba_QueryTimeDelta]
@DatabaseName nvarchar(255) = ''

AS
/*----------------------------------------------------------------------
Purpose: Identify queries that are running slower than normal, when taking into account IO volumes.
------------------------------------------------------------------------
Parameters: @DatabaseName, optional

Revision History:
13/01/2008 Ian_Stirk@yahoo.com Initial version
08/08/2008 plamen.balkanski.net Added filtering by Database name

Example Usage:
1. exec YourServerName.master.dbo.dba_QueryTimeDelta
2. exec master.dbo.dba_QueryTimeDelta 'AllItems20080808'

----------------------------------------------------------------------*/

BEGIN

-- Do not lock anything, and do not get held up by any locks.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
-- Identify queries running slower than normal.
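-- Note on the calculations below: [Avg time] is the average worker time over all runs
-- except the most recent one (total minus last, divided by execution_count - 1), and
-- [Time Deviation] shows how far the last run strayed from that average. The
-- execution_count > 1 filter in the WHERE clause keeps the divisor above zero.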
SELECT TOP 100 [Runs] = qs.execution_count
, [Total time] = qs.total_worker_time - qs.last_worker_time
, [Avg time] = (qs.total_worker_time - qs.last_worker_time) /(qs.execution_count - 1)
, [Last time] = qs.last_worker_time
, [Time Deviation] = (qs.last_worker_time - ((qs.total_worker_time - qs.last_worker_time) /(qs.execution_count - 1)))
, [% Time Deviation] = CASE WHEN qs.last_worker_time = 0 THEN 100
      ELSE (qs.last_worker_time - ((qs.total_worker_time - qs.last_worker_time) / (qs.execution_count - 1))) * 100
      END / (((qs.total_worker_time - qs.last_worker_time) / (qs.execution_count - 1)))
, [Last IO] = last_logical_reads + last_logical_writes + last_physical_reads
, [Avg IO] = ((total_logical_reads + total_logical_writes + total_physical_reads)
      - (last_logical_reads + last_logical_writes + last_physical_reads)) / (qs.execution_count - 1)
, [Individual Query] = SUBSTRING(qt.text, qs.statement_start_offset/2,
      (CASE WHEN qs.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2
            ELSE qs.statement_end_offset END - qs.statement_start_offset)/2)
, [Parent Query] = qt.text
, [DatabaseName] = DB_NAME(qt.dbid)
INTO #SlowQueries
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.plan_handle) qt
WHERE qs.execution_count > 1
AND DB_NAME(qt.dbid) = CASE WHEN @DatabaseName='' THEN DB_NAME(qt.dbid) ELSE @DatabaseName END
ORDER BY [% Time Deviation] DESC

-- Calculate the [IO Deviation] and [% IO Deviation].
-- Negative values mean we did less I/O than average.
SELECT TOP 100 [Runs] ,
[Avg time] ,
[Last time] ,
[Time Deviation] ,
[% Time Deviation] ,
[Last IO] ,
[Avg IO] ,
[IO Deviation] = [Last IO] - [Avg IO] ,
[% IO Deviation] = CASE WHEN [Avg IO] = 0 THEN 0 ELSE ([Last IO]- [Avg IO]) * 100 / [Avg IO] END ,
[Individual Query] ,
[Parent Query] ,
[DatabaseName]
INTO #SlowQueriesByIO
FROM #SlowQueries
ORDER BY [% Time Deviation] DESC
-- Extract items where [% Time deviation] less [% IO deviation] is 'large'
-- These queries are slow running, even when we take into account IO deviation.

SELECT TOP 100 [Runs] ,
[Avg time] ,
[Last time] ,
[Time Deviation] ,
[% Time Deviation] ,
[Last IO] ,
[Avg IO] ,
[IO Deviation] ,
[% IO Deviation] ,
[Impedance] = [% Time Deviation] - [% IO Deviation] ,
[Individual Query] ,
[Parent Query] ,
[DatabaseName]
FROM #SlowQueriesByIO
WHERE [% Time Deviation] - [% IO Deviation] > 20
ORDER BY [Impedance] DESC

-- Tidy up.
DROP TABLE #SlowQueries
DROP TABLE #SlowQueriesByIO

END

-- Test script
-- EXEC master.dbo.dba_QueryTimeDelta 'AllItems20080808'


Wednesday, 6 August 2008

Agile Presentations - free or paid?

I was amazed (if not shocked) to find this website today: http://www.agile-software-development.com
It is selling presentations to help with Agile.

I refuse to accept this.

The Agile community, as is obvious when you look at the search results returned by your favourite search engine, has been trying for years to give these things away for free. Here's some evidence:

http://www.mountaingoatsoftware.com/presentations
www.crisp.se/henrik.kniberg/ScrumAndXpFromTheTrenches.pdf
http://richardsbraindump.blogspot.com/2007/07/agile-presentations.html

and these are just a few of thousands of examples.

And after all, when trying to explain Scrum/XP or agile practices like user stories, you need to understand them yourself – otherwise how do you expect to change other people's views about the way things are done? And if you do understand them, then you will be able to do the presentation yourself. Even if you struggle for ideas, simply look at other people's presentations, articles or even books(!) that are available out there. For free. Not for the "price of a light meal" (10 GBP!?).

PS. I promise to find a more positive topic for my next post ;)