Select Page

Generals are always preparing to fight the last war.

-proverb

IT professionals are always learning from our mistakes and trying to improve upon our past performance.  And like military leaders, we often get caught out using tactics that are no longer appropriate for our new environment. This can be particularly difficult for IT workers as new software or new features in existing software change our environment quickly.

There is a belief among Business Intelligence professionals that  pre-calculating  and storing calculated values in data warehouses always results in better performance at query time.  The truth is that is advice is not accurate for current versions of relational databases.  Pre-calculating values improves query performance when storing the calculated value has little effect on the row length and the calculation is complex. When the calculation increases the row length significantly or the calculation is trivial, better performance results from performing the calculation at query time, especially if the database server is I/O bound.

Example

Most data warehouses have a date dimension. This table typically stores a dozen or more columns, many of which are simple derivatives of the others. The AdventureWorksDW2012 sample database for SQL Server 2012 has the following table:


CREATE TABLE [dbo].[DimDate](
[DateKey] [int] NOT NULL,
[FullDateAlternateKey] [date] NOT NULL,
[DayNumberOfWeek] [tinyint] NOT NULL,
[EnglishDayNameOfWeek] [nvarchar](10) NOT NULL,
[SpanishDayNameOfWeek] [nvarchar](10) NOT NULL,
[FrenchDayNameOfWeek] [nvarchar](10) NOT NULL,
[DayNumberOfMonth] [tinyint] NOT NULL,
[DayNumberOfYear] [smallint] NOT NULL,
[WeekNumberOfYear] [tinyint] NOT NULL,
[EnglishMonthName] [nvarchar](10) NOT NULL,
[SpanishMonthName] [nvarchar](10) NOT NULL,
[FrenchMonthName] [nvarchar](10) NOT NULL,
[MonthNumberOfYear] [tinyint] NOT NULL,
[CalendarQuarter] [tinyint] NOT NULL,
[CalendarYear] [smallint] NOT NULL,
[CalendarSemester] [tinyint] NOT NULL,
[FiscalQuarter] [tinyint] NOT NULL,
[FiscalYear] [smallint] NOT NULL,
[FiscalSemester] [tinyint] NOT NULL,
CONSTRAINT [PK_DimDate_DateKey] PRIMARY KEY CLUSTERED
(
[DateKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
CONSTRAINT [AK_DimDate_FullDateAlternateKey] UNIQUE NONCLUSTERED
(
[FullDateAlternateKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

I have a populated this table with one row per day for one century (36709 rows). This is larger than an actual data warehouse would use, but the larger data volume makes the example clearer. In its default format it uses 149 bytes per row.


exec sp_spaceused N'DimDate'
name rows reserved data index_size unused
--------- ------ ---------------------- ----------------- ----------------- ------------------
DimDate 36709 6352 KB 5344 KB 936 KB 72 KB

For comparison I created a table with computed columns for the varchars.

 

CREATE TABLE [dbo].[calcLabel] (
[DateKey] [int] NOT NULL,
[FullDateAlternateKey] [date] NOT NULL,
[DayNumberOfWeek] [tinyint] NOT NULL,
[EnglishDayNameOfWeek] as CASE [DayNumberOfWeek]
WHEN 1 THEN ‘Sunday’
WHEN 2 THEN ‘Monday’
WHEN 3 THEN ‘Tuesday’
WHEN 4 THEN ‘Wednesday’
WHEN 5 THEN ‘Thursday’
WHEN 6 THEN ‘Friday’
WHEN 7 THEN ‘Saturday’
END,
[SpanishDayNameOfWeek] as CASE [DayNumberOfWeek]
WHEN 1 THEN ‘Domingo’
WHEN 2 THEN ‘Lunes’
WHEN 3 THEN ‘Martes’
WHEN 4 THEN ‘Miércoles’
WHEN 5 THEN ‘Jueves’
WHEN 6 THEN ‘Viernes’
WHEN 7 THEN ‘Sábado’
END,
[FrenchDayNameOfWeek] as CASE [DayNumberOfWeek]
WHEN 1 THEN ‘Dimanche’
WHEN 2 THEN ‘Lundi’
WHEN 3 THEN ‘Mardi’
WHEN 4 THEN ‘Mercredi’
WHEN 5 THEN ‘Jeudi’
WHEN 6 THEN ‘Vendredi’
WHEN 7 THEN ‘Samedi’
END,
[DayNumberOfMonth] [tinyint] NOT NULL,
[DayNumberOfYear] [smallint] NOT NULL,
[WeekNumberOfYear] [tinyint] NOT NULL,
[EnglishMonthName] as CASE [MonthNumberOfYear]
WHEN 1 THEN ‘January’
WHEN 2 THEN ‘February’
WHEN 3 THEN ‘March’
WHEN 4 THEN ‘April’
WHEN 5 THEN ‘May’
WHEN 6 THEN ‘June’
WHEN 7 THEN ‘July’
WHEN 8 THEN ‘August’
WHEN 9 THEN ‘September’
WHEN 10 THEN ‘October’
WHEN 11 THEN ‘November’
WHEN 12 THEN ‘December’
END,
[SpanishMonthName] as CASE [MonthNumberOfYear]
WHEN 1 THEN ‘Enero’
WHEN 2 THEN ‘Febrero’
WHEN 3 THEN ‘Marzo’
WHEN 4 THEN ‘Abril’
WHEN 5 THEN ‘Mayo’
WHEN 6 THEN ‘Junio’
WHEN 7 THEN ‘Julio’
WHEN 8 THEN ‘Agosto’
WHEN 9 THEN ‘Septiembre’
WHEN 10 THEN ‘Octubre’
WHEN 11 THEN ‘Noviembre’
WHEN 12 THEN ‘Diciembre’
END,
[FrenchMonthName] as CASE [MonthNumberOfYear]
WHEN 1 THEN ‘Janvier’
WHEN 2 THEN ‘Février’
WHEN 3 THEN ‘Mars’
WHEN 4 THEN ‘Avril’
WHEN 5 THEN ‘Mai’
WHEN 6 THEN ‘Juin’
WHEN 7 THEN ‘Juillet’
WHEN 8 THEN ‘Août’
WHEN 9 THEN ‘Septembre’
WHEN 10 THEN ‘Octobre’
WHEN 11 THEN ‘Novembre’
WHEN 12 THEN ‘Décembre’
END,
[MonthNumberOfYear] [tinyint] NOT NULL,
[CalendarQuarter] [tinyint] NOT NULL,
[CalendarYear] [smallint] NOT NULL,
[CalendarSemester] [tinyint] NOT NULL,
[FiscalQuarter] [tinyint] NOT NULL,
[FiscalYear] [smallint] NOT NULL,
[FiscalSemester] [tinyint] NOT NULL,
CONSTRAINT [PK_CalcLabel_DateKey] PRIMARY KEY CLUSTERED
(
[DateKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY];
GO

insert into [calcLabel] (
[DateKey]
,[FullDateAlternateKey]
,[DayNumberOfWeek]
,[DayNumberOfMonth]
,[DayNumberOfYear]
,[WeekNumberOfYear]
,[MonthNumberOfYear]
,[CalendarQuarter]
,[CalendarYear]
,[CalendarSemester]
,[FiscalQuarter]
,[FiscalYear]
,[FiscalSemester])
select [DateKey]
,[FullDateAlternateKey]
,[DayNumberOfWeek]
,[DayNumberOfMonth]
,[DayNumberOfYear]
,[WeekNumberOfYear]
,[MonthNumberOfYear]
,[CalendarQuarter]
,[CalendarYear]
,[CalendarSemester]
,[FiscalQuarter]
,[FiscalYear]
,[FiscalSemester]
from [DimDate];
GO

exec sp_spaceused N’calcLabel’

name rows reserved data index_size unused
————– ————- —————— —————— —————- ——– calcLabel 36709 1672 KB 1648 KB 16 KB 8 KB

 

We saved over 100 bytes per row (46 bytes instead of 149 bytes per row). That directly results in less I/O, less cache space used, and faster query results for queries that don’t access the computed columns.

So now our environment is ready for some testing. I’ve run my tests with SQL Server 2012 (11.0.2218) running on my laptop.

What about queries that do access those computed columns?


set statistics IO on
set statistics time on

dbcc dropcleanbuffers;

select [EnglishDayNameOfWeek], sum(len([FrenchMonthName])) as charLength
from [calcLabel]
group by [EnglishDayNameOfWeek];

select [EnglishDayNameOfWeek], sum(len([FrenchMonthName])) as charLength
from [DimDate]
group by [EnglishDayNameOfWeek];

 

Results:

 

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘calcLabel’. Scan count 1, logical reads 208, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 47 ms, elapsed time = 23 ms.

(7 row(s) affected)
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘DimDate’. Scan count 1, logical reads 672, physical reads 2, read-ahead reads 676, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 44 ms.

 

As we should expect, the computed columns have less I/O (208 vs 672) and had a smaller elapsed time (23ms vs 44 ms), but did use more CPU (47ms vs 16ms).

What happens when all of our pages are in the cache?


(7 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'calcLabel'. Scan count 1, logical reads 208, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 31 ms, elapsed time = 26 ms.

(7 row(s) affected)
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘DimDate’. Scan count 1, logical reads 672, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 20 ms.

So our test result aligns with our expectations. The computed column version is CPU bound, so it benefits little from being cached. The stored column version is I/O bound so performs much better when cached. The number of logical reads is so much smaller for the computed columns that the overall performance is comparable from the cache.

Going too far?

The more daring reader will already have noticed that there are actually only two columns of information stored on each row, a calendar date and a fiscal period. What happens when we compute all of the other columns?

 

CREATE TABLE [dbo].[calcDate]
([FullDateAlternateKey] [date] NOT NULL
,[FiscalYearStart] [date] NOT NULL
,[DateKey] as (datepart(year,[FullDateAlternateKey])*1E4)+(datepart(month,[FullDateAlternateKey])*1E2)+(datepart(day,[FullDateAlternateKey]))
,[DayNumberOfWeek] as DATEPART(dw, [FullDateAlternateKey])
,[EnglishDayNameOfWeek] as CASE DATEPART(dw, [FullDateAlternateKey])
WHEN 1 THEN ‘Sunday’
WHEN 2 THEN ‘Monday’
WHEN 3 THEN ‘Tuesday’
WHEN 4 THEN ‘Wednesday’
WHEN 5 THEN ‘Thursday’
WHEN 6 THEN ‘Friday’
WHEN 7 THEN ‘Saturday’
END
,[SpanishDayNameOfWeek] as CASE DATEPART(dw, [FullDateAlternateKey])
WHEN 1 THEN ‘Domingo’
WHEN 2 THEN ‘Lunes’
WHEN 3 THEN ‘Martes’
WHEN 4 THEN ‘Miércoles’
WHEN 5 THEN ‘Jueves’
WHEN 6 THEN ‘Viernes’
WHEN 7 THEN ‘Sábado’
END
,[FrenchDayNameOfWeek] as CASE DATEPART(dw, [FullDateAlternateKey])
WHEN 1 THEN ‘Dimanche’
WHEN 2 THEN ‘Lundi’
WHEN 3 THEN ‘Mardi’
WHEN 4 THEN ‘Mercredi’
WHEN 5 THEN ‘Jeudi’
WHEN 6 THEN ‘Vendredi’
WHEN 7 THEN ‘Samedi’
END
,[DayNumberOfMonth] as DATEPART(d, [FullDateAlternateKey])
,[DayNumberOfYear] as DATEPART(dy, [FullDateAlternateKey])
,[WeekNumberOfYear] as DATEPART(wk, [FullDateAlternateKey])
,[EnglishMonthName] as CASE DATEPART(m, [FullDateAlternateKey])
WHEN 1 THEN ‘January’
WHEN 2 THEN ‘February’
WHEN 3 THEN ‘March’
WHEN 4 THEN ‘April’
WHEN 5 THEN ‘May’
WHEN 6 THEN ‘June’
WHEN 7 THEN ‘July’
WHEN 8 THEN ‘August’
WHEN 9 THEN ‘September’
WHEN 10 THEN ‘October’
WHEN 11 THEN ‘November’
WHEN 12 THEN ‘December’
END
,[SpanishMonthName] as CASE DATEPART(m, [FullDateAlternateKey])
WHEN 1 THEN ‘Enero’
WHEN 2 THEN ‘Febrero’
WHEN 3 THEN ‘Marzo’
WHEN 4 THEN ‘Abril’
WHEN 5 THEN ‘Mayo’
WHEN 6 THEN ‘Junio’
WHEN 7 THEN ‘Julio’
WHEN 8 THEN ‘Agosto’
WHEN 9 THEN ‘Septiembre’
WHEN 10 THEN ‘Octubre’
WHEN 11 THEN ‘Noviembre’
WHEN 12 THEN ‘Diciembre’
END
,[FrenchMonthName] as CASE DATEPART(m, [FullDateAlternateKey])
WHEN 1 THEN ‘Janvier’
WHEN 2 THEN ‘Février’
WHEN 3 THEN ‘Mars’
WHEN 4 THEN ‘Avril’
WHEN 5 THEN ‘Mai’
WHEN 6 THEN ‘Juin’
WHEN 7 THEN ‘Juillet’
WHEN 8 THEN ‘Août’
WHEN 9 THEN ‘Septembre’
WHEN 10 THEN ‘Octobre’
WHEN 11 THEN ‘Novembre’
WHEN 12 THEN ‘Décembre’
END
,[MonthNumberOfYear] as DATEPART(m, [FullDateAlternateKey])
,[CalendarQuarter] as DATEPART(q, [FullDateAlternateKey])
,[CalendarYear] as DATEPART(yyyy, [FullDateAlternateKey])
,[CalendarSemester] as ((DATEPART(q, [FullDateAlternateKey]) + 1) / 2)
,[FiscalQuarter] as DATEPART(q, [FullDateAlternateKey]) – DATEPART(q,[FiscalYearStart])
,[FiscalYear] as DATEPART(yyyy, [FiscalYearStart])
,[FiscalSemester] as ((DATEPART(q, [FullDateAlternateKey]) – DATEPART(q,[FiscalYearStart])) / 2)
CONSTRAINT [PK_CalcDate_DateKey] PRIMARY KEY CLUSTERED
(
[FullDateAlternateKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

I then populated this table with the same 2011 rows from DimDate.
insert into calcDate ([FullDateAlternateKey],[FiscalYearStart])
select [FullDateAlternateKey]
, dateadd(month,6, DATEADD(YEAR, DATEDIFF(YEAR, 0, [FullDateAlternateKey]), 0))

from DimDate;

 

This new table, of course, is much smaller still, at 29 bytes per row.

 

exec sp_spaceused N’calcDate’

name rows reserved data index_size unused
——— —— ———————- —————– —————– ——————
calcDate 36709 1096 KB 1056 KB 16 KB 24 KB

 

The test query reads the whole table and references three calculated columns.  All three of these columns use fairly expensive date functions.


set statistics IO on
set statistics time on

dbcc dropcleanbuffers;

select FiscalYear, FiscalQuarter, sum(DayNumberOfYear) as dayTotal
from calcDate
group by FiscalYear, FiscalQuarter;

select FiscalYear, FiscalQuarter, sum(DayNumberOfYear) as dayTotal
from DimDate
group by FiscalYear, FiscalQuarter;

 

Results

 

SQL Server parse and compile time:
CPU time = 8 ms, elapsed time = 8 ms.

(403 row(s) affected)
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘calcDate’. Scan count 1, logical reads 134, physical reads 0, read-ahead reads 22, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 28 ms.
SQL Server parse and compile time:
CPU time = 3 ms, elapsed time = 3 ms.

(403 row(s) affected)
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘DimDate’. Scan count 1, logical reads 672, physical reads 2, read-ahead reads 676, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 55 ms.

 

If we re-run the tests with cached pages/buffers/data:

 

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 1 ms.

(403 row(s) affected)
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘calcDate’. Scan count 1, logical reads 134, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 28 ms.

(403 row(s) affected)
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘DimDate’. Scan count 1, logical reads 672, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 13 ms.

 

Summary of Results

Statistic Stored Computed
From disk – elapsed time 55 28
From disk – logical reads 672 134
From RAM – elapsed time 13 28
From RAM – logical reads 672 134

With the expensive calculations, when everything is in memory and with the hardware completely idle, the query run time is less for the stored columns. But as soon as we have to go to disk, the run time for the stored column is greatly increased by the I/O cost.

Advice

1. Never store the results of a simple arithmetic operation (adding, subtracting, multiplying,dividing).

2. If you are I/O bound, which is typical, only store results of expensive calculations.

3. If you are CPU bound,  store the result of a mathematical function when that storage increases the row length by a much smaller margin than the frequency that is is accessed. Adding a precomputed 4-byte integer to a 20-byte row is only a good idea if something like half of queries against the table use that column.

PERSIST

If you do decide to store a column that is the result of a calculation of the other columns, use a PERSIST’ed computed column. That ensures that every row in the table uses a consistent calculation. It is otherwise identical to just storing a column.

We run regular business intelligence courses in both Wellington and Auckland. Find out more here.

%d bloggers like this: