Database Guidelines and Practices

August 3, 2019 · Matthew

This guide contains useful guidelines that I follow when designing and managing databases. Although many of the examples are in T-SQL, I have tried to generalize the rules as much as possible so that they can apply to other platforms. At some point I will probably split this guide into platform-specific sections for rules that can’t be generalized. This post is a work in progress so I will be adding more rules often as I can. Also, please comment below with any questions or suggestions.


Best Practices

Always – Grant the least amount of privilege necessary

Being a good steward of data means making sure that data is protected and secure.

Always – Use an administrative login that is separate from the login that is normally used to log into the workstation

Privileged accounts should only be used when the need arises to limit exposure from exploits or attacks. Use runas to launch apps using the administration login.

Always – Use ANSI SQL statements and data types whenever possible, especially if portability is a requirement

This will make it easier to port the schema and code to other platforms and reduce rework.

Consider – Storing time in UTC instead of local time

Using UTC avoids ambiguity and discrepancies when manipulating or comparing date and time values. UTC dates and times can easily be converted to local dates and times when presented to the end user.

Always – Use ISO 8601 formatted dates and times and compatible data types

ISO 8601 is a standards compliant encoding scheme for dates and times with optional support for time zone offsets. Using ISO 8601 ensures interoperability between systems that exchange and store data. Another advantage, when represented as a string it retains sort ordering when compared with other ISO 8601 encoded strings. ISO 8601 is also ANSI SQL compliant.

SQL Server has several ISO 8601 compatible data types including DATETIME2, DATE, TIME, and DATETIMEOFFSET.

Avoid – Using the DATETIME or SMALLDATETIME data types. (T-SQL)

DATETIME and SMALLDATETIME are not ANSI SQL compliant types. DATETIME2 is ANSI SQL compliant and supports a higher precision and range of values than DATETIME while using the equivalent amount of storage. The level of precision can be reduced to save storage, so DATETIME2 is a suitable replacement for SMALLDATETIME.

Always – Protect against SQL injection

Building a string that uses user provided input for executing ad-hoc SQL statements is asking for trouble. Parameterize every statement that is run against the database. Rely heavily on stored procedures.

Avoid – Using quoted identifiers

Don’t use reserved names, keywords, or white-space in object or column names. Only use quoted names or brackets to reference objects or columns that use reserved keywords.

Always – Use sequences to generate auto incrementing IDs

Sequences are part of the ANSI SQL specification. Avoid using IDENTITY or AUTO_INCREMENT properties because they are platform specific.


Design Patterns

Avoid – Creating tables with excessive NULL columns

Databases are places to store facts. NULL, by definition, represents an unknown or unknowable fact. A good rule of thumb when thinking about NULLs is to determine is the fact unknown or is it optional. If a thing is being described that has a lot of attributes that are optional then they are not good attributes to use to describe that thing in the first place. Maybe they are describing a sub-type of thing or a different thing altogether and should be described in another table.

A good example for describing an unknown fact is an employee termination date. Every employee has a termination date which is not known until the event occurs so feel free to use NULL to describe this fact.

A good example for describing optional facts are employee phone numbers. Not all employees have a home, cell, or company phone. Instead of using NULL make a table that describes employee phone numbers.

Avoid – Using cursors to iterate over a row set

Cursors are expensive and resource intensive and can cause contention issues in high concurrency workloads. Relational databases have many features well suited to working with whole data sets but are very poor at working with data sets one row at a time. Try to come up with a set-based solution instead. If there is a case to truly need to iterate over the row set, then it’s better to do that work in the app or service side instead of the database side.

Always – Ensure that every table has a primary key

Really the only reason to not have a primary key is for a scratch table used for ETL processes.

Always – Use the natural key as the primary key

When a natural key is available it is almost always better to choose that as the primary key. From a data access perspective, it’s convenient because data is almost always accessed by the natural key anyway. It can save unnecessary joins by not having to lookup the key in the foreign key table first. Surrogate keys are an anti-pattern when it comes to normalization and their use should only be used in edge-cases as needed.

Always – Ensure that every table has a clustered index

Tables without a clustered index also called heaps store data in no particular order and so may have inconsistent performance when accessing data. Adding a clustered index ensures that data in the table is sorted and can be more consistently accessed. Also, maintenance is easier with tables that have a clustered index. Indexes on heap tables reference the row identifier on the table. When a heap is rebuilt the row identifier changes so all indexes on the table must also be rebuilt. Secondary indexes on clustered tables reference the clustered index key. This key doesn’t change when the clustered index is rebuilt so secondary indexes are not affected and won’t be rebuilt.

Always – Use collective names for tables and views

Tables and views are a collection of rows and so should use a plural name when being described.

Never – Add type prefixes to object or column names

This just requires more typing without adding anything useful.

Always – Use named constraints

It’s always better to name constraints rather than letting the database engine generate one. It makes maintaining constraints easier because they can be referenced by their name. Below are the naming conventions I use with examples:

Primary Keys and Unique Keys:

{Tablename$ColumnName[{$AdditionalColumns}…]}

Example:

CONSTRAINT [MyThings$ThingID] PRIMARY KEY(ThingID)

Foreign Keys:

{ForeignKeyTable$ColumnName[{$AdditionalColumns}…]_PrimaryKeyTable$ColumnName[{$AdditionalColumns}…]}

Example:

CONSTRAINT [MyStuff$ThingID_MyThings$ThingID] FOREIGN KEY (ThingID) REFERENCES dbo.MyThings(ThingID)

Check Constraints:

{TableName$ColumnName_Condition}

Example:

CONSTRAINT [MyThings$ThingPercent_MustBe0Thru100] CHECK (ThingPercent BETWEEN 0 AND 100)

Default Constraints:

{TableName$ColumnName_Default}

Example:

CONSTRAINT [MyThings$ThingID_Default] DEFAULT (NEXT VALUE FOR dbo.ThingIDs)

Always – Use names for constraints to convey useful information in the event it is violated

It is easier to troubleshoot constraint violations when it is clear what condition was violated.

Example:

The INSERT statement conflicted with the CHECK constraint "MyThings$ThingPercent_MustBe0Thru100"

Avoid – Using abbreviations that are not widely used, well known, or difficult to infer

Everyone knows what a ZIP code is but what the hell is a K2SO droid?

Always – Use verb-noun when naming stored procedures and user defined functions

It should be clear to the reader what action is happening and what the target of that action is.

Example:

EXEC dbo.GetThing @ThingID = 1;

Avoid – Using non-deterministic date/time expressions. (T-SQL)

Comparing a DATE value with literal string representing a date/time such as ’01/04/2000 10:24:00AM’ is non-deterministic because it both relies on implicit type conversion and it can be affected by localization settings on the host machine or instance. There are several ways to ensure that a date/time value is handled deterministically.

ODBC Date/Time Literals:

{ts'2000-01-04 10:24:00'}
{d'2000-01-04'}

Explicit Conversion by Specifying Its Format:

CONVERT(DATE, '2000-01-04 10:24:00', 126)
CONVERT(DATE, '2000-01-04', 126)
CONVERT(DATE, '10:24:00', 126)

Always – Maintain a separation of concerns between the database and application layers

Developers discovered the benefits of having different layers of abstraction between how information is modeled, interacted with, and presented. A similar separation should exist between the application and database. A database management system should only be concerned about information storage and retrieval. This is the space where DBMSs are excellent. Maintaining complex business rules that can’t be described relationally should be a concern for the application not the database.

Avoid – Using entity-attribute-value (EAV) modeling

This is a bad design pattern in disguise. To novice designers it looks elegant and sexy at first glance, but those good looks sure fade fast. The problem usually starts with someone wanting to have an easy way to implement “custom fields”, or they want a way to create “dynamic schemas”.

Inevitably someone will claim to have come up with a simple and brilliant solution. They will propose some variation of a design that uses 3 or more tables. The first table contains columns FieldID and FieldName. The second table contains columns ValueID and Value, maybe more depending on how “Brilliant” your guy thinks he is. The Value is usually some kind of LOB data type because it needs to store any kind of value depending on the Field. The third contains the columns EntityID, FieldID, and ValueID. Now, everyone is excited because they have this “hyper-normalized” solution.

Before I go into the problems of the design, I want to tell you why it’s a bad design pattern. First, this “hyper-normalized” solution is not normalized at all. Normalization isn’t about moving columns into more and different tables, that’s just the outcome. Normalization is about removing the types of redundancies that cause data anomalies. EAV modeling is an anti-pattern to normalization.

EAV modeling throws the relational model out the window and says F this noise. Instead of an attribute having a type it says an attribute is a type. Instead of a value having to conform to the type and constraints of the attribute a value can be anything at all.

Avoid – Writing user defined functions that are non-deterministic

User defined functions can be very handy but also can be very dangerous from a performance point of view. Non-deterministic functions can be even more dangerous. Non-deterministic user defined functions often use design patterns that should be avoided. Non-deterministic functions will very often confuse the compiler and cause bad execution plans when they are called in queries.

In T-SQL using WITH SCHEMABINDING when creating the function will cause SQL Server to evaluate if the function is deterministic and if not will not allow it to be created.


Code Quality

Everyone has their own preferences regarding how they like to structure their code. I will not get into tabs vs spaces or conventions regarding camel case, pascal case, snake case. I prefer to use tabs and Pascal case, but I think consistency is more important so if I inherit a system that uses a different convention, I will use that convention instead. I will say that people who use FULL_CAPS_WITH_UNDERSCORES need to hit themselves over the head with their keyboard.

I tend to favor formatting rules that are text editor friendly with the assumption that code formatting tools may not always be available. These rules should be able to be applied as code is typed with very little reworking of whitespace and horizontal alignment.

Always – Use full caps when using keywords, built-in functions, and built-in variables

Using full caps helps distinguish functional portions of SQL code from the declarative elements and objects. Built-in functions refer to the functions that are part of the SQL language. For user-defined functions that are system created within the databases use the same casing as they are declared.

Examples:

SELECT 'Hello world.'
SELECT SYSDATETIME()
SELECT @@SERVERNAME

Always – Use a semicolon at the end of a statement

Terminating statements with a semicolon let’s readers know when the end of the statement has been reached and is required by many SQL dialects.

Example:

SELECT 'Hello world.';

Always – Add an alias for every returned column or object name using the AS keyword

Adding an alias using the AS keyword makes it clearer to the reader.

Examples:

SELECT 'Hello world.' AS Greeting;
SELECT ThingID
FROM MyThings AS t;

Never – Alias column names using an equals sign

Some SQL dialects allow aliases to be declared like this:

Bad Example:

SELECT Greeting = 'Hello world.';

This is an older style and somewhat archaic. It can be confusing since the equals sign is also used for both logical comparison and value assignment operators.

Always – Make a new line when introducing a new keyword or clause

Adding a new line helps break down the different functional elements in a statement.

Example:

SELECT ThingID
FROM MyThings;

Always – Use the two-part name (schema.object) to identify database objects

There reasons are myriad and well documented. It’s better to get into the habit.

Examples:

SELECT ThingID
FROM dbo.MyThings;
EXEC dbo.MyProcedure 'Param1';

Always – Introduce additional elements of a column set, parameter set, or query predicate on a new line and indent

It’s easier to read code down than it is across. Indenting makes it clear that it is a continuation of the previous clause.

Examples:

SELECT ThingID
	, ThingType
FROM dbo.MyThings
WHERE ThingName = 'something'
	AND CreatedDate >= '2019-01-01';
EXEC dbo.MyProcedure 'Param1'
	, 'Param2';

Always – Separate individual statements with a single line break

This makes it clear when a new statement is beginning.

Example:

DECLARE @Greeting VARCHAR(20) = 'Hello world.';

SELECT @Greeting AS Greeting;

Always – Indent code within functional blocks, subqueries, or statements within brackets

This helps to visually distinguish between the inner and outer blocks of code.

Examples:

BEGIN
	DECLARE @Greeting VARCHAR(20) = 'Hello world.';

	SELECT @Greeting AS Greeting;
END;
SELECT ThingID
FROM dbo.MyThings
WHERE ThingID IN(
	SELECT ThingID
	FROM dbo.MyStuff
);
SELECT ThingID
FROM dbo.MyThings
WHERE ThingName = 'something'
	AND (
		CreatedDate >= '2019-01-01'
		OR CreatedDate IS NULL
	);

Always – Wrap procedural code with an outer BEGIN and END block when creating a stored procedure or user-defined function

Having an outer BEGIN and END block makes it clear where the procedure begins and ends.

Example:

CREATE PROCEDURE dbo.MyProcedure
AS
BEGIN
	-- Procedure Code...
END;

Always – Pass stored procedure parameters by name

This helps a little bit when debugging.

Example:

EXEC dbo.MyProcedure @Param1 = 'parameter';

Always – Specify a column set in SELECT and INSERT statements

Being explicit with the column name in SELECT and INSERT queries avoids ambiguous or inconsistent behaviors from table definition changes.

Examples:

SELECT ThingID, ThingName
FROM dbo.MyThings;
INSERT dbo.MyThings (ThingID, ThingName)
VALUES (1, 'something');

Always – Use ANSI-92 style joins

ANSI-92 style join syntax and behavior are supported by all major database platforms. Many platforms no longer support the older join syntax.

Example:

SELECT ThingID
	, StuffID
FROM dbo.MyThings AS t
JOIN dbo.MyStuff AS s ON t.ThingID = s.ThingID;

Always – Introduce join predicates on a new line and indent

This helps distinguish the join criteria from the tables that participate in the join.

Example:

SELECT ThingID
	, StuffID
FROM dbo.MyThings AS t
JOIN dbo.MyStuff AS s 
	ON t.ThingID = s.ThingID;

Always – Separate function parameters with a space after the comma

This helps the reader distinguish the terms that are being passed.

Example:

COALESCE(Column1, Column2, Column3)

Miscellaneous

Always – If these rules need to be broken make sure that the reasoning is clearly understood by everyone involved

Some database systems may not support certain standards and features or may implement them in non-compliant ways. There may be little that can be done when deploying third-party developed schemas or software to make them meet these guidelines. It may be best to isolate these solutions from the rest of the systems if security or performance is a concern. Whatever the case, make sure all stakeholders are aware of the risks involved in supporting non-preferred solutions and try to mitigate those risks as best as possible.

Always – If it feels hacky then it’s probably a hack

Relational database systems are very good at managing data that can be expressed relationally; managing semi-structured and non-structured data, not so much. Many times, if something feels like a hack it’s because it’s not the best design or technology to use to solve the problem or the relational database engine is being used to solve a problem it wasn’t designed to solve. Fortunately, there are technologies available for managing these other use cases for data.

Always – Before investing in actual design work, revisit the problem that needs to be solved

This is an exercise that is good to do as often as possible. It’s easy to end up chasing a rabbit down hole and end up nowhere closer to a solution. When talking with stakeholders make sure that they are clearly stating the problem that needs be solved and not simply the solution they want implemented.

Leave a Reply

Your email address will not be published. Required fields are marked *