Collations for T-SQL
This topic provides reference information about collations and character sets in Microsoft SQL Server 2019 and Amazon Aurora PostgreSQL, highlighting their differences and similarities. You can gain insight into how these database systems handle string management, sorting rules, and character encoding. The topic explores the various levels at which collations can be defined in SQL Server, from server-level to expression-level, and contrasts this with PostgreSQL’s approach.
Feature compatibility | AWS SCT / AWS DMS automation level | AWS SCT action code index | Key differences |
---|---|---|---|
SQL Server Usage
SQL Server collations define the rules for string management and storage in terms of sorting, case sensitivity, accent sensitivity, and code page mapping. SQL Server supports both ASCII and UCS-2 UNICODE data.
UCS-2 UNICODE data uses a dedicated set of UNICODE data types denoted by the N prefix: NCHAR and NVARCHAR. Their ASCII counterparts are CHAR and VARCHAR.
Choosing a collation and a character set has significant implications on data storage, logical predicate evaluations, query results, and query performance.
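For example, the following minimal sketch shows how case sensitivity alone changes the result of a predicate; the collation names are standard Windows collations and the literals are arbitrary.
SELECT
    CASE WHEN 'ABC' = 'abc' COLLATE Latin1_General_CI_AS THEN 1 ELSE 0 END AS ci_match, -- returns 1 (case-insensitive)
    CASE WHEN 'ABC' = 'abc' COLLATE Latin1_General_CS_AS THEN 1 ELSE 0 END AS cs_match; -- returns 0 (case-sensitive)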
To view all collations supported by SQL Server, use the fn_helpcollations function:
SELECT * FROM sys.fn_helpcollations()
Collations define the actual bitwise binary representation of all string characters and the associated sorting rules. SQL Server supports multiple collations down to the column level. A table may have multiple string columns that use different collations. Collations for non-UNICODE character sets determine the code page number representing the string characters.
UNICODE and non-UNICODE data types in SQL Server aren’t compatible. A predicate or data modification that introduces a type conflict is resolved using predefined collation precedence rules. For more information, see Collation Precedence in the SQL Server documentation.
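Similarly, comparing two columns that use different collations raises a collation conflict error unless one side is converted explicitly. The following is a minimal sketch; the table and column names are hypothetical.
-- Hypothetical table with two differently collated columns.
CREATE TABLE CollationDemo (
    NameDe VARCHAR(50) COLLATE Latin1_General_CS_AS,
    NameTr VARCHAR(50) COLLATE Turkish_CI_AS
);

-- Comparing the columns directly fails with a collation conflict error.
-- Applying COLLATE to one side gives it explicit precedence and resolves the conflict.
SELECT *
FROM CollationDemo
WHERE NameDe = NameTr COLLATE Latin1_General_CS_AS;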
Collations define sorting and matching sensitivity for the following string characteristics:
- Case
- Accent
- Kana
- Width
- Variation selector
SQL Server uses a suffix naming convention that appends the option name to the collation name. For example, the collation Azeri_Cyrillic_100_CS_AS_KS_WS_SC is an Azeri-Cyrillic-100 collation that is case-sensitive, accent-sensitive, kana type-sensitive, width-sensitive, and supports supplementary characters.
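To see which sensitivity combinations a collation family offers, you can filter fn_helpcollations by name prefix, as in this minimal sketch:
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name LIKE 'Azeri_Cyrillic_100%';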
SQL Server supports three types of collation sets:
- Windows collations use the rules defined for collations by the operating system locale, where UNICODE and non-UNICODE data use the same comparison algorithms.
- Binary collations use the binary bit-wise code for comparison, so the locale doesn’t affect sorting.
- SQL Server collations provide backward compatibility with previous SQL Server versions. They aren’t compatible with the Windows collation rules for non-UNICODE data.
You can define collations at various levels:
- Server-level collation determines the collation used for all system databases and is the default for future user databases. While the system databases' collation can’t be changed, you can specify an alternative collation as part of the CREATE DATABASE statement.
- Database-level collations inherit the server default unless the CREATE DATABASE statement explicitly sets a different collation. This collation is used as the default for all CREATE TABLE and ALTER TABLE statements.
- Column-level collations can be specified as part of the CREATE TABLE or ALTER TABLE statements to override the default collation setting of your database, as shown in the sketch after this list.
- Expression-level collations can be set for individual string expressions using the COLLATE clause. For example, SELECT * FROM MyTable ORDER BY StringColumn COLLATE Latin1_General_CS_AS.
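The following is a minimal sketch of a column-level override using ALTER TABLE; the table and column names are hypothetical. Note that the column's data type must be restated, and indexes or constraints that reference the column may need to be dropped and recreated first.
-- Hypothetical: change only the collation of an existing column.
ALTER TABLE MyTable
    ALTER COLUMN StringColumn VARCHAR(100) COLLATE Latin1_General_CS_AS;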
Earlier versions of SQL Server support UCS-2 UNICODE only. SQL Server 2019 adds support for UTF-8 as an import and export encoding, and as a database-level or column-level collation for string data. Support includes PolyBase external tables, and Always Encrypted when not used with Enclaves. For more information, see Collation and Unicode Support in the SQL Server documentation.
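The following is a minimal sketch of using a UTF-8 enabled collation (the _UTF8 suffix) at the database and column level in SQL Server 2019; the database and table names are hypothetical.
CREATE DATABASE MyUtf8Database COLLATE Latin1_General_100_CI_AS_SC_UTF8;

CREATE TABLE MyUtf8Table (
    Col1 VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL -- UTF-8 data in a VARCHAR column
);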
Syntax
CREATE DATABASE <Database Name> [ ON <File Specifications> ] COLLATE <Collation> [ WITH <Database Option List> ];
CREATE TABLE <Table Name> ( <Column Name> <String Data Type> COLLATE <Collation> [ <Column Constraints> ]... );
Examples
The following example creates a database with a default Bengali_100_CS_AI collation.
CREATE DATABASE MyBengaliDatabase
ON ( NAME = MyBengaliDatabase_Datafile,
     FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\MyBengaliDatabase.mdf',
     SIZE = 100)
LOG ON ( NAME = MyBengaliDatabase_Logfile,
     FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\MyBengaliDblog.ldf',
     SIZE = 25)
COLLATE Bengali_100_CS_AI;
The following example creates a table with two different collations.
CREATE TABLE MyTable (
    Col1 CHAR(10) COLLATE Hungarian_100_CI_AI_SC NOT NULL PRIMARY KEY,
    COL2 VARCHAR(100) COLLATE Sami_Sweden_Finland_100_CS_AS_KS NOT NULL
);
For more information, see Collation and Unicode support in the SQL Server documentation.
PostgreSQL Usage
PostgreSQL supports a variety of character sets, also known as encodings, including both single-byte and multi-byte languages. The default character set is specified when initializing a PostgreSQL database cluster with initdb. Each individual database created in the PostgreSQL cluster can use a different character set, defined as part of database creation.
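As a quick check, the following minimal sketch shows the encoding of the database you are currently connected to; a per-database query appears later in this topic.
SHOW server_encoding;

-- Equivalent catalog query for the current database only.
SELECT pg_encoding_to_char(encoding) AS encoding
FROM pg_database
WHERE datname = current_database();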
Note
For Amazon Relational Database Service (Amazon RDS), starting with PostgreSQL 13, the Windows version now supports obtaining version information for collations or ordering rules from the operating system.
When you query a collation in PostgreSQL running on Windows, versions prior to 13 don’t return any value that reflects the OS collation version. For example, for PostgreSQL version 11 running on Windows, the result is shown following:
CREATE COLLATION german (provider = libc, locale = 'de_DE');
CREATE COLLATION

select oid, collname, collversion from pg_collation where collprovider='c' and collname='german';
  oid  | collname | collversion
-------+----------+-------------
 16394 | german   |
(1 row)

select pg_collation_actual_version(16394);
 pg_collation_actual_version
------------------------------

(1 row)
For PostgreSQL version 13 running on Windows, the result is shown following:
CREATE COLLATION german (provider = libc, locale = 'de_DE');
CREATE COLLATION

select oid, collname, collversion from pg_collation where collprovider='c' and collname='german';
  oid  | collname |  collversion
-------+----------+---------------
 32769 | german   | 1539.5,1539.5
(1 row)

select pg_collation_actual_version(32769);
 pg_collation_actual_version
------------------------------
 1539.5,1539.5
(1 row)
Clients can use all supported character sets. However, some client-only character sets aren’t supported for use within the server.
Unlike SQL Server, PostgreSQL doesn’t natively provide an NVARCHAR data type and doesn’t support UTF-16.
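In practice, NVARCHAR columns are typically migrated to plain VARCHAR or TEXT columns in a UTF8-encoded database, which already stores multi-byte Unicode text. The following is a minimal sketch; the table and column names are hypothetical.
-- In a database created with ENCODING 'UTF8', ordinary character types hold Unicode text,
-- so no N-prefixed type is needed.
CREATE TABLE customer_names (
    id integer PRIMARY KEY,
    full_name varchar(100)
);

INSERT INTO customer_names VALUES (1, 'Jürgen Müller');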
Type | Function | Implementation level |
---|---|---|
Encoding | Defines the basic rules on how alphanumeric characters are represented in binary format. For example, Unicode encoding. | Database |
Locale | A superset that includes the LC_COLLATE (string sort order) and LC_CTYPE (character classification) parameters, among others. | Table-Column |
Examples
The following example creates a database named test01 which uses the Korean EUC_KR encoding and the ko_KR locale.
CREATE DATABASE test01 WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;
The following example shows how to view the character sets configured for each database by querying the system catalog.
select datname, datcollate, datctype from pg_database;
Changing Character Sets or Encoding
In-place modification of the database encoding isn’t recommended or supported. Instead, export all data, create a new database with the new encoding, and import the data.
Export the data using the pg_dump utility.
pg_dump mydb1 > mydb1_export.sql
Rename or delete the old database.
ALTER DATABASE mydb1 RENAME TO mydb1_backup;
Create a new database using the modified encoding.
CREATE DATABASE mydb1_new_encoding WITH ENCODING 'UNICODE' TEMPLATE=template0;
Import the data using the pg_dump file previously created. Verify that you set your client encoding to the encoding of your old database.
PGCLIENTENCODING=OLD_DB_ENCODING psql -f mydb1_export.sql mydb1_new_encoding
The client_encoding parameter overrides the use of PGCLIENTENCODING.
Client-Server Character Set Conversions
PostgreSQL supports conversion of character sets between servers and clients for specific character set combinations as described in the pg_conversion system catalog.
PostgreSQL includes predefined conversions. For more information, see Available Character Set Conversions in the PostgreSQL documentation.
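The following minimal sketch lists available conversions from the catalog; pg_encoding_to_char converts the internal encoding numbers to readable names.
SELECT conname,
       pg_encoding_to_char(conforencoding) AS source_encoding,
       pg_encoding_to_char(contoencoding)  AS target_encoding,
       condefault
FROM pg_conversion
ORDER BY conname
LIMIT 10;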
You can create a new conversion using the SQL command CREATE CONVERSION.
Examples
The following example creates a conversion from UTF8 to LATIN1 using the custom myfunc1 function.
CREATE CONVERSION myconv FOR 'UTF8' TO 'LATIN1' FROM myfunc1;
The following example configures the PostgreSQL client character set.
Method 1
========
psql
\encoding SJIS

Method 2
========
SET CLIENT_ENCODING TO 'value';
View the client character set and reset it back to the default value.
SHOW client_encoding; RESET client_encoding;
Table Level Collation
PostgreSQL supports specifying the sort order and character classification behavior on a per-column level.
Example
The following example specifies collations for individual table columns.
CREATE TABLE test1 (col1 text COLLATE "de_DE", col2 text COLLATE "es_ES");
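You can also apply a collation per expression, for example in an ORDER BY clause. The following minimal sketch reuses the test1 table from the previous example; the collation names available depend on the locales installed on the server.
SELECT col1 FROM test1 ORDER BY col1 COLLATE "de_DE";
SELECT col1 FROM test1 ORDER BY col1 COLLATE "C"; -- the "C" collation is always available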
Summary
Feature | SQL Server | Aurora PostgreSQL |
---|---|---|
View database character set | SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation'); | select datname, datcollate, datctype from pg_database; |
Modify the database character set | ALTER DATABASE ... COLLATE ... | Create a new database with the new encoding and reload the data using pg_dump. |
Character set granularity | Database | Database |
UTF8 | Supported | Supported |
UTF16 | Supported | Not Supported |
NVARCHAR data type | Supported | Not Supported |
For more information, see Character Set Support in the PostgreSQL documentation.