Collations supported by Babelfish - Amazon Aurora

Collations supported by Babelfish

When you create an Aurora PostgreSQL DB cluster with Babelfish, you choose a collation for your data. A collation specifies the sort order and bit patterns that produce the text or characters in a given written human language. A collation includes rules comparing data for a given set of bit patterns. Collation is related to localization. Different locales affect character mapping, sort order, and the like. Collation attributes are reflected in the names of various collations. For information about attributes, see the Babelfish collation attributes table.

Babelfish maps SQL Server collations to comparable collations provided by Babelfish. Babelfish predefines Unicode collations with culturally sensitive string comparisons and sort orders. Babelfish also provides a way to translate the collations in your SQL Server DB to the closest-matching Babelfish collation. Locale-specific collations are provided for different languages and regions.

Some collations specify a code page that corresponds to a client-side encoding. Babelfish automatically translates from the server encoding to the client encoding depending on the collation of each output column.

Babelfish supports the collations listed in the Babelfish supported collations table. Babelfish maps SQL Server collations to comparable collations provided by Babelfish.

Babelfish uses version 153.80 of the International Components for Unicode (ICU) collation library. For more information about ICU collations, see Collation in the ICU documentation. To learn more about PostgreSQL and collation, see Collation Support in the the PostgreSQL documentation.

DB cluster parameters that control collation and locale

The following parameters affect collation behavior.

babelfishpg_tsql.default_locale

This parameter specifies the default locale used by the collation. This parameter is used in combination with attributes listed in the Babelfish collation attributes table to customize collations for a specific language and region. The default value for this parameter is en-US.

The default locale applies to all Babelfish collation names that start with "BBF" and to all SQL Server collations that are mapped to Babelfish collations. Changing the setting for this parameter on an existing Babelfish DB cluster doesn't affect the locale of existing collations. For the list of collations, see the Babelfish supported collations table.

babelfishpg_tsql.server_collation_name

This parameter specifies the default collation for the server (Aurora PostgreSQL DB cluster instance) and the database. The default value is sql_latin1_general_cp1_ci_as. The server_collation_name has to be a CI_AS collation because in T-SQL, the server collation determines how identifiers are compared.

When you create your Babelfish DB cluster, you choose the Collation name from the selectable list. These include the collations listed in the Babelfish supported collations table. Don't modify the server_collation_name after the Babelfish database is created.

The settings you choose when you create your Babelfish for Aurora PostgreSQL DB cluster are stored in the DB cluster parameter group associated with the cluster for these parameters and set its collation behavior.

Deterministic and nondeterministic collations and Babelfish

Babelfish supports deterministic and nondeterministic collations:

  • A deterministic collation evaluates characters that have identical byte sequences as equal. That means that x and X aren't equal in a deterministic collation. Deterministic collations can be case-sensitive (CS) and accent-sensitive (AS).

  • A nondeterministic collation doesn't need an identical match. A nondeterministic collation evaluates x and X as equal. Nondeterministic collations are case-insensitive (CI) or accent-insensitive (AI), or both.

In the table following, you can find some behavior differences between Babelfish and PostgreSQL when using nondeterministic collations.

Babelfish PostgreSQL

Supports the LIKE clause for CI_AS collations.

Doesn't support the LIKE clause on nondeterministic collations.

Supports the LIKE clause only on the following AI collations from Babelfish version 4.2.0:

  • bbf_unicode_cp1250_ci_ai

  • bbf_unicode_cp1250_cs_ai

  • bbf_unicode_cp1257_ci_ai

  • bbf_unicode_cp1257_cs_ai

  • bbf_unicode_cp1_ci_ai

  • bbf_unicode_cp1_cs_ai

  • estonian_ci_ai

  • finnish_swedish_ci_ai

  • french_ci_ai

  • latin1_general_ci_ai

  • latin1_general_cs_ai

  • modern_spanish_ci_ai

  • polish_ci_ai

  • sql_latin1_general_cp1_ci_ai

  • sql_latin1_general_cp1_cs_ai

  • traditional_spanish_ci_ai

Doesn't support the LIKE clause on nondeterministic collations.

For a list of other limitations and behavior differences for Babelfish compared to SQL Server and PostgreSQL, see Collation limitations and behavior differences.

Babelfish and SQL Server follow a naming convention for collations that describe the collation attributes, as shown in the table following.

Attribute Description

AI

Accent-insensitive.

AS

Accent-sensitive.

BIN2

BIN2 requests data to be sorted in code point order. Unicode code point order is the same character order for UTF-8, UTF-16, and UCS-2 encodings. Code point order is a fast deterministic collation.

CI

Case-insensitive.

CS

Case-sensitive.

PREF

To sort uppercase letters before lowercase letters, use a PREF collation. If comparison is case-insensitive, the uppercase version of a letter sorts before the lowercase version, if there is no other distinction. The ICU library supports uppercase preference with colCaseFirst=upper, but not for CI_AS collations.

PREF can be applied only to CS_AS deterministic collations.

Collations supported by Babelfish

Use the following collations as a server collation or an object collation.

Collation ID Notes

bbf_unicode_general_ci_as

Supports case-insensitive comparison and the LIKE operator.

bbf_unicode_cp1_ci_as

Nondeterministic collation also known as CP1252.

bbf_unicode_CP1250_ci_as

Nondeterministic collation used to represent texts in Central European and Eastern European languages that use Latin script.

bbf_unicode_CP1251_ci_as

Nondeterministic collation for languages that use the Cyrillic script.

bbf_unicode_cp1253_ci_as

Nondeterministic collation used to represent modern Greek.

bbf_unicode_cp1254_ci_as

Nondeterministic collation that supports Turkish.

bbf_unicode_cp1255_ci_as

Nondeterministic collation that supports Hebrew.

bbf_unicode_cp1256_ci_as

Nondeterministic collation used to write languages that use Arabic script.

bbf_unicode_cp1257_ci_as

Nondeterministic collation used to support Estonian, Latvian, and Lithuanian languages.

bbf_unicode_cp1258_ci_as

Nondeterministic collation used to write Vietnamese characters.

bbf_unicode_cp874_ci_as

Nondeterministic collation used to write Thai characters.

sql_latin1_general_cp1250_ci_as

Nondeterministic single-byte character encoding used to represent Latin characters.

sql_latin1_general_cp1251_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1253_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1254_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1255_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1256_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1257_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1258_ci_as

Nondeterministic collation that supports Latin characters.

chinese_prc_ci_as

Nondeterministic collation that supports Chinese (PRC).

cyrillic_general_ci_as

Nondeterministic collation that supports Cyrillic.

finnish_swedish_ci_as

Nondeterministic collation that supports Finnish.

french_ci_as

Nondeterministic collation that supports French.

japanese_ci_as

Nondeterministic collation that supports Japanese. Supported in Babelfish 2.1.0 and higher releases.

korean_wansung_ci_as

Nondeterministic collation that supports Korean (with dictionary sort).

latin1_general_ci_as

Nondeterministic collation that supports Latin characters.

modern_spanish_ci_as

Nondeterministic collation that supports Modern Spanish.

polish_ci_as

Nondeterministic collation that supports Polish.

thai_ci_as

Nondeterministic collation that supports Thai.

traditional_spanish_ci_as

Nondeterministic collation that supports Spanish (traditional sort).

turkish_ci_as

Nondeterministic collation that supports Turkish.

ukrainian_ci_as

Nondeterministic collation that supports Ukrainian.

vietnamese_ci_as

Nondeterministic collation that supports Vietnamese.

You can use the following collations as object collations.

Dialect Deterministic options Nondeterministic options

Arabic

Arabic_CS_AS

Arabic_CI_AS

Arabic_CI_AI

Arabic script

BBF_Unicode_CP1256_CS_AS

BBF_Unicode_Pref_CP1256_CS_AS

BBF_Unicode_CP1256_CI_AI

BBF_Unicode_CP1256_CS_AI

Binary

latin1_general_bin2

BBF_Unicode_BIN2

Central European and Eastern European languages that use Latin script

BBF_Unicode_CP1250_CS_AS

BBF_Unicode_Pref_CP1250_CS_AS

BBF_Unicode_CP1250_CI_AI

BBF_Unicode_CP1250_CS_AI

Chinese

Chinese_PRC_CS_AS

Chinese_PRC_CI_AS

Chinese_PRC_CI_AI

Cyrillic_General

Cyrillic_General_CS_AS

Cyrillic_General_CI_AS

Cyrillic_General_CI_AI

Cyrillic script

BBF_Unicode_CP1251_CS_AS

BBF_Unicode_Pref_CP1251_CS_AS

BBF_Unicode_CP1251_CI_AI

BBF_Unicode_CP1251_CS_AI

Estonian

Estonian_CS_AS

Estonian_CI_AS

Estonian_CI_AI

Estonian, Latvian, and Lithuanian

BBF_Unicode_CP1257_CS_AS

BBF_Unicode_Pref_CP1257_CS_AS

BBF_Unicode_CP1257_CI_AI

BBF_Unicode_CP1257_CS_AI

Finnish_Swedish

Finnish_Swedish_CS_AS

Finnish_Swedish_CI_AS

Finnish_Swedish_CI_AI

French

French_CS_AS

French_CI_AS

French_CI_AI

Greek

Greek_CS_AS

Greek_CI_AS

Greek_CI_AI

Hebrew

BBF_Unicode_CP1255_CS_AS

BBF_Unicode_Pref_CP1255_CS_AS

Hebrew_CS_AS

BBF_Unicode_CP1255_CI_AI

BBF_Unicode_CP1255_CS_AI

Hebrew_CI_AS

Hebrew_CI_AI

Japanese (Babelfish 2.1.0 and higher)

Japanese_CS_AS

Japanese_CI_AI

Japanese_CI_AS

Korean_Wamsung

Korean_Wamsung_CS_AS

Korean_Wamsung_CI_AS

Korean_Wamsung_CI_AI

Latin characters for code page CP1252

latin1_general_cs_as

BBF_Unicode_General_CS_AS

BBF_Unicode_General_Pref_CS_AS

BBF_Unicode_Pref_CP1_CS_AS

BBF_Unicode_CP1_CS_AS

latin1_general_ci_as

latin1_general_ci_ai

latin1_general_cs_ai

BBF_Unicode_General_CI_AI

BBF_Unicode_General_CS_AI

BBF_Unicode_CP1_CI_AI

BBF_Unicode_CP1_CS_AI

Modern Greek

BBF_Unicode_CP1253_CS_AS

BBF_Unicode_Pref_CP1253_CS_AS

BBF_Unicode_CP1253_CI_AI

BBF_Unicode_CP1253_CS_AI

Modern_Spanish

Modern_Spanish_CS_AS

Modern_Spanish_CI_AS

Modern_Spanish_CI_AI

Mongolian

Mongolian_CS_AS

Mongolian_CI_AS

Mongolian_CI_AI

Polish

Polish_CS_AS

Polish_CI_AS

Polish_CI_AI

Thai

BBF_Unicode_CP874_CS_AS

BBF_Unicode_Pref_CP874_CS_AS

Thai_CS_AS

BBF_Unicode_CP874_CI_AI

BBF_Unicode_CP874_CS_AI

Thai_CI_AS, Thai_CI_AI

Traditional_Spanish

Traditional_Spanish_CS_AS

Traditional_Spanish_CI_AS

Traditional_Spanish_CI_AI

Turkish

BBF_Unicode_CP1254_CS_AS

BBF_Unicode_Pref_CP1254_CS_AS

Turkish_CS_AS

BBF_Unicode_CP1254_CI_AI

BBF_Unicode_CP1254_CS_AI

Turkish_CI_AS, Turkish_CI_AI

Ukranian

Ukranian_CS_AS

Ukranian_CI_AS

Ukranian_CI_AI

Vietnamese

BBF_Unicode_CP1258_CS_AS

BBF_Unicode_Pref_CP1258_CS_AS

Vietnamese_CS_AS

BBF_Unicode_CP1258_CI_AI

BBF_Unicode_CP1258_CS_AI

Vietnamese_CI_AS

Vietnamese_CI_AI

Default Collation in Babelfish

Earlier, the default collation of the collatable datatypes was pg_catalog.default. The datatypes and the objects that depends on these datatypes follows cases-sensitive collation. This condition potentially impacts the T-SQL objects of the data set with case-insensitive collation. Starting with Babelfish 2.3.0, the default collation for the collatable data types (except TEXT and NTEXT) is the same as the collation in the babelfishpg_tsql.server_collation_name parameter. When you upgrade to Babelfish 2.3.0, the default collation is picked automatically at the time of DB cluster creation, which doesn't create any visible impact.