Searchable encryption - AWS Database Encryption SDK

Searchable encryption

Our client-side encryption library was renamed to the AWS Database Encryption SDK. This developer guide still provides information on the DynamoDB Encryption Client.

Searchable encryption enables you to search encrypted records without decrypting the entire database. This is accomplished using beacons, which create a map between the plaintext value written to a field and the encrypted value that is actually stored in your database. The AWS Database Encryption SDK stores the beacon in a new field that it adds to the record. Depending on the type of beacon you use, you can perform exact match searches or more customized complex queries on your encrypted data.

Note

Searchable encryption in the AWS Database Encryption SDK differs from the searchable encryption defined in academic research, such as searchable symmetric encryption.

A beacon is a truncated Hash-Based Message Authentication Code (HMAC) tag that creates a map between the plaintext and encrypted values of a field. When you write a new value to an encrypted field that's configured for searchable encryption, the AWS Database Encryption SDK calculates an HMAC over the plaintext value. This HMAC output is a one‐to‐one (1:1) match for the plaintext value of that field. The HMAC output is truncated so that multiple, distinct plaintext values map to the same truncated HMAC tag. These false positives limit an unauthorized user's ability to identify distinguishing information about the plaintext value. When you query a beacon, the AWS Database Encryption SDK automatically filters out these false positives and returns the plaintext result of your query.
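The truncation step described above can be sketched in a few lines of Python. This is an illustration only, not the AWS Database Encryption SDK's implementation: the choice of HMAC-SHA-256, the rule of keeping the low-order bits, the key, and the helper name `truncated_beacon` are all assumptions made for the example.

```python
import hashlib
import hmac


def truncated_beacon(key: bytes, plaintext: str, length_bits: int) -> int:
    """Return the low-order `length_bits` bits of an HMAC tag (illustrative only)."""
    tag = hmac.new(key, plaintext.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(tag, "big") & ((1 << length_bits) - 1)


key = b"example-beacon-key"  # hypothetical key, for demonstration only

names = ["Jones", "Smith", "Garcia", "Nguyen", "Patel",
         "Kim", "Lopez", "Brown", "Chen"]

# A 2-bit beacon has only 4 possible tags, so 9 distinct names must collide.
buckets = {}
for name in names:
    buckets.setdefault(truncated_beacon(key, name, 2), []).append(name)

for tag, members in sorted(buckets.items()):
    print(tag, members)
```

Each bucket with more than one name shows distinct plaintext values sharing one truncated tag, which is the source of the false positives discussed here.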

The average number of false positives generated for each beacon is determined by the beacon length remaining after truncation. For help determining the appropriate beacon length for your implementation, see Determining beacon length.
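As a rough intuition for how beacon length drives false positives, the following sketch assumes that truncated tags are spread uniformly over all possible outputs. Under that assumption, a field with a given number of distinct values and a beacon of `length_bits` bits yields about `(distinct_values - 1) / 2**length_bits` other values per tag. This is a simplified model with a hypothetical function name, not the procedure described in Determining beacon length.

```python
def expected_false_positives(distinct_values: int, length_bits: int) -> float:
    """Average number of other distinct values sharing a tag with a given value.

    Assumes tags are uniformly distributed over 2**length_bits outputs.
    """
    return (distinct_values - 1) / (2 ** length_bits)


# Longer beacons leave fewer collisions, so fewer false positives per query.
for bits in (8, 12, 16, 24):
    print(bits, expected_false_positives(100_000, bits))
```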

Note

Searchable encryption is designed to be implemented in new, unpopulated databases. Any beacon configured in an existing database maps only the records that are uploaded after the beacon is configured. There is no way for a beacon to map existing data.

Are beacons right for my dataset?

Using beacons to perform queries on encrypted data reduces the performance costs associated with client-side encrypted databases. When you use beacons, there is an inherent tradeoff between how efficient your queries are and how much information is revealed about the distribution of your data. The beacon does not alter the encrypted state of the field. When you encrypt and sign a field with the AWS Database Encryption SDK, the plaintext value of the field is never exposed to the database. The database stores the randomized, encrypted value of the field.

Beacons are stored alongside the encrypted fields they are calculated from. This means that even if an unauthorized user cannot view the plaintext values of an encrypted field, they might be able to perform statistical analysis on the beacons to learn more about the distribution of your dataset, and, in extreme cases, identify the plaintext values that a beacon maps to. The way you configure your beacons can mitigate these risks. In particular, choosing the right beacon length can help you preserve the confidentiality of your dataset.

Security vs. Performance
  • The shorter the beacon length, the more security is preserved.

  • The longer the beacon length, the more performance is preserved.

Searchable encryption might not be able to provide the desired levels of both performance and security for all datasets. Review your threat model, security requirements, and performance needs before configuring any beacons.

Consider the following dataset uniqueness requirements as you determine whether searchable encryption is right for your dataset.

Distribution

The amount of security preserved by a beacon depends on the distribution of your dataset. When you configure an encrypted field for searchable encryption, the AWS Database Encryption SDK calculates an HMAC over the plaintext values written to that field. All of the beacons calculated for a given field are calculated using the same key, with the exception of multitenant databases that use a distinct key for each tenant. This means that if the same plaintext value is written to the field multiple times, the same HMAC tag is created for every instance of that plaintext value.

You should avoid constructing beacons from fields that contain very common values. For example, consider a database that stores the address of every resident of the state of Illinois. If you construct a beacon from the encrypted City field, the beacon calculated over "Chicago" will be overrepresented due to the large percentage of the Illinois population that lives in Chicago. Even if an unauthorized user can only read the encrypted values and beacon values, they might be able to identify which records contain data for residents of Chicago if the beacon preserves this distribution. To minimize the amount of distinguishing information revealed about your distribution, you must sufficiently truncate your beacon. The beacon length required to hide this uneven distribution has significant performance costs that might not meet the needs of your application.
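The Chicago example can be made concrete with a small simulation. The sketch below uses made-up record counts and an assumed beacon construction (HMAC-SHA-256 truncated to its low-order bits, with a hypothetical key); it is not the SDK's implementation. It reports the share of records that carry the most common beacon value, which is what an observer of the stored beacons could measure without any key.

```python
import hashlib
import hmac
from collections import Counter


def truncated_beacon(key: bytes, plaintext: str, length_bits: int) -> int:
    """Low-order `length_bits` bits of an HMAC tag (illustrative only)."""
    tag = hmac.new(key, plaintext.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(tag, "big") & ((1 << length_bits) - 1)


key = b"example-city-beacon-key"  # hypothetical key

# Made-up, deliberately skewed dataset: 70% of records are "Chicago".
cities = (["Chicago"] * 700 + ["Springfield"] * 100
          + ["Peoria"] * 100 + ["Rockford"] * 100)


def top_share(length_bits: int) -> float:
    """Fraction of records that carry the single most common beacon value."""
    counts = Counter(truncated_beacon(key, c, length_bits) for c in cities)
    return counts.most_common(1)[0][1] / len(cities)


# A 0-bit beacon hides everything but cannot narrow a query; a longer
# beacon narrows queries but the dominant value still stands out.
for bits in (0, 1, 16):
    print(bits, top_share(bits))
```

With a long beacon, the most common tag covers at least 70% of the records, so the skew toward one city is visible in the beacons alone.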

You must carefully analyze the distribution of your dataset to determine how much your beacons need to be truncated. The beacon length remaining after truncation directly correlates to the amount of statistical information that can be identified about your distribution. You might need to choose shorter beacon lengths to sufficiently minimize the amount of distinguishing information revealed about your dataset.

In extreme cases, you cannot calculate a beacon length for an unevenly distributed dataset that effectively balances performance and security. For example, you should not construct a beacon from a field that stores the result of a medical test for a rare disease. Since NEGATIVE results are expected to be significantly more prevalent within the dataset, POSITIVE results can be easily identified by how rare they are. It is very challenging to hide the distribution when the field only has two possible values. If you use a beacon length that is short enough to hide the distribution, all plaintext values map to the same HMAC tag. If you use a longer beacon length, it is obvious which beacons map to plaintext POSITIVE values.

Correlation

We strongly recommend that you avoid constructing distinct beacons from fields with correlated values. Beacons constructed from correlated fields require shorter beacon lengths to sufficiently minimize the amount of information revealed about the distribution of each dataset to an unauthorized user. You must carefully analyze your dataset, including its entropy and the joint distribution of correlated values, to determine how much your beacons need to be truncated. If the resulting beacon length does not meet your performance needs, then beacons might not be a good fit for your dataset.

For example, you should not construct two separate beacons from City and ZIPCode fields because the ZIP code will likely be associated with just one city. Typically, the false positives generated by a beacon limit an unauthorized user's ability to identify distinguishing information about your dataset. But the correlation between the City and ZIPCode fields means that an unauthorized user can easily identify which results are false positives and distinguish the different ZIP codes.

You should also avoid constructing beacons from fields that contain the same plaintext values. For example, you should not construct a beacon from mobilePhone and preferredPhone fields because they likely hold the same values. If you construct distinct beacons from both fields, the AWS Database Encryption SDK creates the beacons for each field under different keys. This results in two different HMAC tags for the same plaintext value. The two distinct beacons are unlikely to have the same false positives and an unauthorized user might be able to distinguish different phone numbers.
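The effect of building two beacons over the same value can be sketched as follows. The example assumes HMAC-SHA-256 truncated to its low-order bits, two hypothetical per-field keys, and a made-up set of phone numbers in which mobilePhone and preferredPhone always match. It counts how many groups an observer can split the records into when looking at one beacon versus both.

```python
import hashlib
import hmac


def truncated_beacon(key: bytes, plaintext: str, length_bits: int) -> int:
    """Low-order `length_bits` bits of an HMAC tag (illustrative only)."""
    tag = hmac.new(key, plaintext.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(tag, "big") & ((1 << length_bits) - 1)


mobile_key = b"example-mobile-key"        # hypothetical per-field keys
preferred_key = b"example-preferred-key"

phones = [f"555-{n:04d}" for n in range(2000)]  # made-up phone numbers
bits = 4

# Groups visible from the mobilePhone beacon alone: at most 2**4 = 16.
single = {truncated_beacon(mobile_key, p, bits) for p in phones}

# Groups visible from both beacons together: up to 16 * 16 = 256.
pairs = {
    (truncated_beacon(mobile_key, p, bits),
     truncated_beacon(preferred_key, p, bits))
    for p in phones
}

print(len(single), len(pairs))
```

Because the two tags are computed under different keys, their false positives rarely coincide, so the pair of beacons partitions the records far more finely than either beacon does alone.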

Even if your dataset contains correlated fields or has an uneven distribution, you might be able to construct beacons that preserve the confidentiality of your dataset by using shorter beacon lengths. However, beacon length does not guarantee that every unique value in your dataset will produce a number of false positives that effectively minimizes the amount of distinguishing information revealed about your dataset. Beacon length only estimates the average number of false positives produced. The more unevenly distributed your dataset, the less effective beacon length is at determining the average number of false positives produced.

Carefully consider the distribution of the fields you construct beacons from and consider how much you will need to truncate the beacon length to meet your security requirements. The following topics in this chapter assume that your beacons are uniformly distributed and do not contain correlated data.

Searchable encryption scenario

The following example demonstrates a simple searchable encryption solution. In practice, the fields used in this example might not meet the distribution and correlation recommendations for beacons. You can use this example for reference as you read about the searchable encryption concepts in this chapter.

Consider a database named Employees that tracks employee data for a company. Each record in the database contains fields called EmployeeID, LastName, FirstName, and Address. Each record in the Employees database is identified by the primary key EmployeeID.

The following is an example plaintext record in the database.

{ "EmployeeID": 101, "LastName": "Jones", "FirstName": "Mary", "Address": { "Street": "123 Main", "City": "Anytown", "State": "OH", "ZIPCode": 12345 } }

If you marked the LastName and FirstName fields as ENCRYPT_AND_SIGN in your cryptographic actions, the values in these fields are encrypted locally before they're uploaded to the database. The encrypted data that is uploaded is fully randomized, so the database doesn't recognize this data as being protected. It just detects typical data entries. This means that the record that is actually stored in the database might look like the following.

{ "EmployeeID": 101, "LastName": "1d76e94a2063578637d51371b363c9682bad926cbd", "FirstName": "21d6d54b0aaabc411e9f9b34b6d53aa4ef3b0a35", "Address": { "Street": "123 Main", "City": "Anytown", "State": "OH", "ZIPCode": 12345 } }

If you need to query the database for exact matches in the LastName field, configure a standard beacon named LastName to map the plaintext values written to the LastName field to the encrypted values stored in the database.

This beacon calculates HMACs from the plaintext values in the LastName field. Each HMAC output is truncated so that it is no longer an exact match for the plaintext value. For example, the complete hash and the truncated hash for Jones might look like the following.

Complete hash

2aa4e9b404c68182562b6ec761fcca5306de527826a69468885e59dc36d0c3f824bdd44cab45526f70a2a18322000264f5451acf75f9f817e2b35099d408c833

Truncated hash

b35099d408c833

After the standard beacon is configured, you can perform equality searches on the LastName field. For example, if you want to search for Jones, use the LastName beacon to perform the following query.

LastName = Jones

The AWS Database Encryption SDK automatically filters out the false positives and returns the plaintext result of your query.
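The query flow in this scenario can be sketched end to end. The code below is a toy model, not the AWS Database Encryption SDK API: the encryption is a stand-in lookup table, the beacon is assumed to be HMAC-SHA-256 truncated to 3 bits so that false positives are likely, and names such as `LastName_beacon`, `put_record`, and `query_last_name` are hypothetical.

```python
import hashlib
import hmac

BEACON_KEY = b"example-beacon-key"  # hypothetical key
BEACON_BITS = 3                     # deliberately tiny to force collisions


def truncated_beacon(plaintext: str) -> int:
    """Low-order BEACON_BITS bits of an HMAC tag (illustrative only)."""
    tag = hmac.new(BEACON_KEY, plaintext.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(tag, "big") & ((1 << BEACON_BITS) - 1)


# Stand-in for client-side encryption: opaque tokens plus a table that
# only the client holds. This is NOT real encryption.
_client_only_plaintexts = {}


def toy_encrypt(plaintext: str) -> str:
    token = f"ct-{len(_client_only_plaintexts)}"
    _client_only_plaintexts[token] = plaintext
    return token


def toy_decrypt(token: str) -> str:
    return _client_only_plaintexts[token]


def put_record(db: list, employee_id: int, last_name: str) -> None:
    """Store the encrypted field together with its beacon in a new field."""
    db.append({
        "EmployeeID": employee_id,
        "LastName": toy_encrypt(last_name),
        "LastName_beacon": truncated_beacon(last_name),
    })


def query_last_name(db: list, value: str) -> list:
    """Match on the beacon, then decrypt and drop false positives."""
    wanted = truncated_beacon(value)
    candidates = [r for r in db if r["LastName_beacon"] == wanted]
    return [r["EmployeeID"] for r in candidates
            if toy_decrypt(r["LastName"]) == value]


db = []
last_names = ["Jones", "Smith", "Garcia", "Nguyen", "Patel",
              "Kim", "Lopez", "Brown", "Chen", "Jones"]
for employee_id, name in enumerate(last_names, start=101):
    put_record(db, employee_id, name)

print(query_last_name(db, "Jones"))  # [101, 110]
```

The database side only ever compares beacon values. Records that share the beacon for Jones but hold a different last name are discarded after decryption on the client.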