Choosing a beacon length - AWS Database Encryption SDK

Choosing a beacon length

Our client-side encryption library was renamed to the AWS Database Encryption SDK. This developer guide still provides information on the DynamoDB Encryption Client.

When you write a new value to an encrypted field that's configured for searchable encryption, the AWS Database Encryption SDK calculates an HMAC over the plaintext value. This HMAC output is a one‐to‐one (1:1) match for the plaintext value of that field. The HMAC output is truncated so that multiple, distinct plaintext values map to the same truncated HMAC tag. These collisions, or false positives, limit an unauthorized user's ability to identify distinguishing information about the plaintext value.

The average number of false positives generated for each beacon is determined by the beacon length remaining after truncation. You only need to define beacon length when configuring standard beacons. Compound beacons use the beacon lengths of the standard beacons they're constructed from.

The beacon does not alter the encrypted state of the field. However, when you use beacons, there is an inherent tradeoff between how efficient your queries are and how much information is revealed about the distribution of your data.

The goal of searchable encryption is to reduce the performance costs associated with client-side encrypted databases by using beacons to perform queries on encrypted data. Beacons are stored alongside the encrypted fields they are calculated from. This means that they can reveal distinguishing information about the distribution of your dataset. In extreme cases, an unauthorized user might be able to analyze the information revealed about your distribution and use it to identify a field's plaintext value. Choosing the right beacon length can help mitigate these risks and preserve the confidentiality of your distribution.

Review your threat model to determine the level of security that you need. For example, the more individuals who have access to your database, but should not have access to the plaintext data, the more you might want to protect the confidentiality of your dataset distribution. To increase confidentiality, a beacon needs to generate more false positives. Increased confidentiality results in reduced query performance.

Security vs. Performance
  • A beacon length that is too long produces too few false positives and might reveal distinguishing information about the distribution of your dataset.

  • A beacon length that is too short produces too many false positives and increases the performance cost of queries because it requires a broader scan of the database.

When determining the appropriate beacon length for your solution, you must find a length that adequately preserves the security of your data without impacting the performance of your queries more than absolutely necessary. The amount of security preserved by a beacon depends on the distribution of your dataset and the correlation of the fields that your beacons are constructed from. The following topics assume that your beacons are uniformly distributed and do not contain correlated data.

Calculating beacon length

Beacon length is defined in bits and refers to the number of bits of the HMAC tag that are kept after truncation. The recommended beacon length varies depending on the dataset distribution, presence of correlated values, and your specific security and performance requirements. If your dataset is uniformly distributed, you can use the following equations and procedures to help identify the best beacon length for your implementation. These equations only estimate the average number of false positives that the beacon will produce, they do not guarantee that every unique value in your dataset will produce a specific number of false positives.

Note

The effectiveness of these equations is dependent on the distribution of your dataset. If your dataset is not uniformly distributed, see Are beacons right for my dataset?.

In general, the further your dataset is from a uniform distribution, the more you need to shorten your beacon length.

  1. Estimate the population

    The population is the expected number of unique values in the field that your standard beacon is constructed from, it is not the total expected number of values stored in the field. For example, consider an encrypted Room field that identifies the location of employee meetings. The Room field is expected to store 100,000 total values, but there are only 50 different rooms that employees can reserve for meetings. This means that the population is 50 because there only 50 possible unique values that can be stored in the Room field.

    Note

    If your standard beacon is constructed from a virtual field, the population used to calculate beacon length is the number of unique combinations created by the virtual field.

    When estimating your population, be sure to consider the projected growth of the dataset. After you have written new records with the beacon, you cannot update the beacon length. Review your threat model and any existing database solutions to create an estimate for the number of unique values you expect this field to store in the next five years.

    Your population does not need to be precise. First, identify the number of unique values in your current database, or estimate the number of unique values that you expect to store in the first year. Next, use the following questions to help you determine the projected growth of unique values over the next five years.

    • Do you expect the unique values to multiply by 10?

    • Do you expect the unique values to multiply by 100?

    • Do you expect the unique values to multiply by 1000?

    The difference between 50,000 and 60,000 unique values is not significant and they will both result in the same recommended beacon length. However, the difference between 50,000 and 500,000 unique values will significantly impact the recommended beacon length.

    Consider reviewing public data on the frequency of common data types, such as ZIP codes or last names. For example, there are 41,707 ZIP codes in the United States. The population you use should be proportional to your own database. If the ZIPCode field in your database includes data from across the entire United States, then you might define your population as 41,707, even if the ZIPCode field does not currently have 41,707 unique values. If the ZIPCode field in your database only includes data from a single state, and will only ever include data from a single state, then you might define your population as the total number of ZIP codes in that state instead of 41,704.

  2. Calculate the recommended range for the expected number of collisions

    To determine the appropriate beacon length for a given field, you must first identify an appropriate range for the expected number of collisions. The expected number of collisions represents the average expected number of unique plaintext values that map to a particular HMAC tag. The expected number of false positives for one unique plaintext value is one less than the expected number of collisions.

    We recommend that the expected number of collisions is greater than or equal to two, and less than the square root of your population. The following equations only work if your population has 16 or more unique values.

    2 ≤ number of collisions < √(Population)

    If the number of collisions is less than two, the beacon will produce too few false positives. We recommend two as the minimum number of expected collisions because it means, on average, every unique value in the field will generate at least one false positive by mapping to one other unique value.

  3. Calculate the recommended range for beacon lengths

    After identifying the minimum and maximum number of expected collisions, use the following equation to identify a range of appropriate beacon lengths.

    number of collisions = Population * 2-(beacon length)

    First, solve for beacon length where the number of expected collisions equals two (the minimum recommended number of expected collisions).

    2 = Population * 2-(beacon length)

    Then, solve for beacon length where the expected number of collisions equals the square root of your population (the maximum recommended number of expected collisions).

    √(Population) = Population * 2-(beacon length)

    We recommend rounding the output produced by this equation down to the shorter beacon length. For example, if the equation produces a beacon length of 15.6, we recommend rounding that value down to 15 bits instead of rounding up to 16 bits.

  4. Choose a beacon length

    These equations only identify a recommended range of beacon lengths for your field. We recommend using a shorter beacon length to preserve the security of your dataset whenever possible. However, the beacon length that you actually use is determined by your threat model. Consider your performance requirements as you review your threat model to determine the best beacon length for your field.

    Using a shorter beacon length reduces query performance, while using a longer beacon length decreases security. In general, if your dataset is unevenly distributed, or if you construct distinct beacons from correlated fields, you need to use shorter beacon lengths to minimize the amount of information revealed about the distribution of your datasets.

    If you review your threat model and decide that any distinguishing information revealed about the distribution of a field does not present a threat to your overall security, you might choose to use a beacon length that is longer than the recommended range you calculated. For example, if you calculated the recommended range of beacon lengths for a field as 9—16 bits, you might choose to use a beacon length of 24 bits to avoid any performance loss.

    Choose your beacon length carefully. After you have written new records with the beacon, you cannot update the beacon length.

Example

Consider a database that marked the unit field as ENCRYPT_AND_SIGN in the cryptographic actions. To configure a standard beacon for the unit field, we need to determine the expected number of false positives and beacon length for the unit field.

  1. Estimate the population

    After reviewing our threat model and current database solution, we expect the unit field to eventually have 100,000 unique values.

    This means that Population = 100,000.

  2. Calculate the recommended range for the expected number of collisions.

    For this example, the expected number of collisions should be between 2—316.

    2 ≤ number of collisions < √(Population)
    1. 2 ≤ number of collisions < √(100,000)
    2. 2 ≤ number of collisions < 316
  3. Calculate the recommended range for beacon length.

    For this example, the beacon length should be between 9—16 bits.

    number of collisions = Population * 2-(beacon length)
    1. Calculate the beacon length where the expected number of collisions equals the minimum identified in Step 2.

      2 = 100,000 * 2-(beacon length)

      Beacon length = 15.6, or 15 bits

    2. Calculate the beacon length where the expected number of collisions equals the maximum identified in Step 2.

      316 = 100,000 * 2-(beacon length)

      Beacon length = 8.3, or 8 bits

  4. Determine the beacon length appropriate for your security and performance requirements.

    For every bit below 15, the performance cost and the security double.

    • 16 bits

      • On average, each unique value will map to 1.5 other units.

      • Security: two records with the same truncated HMAC tag are 66% likely to have the same plaintext value.

      • Performance: a query will retrieve 15 records for every 10 records that you actually requested.

    • 14 bits

      • On average, each unique value will map to 6.1 other units.

      • Security: two records with the same truncated HMAC tag are 33% likely to have the same plaintext value.

      • Performance: a query will retrieve 30 records for every 10 records that you actually requested.