Enable Lake Formation with Amazon EMR
With Amazon EMR 6.15.0 and higher, when you run Spark jobs on Amazon EMR on EC2 clusters that access data in the AWS Glue Data Catalog, you can use AWS Lake Formation to apply table, row, column, and cell level permissions on Hudi, Iceberg, or Delta Lake based tables.
In this section, we cover how to create a security configuration and set up Lake Formation to work with Amazon EMR. We also go over how to launch a cluster with the security configuration that you created for Lake Formation.
Step 1: Set up a runtime role for your EMR cluster
To use a runtime role for your EMR cluster, you must create a security configuration. With a security configuration, you can apply consistent security, authorization, and authentication options across your clusters.
- Create a file called lf-runtime-roles-sec-cfg.json with the following security configuration.

  {
    "AuthorizationConfiguration": {
      "IAMConfiguration": {
        "EnableApplicationScopedIAMRole": true,
        "ApplicationScopedIAMRoleConfiguration": {
          "PropagateSourceIdentity": true
        }
      },
      "LakeFormationConfiguration": {
        "AuthorizedSessionTagValue": "Amazon EMR"
      }
    },
    "EncryptionConfiguration": {
      "EnableInTransitEncryption": true,
      "InTransitEncryptionConfiguration": {
        "TLSCertificateConfiguration": {<certificate-configuration>}
      }
    }
  }
- Next, to ensure that the session tag can authorize Lake Formation, set the LakeFormationConfiguration/AuthorizedSessionTagValue property to Amazon EMR.
- Use the following command to create the Amazon EMR security configuration.

  aws emr create-security-configuration \
    --name 'iamconfig-with-iam-lf' \
    --security-configuration file://lf-runtime-roles-sec-cfg.json
Alternatively, you can use the Amazon EMR console to create a security configuration with custom settings.
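If you prefer to generate the security configuration instead of editing JSON by hand, the following Python sketch builds the same structure shown in Step 1 and serializes it. The function name and the example certificate settings (provider type and S3 path) are assumptions for illustration only; supply your own TLS certificate configuration.

```python
import json


def build_security_configuration(tls_certificate_configuration):
    """Return the Step 1 security configuration as a Python dict."""
    return {
        "AuthorizationConfiguration": {
            "IAMConfiguration": {
                "EnableApplicationScopedIAMRole": True,
                "ApplicationScopedIAMRoleConfiguration": {
                    "PropagateSourceIdentity": True
                },
            },
            "LakeFormationConfiguration": {
                # Must match the session tag value that Lake Formation authorizes.
                "AuthorizedSessionTagValue": "Amazon EMR"
            },
        },
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": True,
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": tls_certificate_configuration
            },
        },
    }


# Hypothetical certificate settings; replace them with your own values.
example_tls = {
    "CertificateProviderType": "PEM",
    "S3Object": "s3://amzn-s3-demo-bucket/certs/my-certs.zip",
}

config = build_security_configuration(example_tls)
config_json = json.dumps(config, indent=2)
print(config_json)
```

You could save the printed JSON as lf-runtime-roles-sec-cfg.json and pass it to the create-security-configuration command shown above.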
Step 2: Launch an Amazon EMR cluster
Now you’re ready to launch an EMR cluster with the security configuration that you created in the previous step. For more information on security configurations, see Use security configurations to set up Amazon EMR cluster security and Runtime roles for Amazon EMR steps.
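As a minimal sketch of this step, the following Python code checks that a release label meets the 6.15.0 minimum and then assembles an aws emr create-cluster command line that references the security configuration from Step 1. The cluster name, instance type, instance count, and the use of default roles are placeholder assumptions, not recommendations; the helper function names are also hypothetical.

```python
import shlex

# Lake Formation support for Spark on EMR on EC2 starts with release 6.15.0.
MIN_RELEASE = (6, 15, 0)


def parse_release_label(label):
    """Convert a label such as 'emr-6.15.0' into the tuple (6, 15, 0)."""
    prefix = "emr-"
    if not label.startswith(prefix):
        raise ValueError(f"unexpected release label: {label!r}")
    return tuple(int(part) for part in label[len(prefix):].split("."))


def supports_lake_formation(label):
    """Return True if the release is 6.15.0 or higher."""
    return parse_release_label(label) >= MIN_RELEASE


def create_cluster_command(name, release_label, security_configuration):
    """Build (but do not run) an 'aws emr create-cluster' command line."""
    if not supports_lake_formation(release_label):
        raise ValueError(f"{release_label} is older than emr-6.15.0")
    args = [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Spark",
        "--security-configuration", security_configuration,
        # Placeholder sizing and roles; adjust for your environment.
        "--use-default-roles",
        "--instance-type", "m5.xlarge",
        "--instance-count", "3",
    ]
    return shlex.join(args)


command = create_cluster_command(
    "lf-demo-cluster", "emr-6.15.0", "iamconfig-with-iam-lf"
)
print(command)
```

The sketch only prints the command so that you can review it before you run it in a shell with your own credentials.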
Step 3a: Set up Lake Formation-based table-level permissions with Amazon EMR runtime roles
If you don't require fine-grained access control at the column, row, or cell level, you can set up table-level permissions with Glue Data Catalog. To enable table-level access, navigate to the AWS Lake Formation console and select the Application integration settings option from the Administration section in the sidebar. Then, enable the following option and choose Save:
Allow external engines to access data in Amazon S3 locations with full table access
Step 3b: Set up Lake Formation-based column, row, or cell-level permissions with Amazon EMR runtime roles
To apply table and column level permissions with Lake Formation, the data lake administrator for Lake Formation must set Amazon EMR as the value for the session tag configuration, AuthorizedSessionTagValue. Lake Formation uses this session tag to authorize callers and provide access to the data lake. You can set this session tag in the External data filtering section of the Lake Formation console. Replace 123456789012 with your own AWS account ID.
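If you manage data lake settings programmatically rather than in the console, the following Python sketch shows one way to prepare the updated settings. It assumes that the Lake Formation PutDataLakeSettings API accepts the field names AllowExternalDataFiltering, ExternalDataFilteringAllowList, and AuthorizedSessionTagValueList; verify them against the API reference before use. The function name and the sample administrator ARN are hypothetical. The sketch only merges dictionaries and makes no AWS calls.

```python
import copy


def with_emr_external_filtering(settings, account_id, session_tag="Amazon EMR"):
    """Return a copy of `settings` with external data filtering enabled.

    `settings` is assumed to be the current data lake settings, so that
    existing values such as the administrator list are preserved.
    """
    updated = copy.deepcopy(settings)
    updated["AllowExternalDataFiltering"] = True

    # Add the account to the allow list only if it is not already there.
    allow_list = updated.setdefault("ExternalDataFilteringAllowList", [])
    principal = {"DataLakePrincipalIdentifier": account_id}
    if principal not in allow_list:
        allow_list.append(principal)

    # Add the session tag value only if it is not already authorized.
    tags = updated.setdefault("AuthorizedSessionTagValueList", [])
    if session_tag not in tags:
        tags.append(session_tag)
    return updated


# Hypothetical current settings; replace 123456789012 with your account ID.
current = {
    "DataLakeAdmins": [
        {"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/Admin"}
    ]
}
updated = with_emr_external_filtering(current, "123456789012")
print(updated)
```

Starting from the current settings matters under the assumption that the API replaces the whole settings object, in which case sending only the new fields would drop existing administrators.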
Step 4: Configure AWS Glue and Lake Formation grants for Amazon EMR runtime roles
To continue setting up Lake Formation based access control, you must configure AWS Glue and Lake Formation grants for your Amazon EMR runtime roles. To allow your IAM runtime roles to interact with Lake Formation, grant them the lakeformation:GetDataAccess and glue:Get* permissions.
Lake Formation permissions control access to AWS Glue Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and AWS Glue APIs and resources. Although you might have the Lake Formation permission to access a table in the data catalog (SELECT), your operation fails if you don’t have the IAM permission on the glue:Get* API. For more details about Lake Formation access control, see Lake Formation access control overview.
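The two layers described above can be pictured with a small Python model. This is a deliberately simplified sketch, not the actual IAM or Lake Formation evaluation logic: it ignores deny statements, conditions, and resource scoping, and the function names and the sample table name are hypothetical. It only illustrates that a read needs both the IAM permissions and the Lake Formation grant.

```python
from fnmatch import fnmatchcase


def iam_allows(action, allowed_patterns):
    """Return True if `action` matches any allowed pattern such as 'glue:Get*'.

    Both sides are lowercased so that the comparison is case-insensitive.
    """
    action = action.lower()
    return any(fnmatchcase(action, pattern.lower()) for pattern in allowed_patterns)


def can_read_table(iam_patterns, lake_formation_grants, table):
    """Model a read: it needs the IAM API permissions and a SELECT grant."""
    has_iam = (
        iam_allows("glue:GetTable", iam_patterns)
        and iam_allows("lakeformation:GetDataAccess", iam_patterns)
    )
    has_grant = "SELECT" in lake_formation_grants.get(table, set())
    return has_iam and has_grant


grants = {"sales.orders": {"SELECT"}}

# The role has the SELECT grant and both IAM permissions: the read succeeds.
print(can_read_table(["glue:Get*", "lakeformation:GetDataAccess"], grants, "sales.orders"))

# The role has the SELECT grant but no glue:Get* permission: the read fails.
print(can_read_table(["lakeformation:GetDataAccess"], grants, "sales.orders"))
```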
- Create the emr-runtime-roles-lake-formation-policy.json file with the following content.

  {
    "Version": "2012-10-17",
    "Statement": {
      "Sid": "LakeFormationManagedAccess",
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess",
        "glue:Get*",
        "glue:Create*",
        "glue:Update*"
      ],
      "Resource": "*"
    }
  }
- Create the related IAM policy.

  aws iam create-policy \
    --policy-name emr-runtime-roles-lake-formation-policy \
    --policy-document file://emr-runtime-roles-lake-formation-policy.json
- To assign this policy to your IAM runtime roles, follow the steps in Managing AWS Lake Formation permissions.
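As a rough illustration of the policy steps, the following Python sketch builds the same policy document, derives the ARN that a customer managed policy with the default path would have, and assembles an aws iam attach-role-policy command line. Attaching the policy with the CLI is only one possible approach and is an assumption here; the role name is hypothetical, and the account ID is a placeholder.

```python
import json
import shlex

POLICY_NAME = "emr-runtime-roles-lake-formation-policy"


def runtime_role_policy():
    """Return the policy document from this step as a Python dict."""
    return {
        "Version": "2012-10-17",
        "Statement": {
            "Sid": "LakeFormationManagedAccess",
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess",
                "glue:Get*",
                "glue:Create*",
                "glue:Update*",
            ],
            "Resource": "*",
        },
    }


def policy_arn(account_id, policy_name=POLICY_NAME):
    """ARN of a customer managed policy created with the default path."""
    return f"arn:aws:iam::{account_id}:policy/{policy_name}"


def attach_command(role_name, account_id):
    """Build (but do not run) an 'aws iam attach-role-policy' command line."""
    return shlex.join([
        "aws", "iam", "attach-role-policy",
        "--role-name", role_name,
        "--policy-arn", policy_arn(account_id),
    ])


policy_json = json.dumps(runtime_role_policy(), indent=2)
print(policy_json)

# Hypothetical role name; replace 123456789012 with your own account ID.
attach = attach_command("my-emr-runtime-role", "123456789012")
print(attach)
```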
You can now use runtime roles and Lake Formation to apply table and column level permissions. You can also use a source identity to control actions and monitor operations with AWS CloudTrail. For a detailed, end-to-end example, see Introducing runtime roles for Amazon EMR steps.
For information about how to integrate with Iceberg and AWS Glue Data Catalog for a multi-catalog hierarchy, see Configure Spark to access a multi-catalog hierarchy in AWS Glue Data Catalog.