Creating a data lake from a JDBC source in Lake Formation

This tutorial guides you through the steps to take on the AWS Lake Formation console to create and load your first data lake from a JDBC source using Lake Formation.

Intended audience

The following roles are used in this AWS Lake Formation JDBC tutorial:

  • IAM administrator: A user who can create AWS Identity and Access Management (IAM) users and roles and Amazon Simple Storage Service (Amazon S3) buckets. Has the AdministratorAccess AWS managed policy.

  • Data lake administrator: A user who can access the Data Catalog, create databases, and grant Lake Formation permissions to other users. Has fewer IAM permissions than the IAM administrator, but enough to administer the data lake.

  • Data analyst: A user who can run queries against the data lake. Has only enough permissions to run queries.

  • Workflow role: A role with the required IAM policies to run a workflow.

For information about prerequisites for completing the tutorial, see JDBC tutorial prerequisites.

JDBC tutorial prerequisites

Before you begin the AWS Lake Formation JDBC tutorial, ensure that you've done the following:

  • Complete the tasks in Getting started with Lake Formation.

  • Decide on a JDBC-accessible data store that you want to use for the tutorial.

  • Gather the information that is required to create an AWS Glue connection of type JDBC. This Data Catalog object includes the URL to the data store, login credentials, and if the data store was created in an Amazon Virtual Private Cloud (Amazon VPC), additional VPC-specific configuration information. For more information, see Defining Connections in the AWS Glue Data Catalog in the AWS Glue Developer Guide.

The tutorial assumes that you are familiar with AWS Identity and Access Management (IAM). For information about IAM, see the IAM User Guide.

To get started, proceed to Step 1: Create a data analyst user.

Step 1: Create a data analyst user

In this step, you create an AWS Identity and Access Management (IAM) user to be the data analyst for your data lake in AWS Lake Formation.

This user has the minimum set of permissions to query the data lake.

  1. Open the IAM console at https://console.aws.amazon.com/iam/. Sign in as the administrator user that you created in Create a user with administrative access or as a user with the AdministratorAccess AWS managed policy.

  2. Create a user named datalake_user with the following settings:

    • Enable AWS Management Console access.

    • Set a password and do not require password reset.

    • Attach the AmazonAthenaFullAccess AWS managed policy.

    • Attach the following inline policy. Name the policy DatalakeUserBasic.

      { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "lakeformation:GetDataAccess", "glue:GetTable", "glue:GetTables", "glue:SearchTables", "glue:GetDatabase", "glue:GetDatabases", "glue:GetPartitions", "lakeformation:GetResourceLFTags", "lakeformation:ListLFTags", "lakeformation:GetLFTag", "lakeformation:SearchTablesByLFTags", "lakeformation:SearchDatabasesByLFTags" ], "Resource": "*" } ] }

Step 2: Create a connection in AWS Glue

Note

Skip this step if you already have an AWS Glue connection to your JDBC data source.

AWS Lake Formation accesses JDBC data sources through an AWS Glue connection. A connection is a Data Catalog object that contains all the information required to connect to the data source. You can create a connection using the AWS Glue console.

To create a connection
  1. Open the AWS Glue console at https://console.aws.amazon.com/glue/, and sign in as the administrator user that you created in Create a user with administrative access.

  2. In the navigation pane, under Data catalog, choose Connections.

  3. On the Connectors page, choose Create custom connector.

  4. On the Connector properties page, enter datalake-tutorial as the connection name, and choose JDBC as the connection type. Then choose Next.

  5. Continue through the connection wizard and save the connection.

    For information on creating a connection, see AWS Glue JDBC connection properties in the AWS Glue Developer Guide.
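
If you script your setup instead, the sketch below creates an equivalent connection through the AWS Glue API with Boto3. The JDBC URL, user name, and password are placeholders for your own data store, and VPC-specific settings are omitted.

    import boto3

    glue = boto3.client("glue")

    glue.create_connection(
        ConnectionInput={
            "Name": "datalake-tutorial",
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                # Placeholders; substitute the values for your data store.
                "JDBC_CONNECTION_URL": "jdbc:mysql://host.example.com:3306/mydb",
                "USERNAME": "db_user",
                "PASSWORD": "db_password",
            },
            # For a data store in a VPC, also supply
            # PhysicalConnectionRequirements (subnet, security groups,
            # and Availability Zone).
        }
    )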

Step 3: Create an Amazon S3 bucket for the data lake

In this step, you create the Amazon Simple Storage Service (Amazon S3) bucket that is to be the root location of your data lake.

  1. Open the Amazon S3 console at https://console.aws.amazon.com/s3/ and sign in as the administrator user that you created in Create a user with administrative access.

  2. Choose Create bucket, and go through the wizard to create a bucket named <yourName>-datalake-tutorial, where <yourName> is your first initial and last name. For example: jdoe-datalake-tutorial.

    For detailed instructions on creating an Amazon S3 bucket, see How Do I Create an S3 Bucket? in the Amazon Simple Storage Service User Guide.
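
Equivalently, you can create the bucket with Boto3, as in this sketch. The bucket name is the example from the step above; outside us-east-1, Amazon S3 requires an explicit LocationConstraint.

    import boto3

    s3 = boto3.client("s3")

    # Replace the bucket name with <yourName>-datalake-tutorial.
    # In us-east-1, omit CreateBucketConfiguration entirely.
    s3.create_bucket(
        Bucket="jdoe-datalake-tutorial",
        CreateBucketConfiguration={"LocationConstraint": "us-east-2"},
    )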

Step 4: Register an Amazon S3 path

In this step, you register an Amazon Simple Storage Service (Amazon S3) path as the root location of your data lake.

  1. Open the Lake Formation console at https://console.aws.amazon.com/lakeformation/. Sign in as the data lake administrator.

  2. In the navigation pane, under Administration, choose Data lake locations.

  3. Choose Register location, and then choose Browse.

  4. Select the <yourName>-datalake-tutorial bucket that you created previously, accept the default IAM role AWSServiceRoleForLakeFormationDataAccess, and then choose Register location.

    For more information about registering locations, see Adding an Amazon S3 location to your data lake.
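
The registration can also be done through the Lake Formation API. A minimal Boto3 sketch, using the same service-linked role as the console procedure (the bucket name is the example from the earlier step):

    import boto3

    lf = boto3.client("lakeformation")

    # Register the bucket as a data lake location using the
    # AWSServiceRoleForLakeFormationDataAccess service-linked role.
    lf.register_resource(
        ResourceArn="arn:aws:s3:::jdoe-datalake-tutorial",
        UseServiceLinkedRole=True,
    )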

Step 5: Grant data location permissions

Principals must have data location permissions on a data lake location to create Data Catalog tables or databases that point to that location. You must grant data location permissions to the IAM role for workflows so that the workflow can write to the data ingestion destination.

  1. On the Lake Formation console, in the navigation pane, under Permissions, choose Data locations.

  2. Choose Grant, and in the Grant permissions dialog box, do the following:

    1. For IAM users and roles, choose LakeFormationWorkflowRole.

    2. For Storage locations, choose your <yourName>-datalake-tutorial bucket.

  3. Choose Grant.

For more information about data location permissions, see Underlying data access control.
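
The same grant can be scripted with Boto3, as sketched below. Replace <account-id> with a valid AWS account number; the bucket name is the tutorial example.

    import boto3

    lf = boto3.client("lakeformation")

    # Give the workflow role permission to write to the data lake location.
    lf.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier":
                "arn:aws:iam::<account-id>:role/LakeFormationWorkflowRole"
        },
        Resource={
            "DataLocation": {"ResourceArn": "arn:aws:s3:::jdoe-datalake-tutorial"}
        },
        Permissions=["DATA_LOCATION_ACCESS"],
    )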

Step 6: Create a database in the Data Catalog

Metadata tables in the Lake Formation Data Catalog are stored within a database.

  1. On the Lake Formation console, in the navigation pane, under Data catalog, choose Databases.

  2. Choose Create database, and under Database details, enter the name lakeformation_tutorial.

  3. Leave the other fields blank, and choose Create database.
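
Because Lake Formation databases are AWS Glue Data Catalog databases, you can create the database through the Glue API as well. A one-call Boto3 sketch:

    import boto3

    glue = boto3.client("glue")

    # Create the tutorial database in the Data Catalog.
    glue.create_database(DatabaseInput={"Name": "lakeformation_tutorial"})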

Step 7: Grant data permissions

You must grant permissions to create metadata tables in the Data Catalog. Because the workflow runs with the role LakeFormationWorkflowRole, you must grant these permissions to the role.

  1. On the Lake Formation console, in the navigation pane, under Permissions, choose Data lake permissions.

  2. Choose Grant, and in the Grant data permissions dialog box, do the following:

    1. Under Principals, for IAM users and roles, choose LakeFormationWorkflowRole.

    2. Under LF-Tags or catalog resources, choose Named data catalog resources.

    3. For Databases, choose the database that you created previously, lakeformation_tutorial.

    4. Under Database permissions, select Create table, Alter, and Drop, and clear Super if it is selected.

  3. Choose Grant.

For more information about granting Lake Formation permissions, see Overview of Lake Formation permissions.
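
Scripted with Boto3, the grant looks like the following sketch. Replace <account-id> with a valid AWS account number.

    import boto3

    lf = boto3.client("lakeformation")

    # Let the workflow role create, alter, and drop tables in the database.
    lf.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier":
                "arn:aws:iam::<account-id>:role/LakeFormationWorkflowRole"
        },
        Resource={"Database": {"Name": "lakeformation_tutorial"}},
        Permissions=["CREATE_TABLE", "ALTER", "DROP"],
    )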

Step 8: Use a blueprint to create a workflow

The AWS Lake Formation workflow generates the AWS Glue jobs, crawlers, and triggers that discover and ingest data into your data lake. You create a workflow based on one of the predefined Lake Formation blueprints.

  1. On the Lake Formation console, in the navigation pane, choose Blueprints, and then choose Use blueprint.

  2. On the Use a blueprint page, under Blueprint type, choose Database snapshot.

  3. Under Import source, for Database connection, choose the connection that you just created, datalake-tutorial, or choose an existing connection for your data source.

  4. For Source data path, enter the path from which to ingest data, in the form <database>/<schema>/<table>.

    You can substitute the percent (%) wildcard for schema or table. For databases that support schemas, enter <database>/<schema>/% to match all tables in <schema> within <database>. Oracle Database and MySQL don’t support schema in the path; instead, enter <database>/%. For Oracle Database, <database> is the system identifier (SID).

    For example, if an Oracle database has orcl as its SID, enter orcl/% to match all tables that the user specified in the JDBC connection has access to.

    Important

    This field is case-sensitive.

  5. Under Import target, specify these parameters:

    • Target database: lakeformation_tutorial

    • Target storage location: s3://<yourName>-datalake-tutorial

    • Data format: Choose Parquet or CSV
  6. For Import frequency, choose Run on demand.

  7. Under Import options, specify these parameters:

    • Workflow name: lakeformationjdbctest

    • IAM role: LakeFormationWorkflowRole

    • Table prefix: jdbctest

    Note

    The table prefix must be lowercase.

  8. Choose Create, and wait for the console to report that the workflow was successfully created.

    Tip

    Did you get the following error message?

    User: arn:aws:iam::<account-id>:user/<datalake_administrator_user> is not authorized to perform: iam:PassRole on resource: arn:aws:iam::<account-id>:role/LakeFormationWorkflowRole...

    If so, check that you replaced <account-id> in the inline policy for the data lake administrator user with a valid AWS account number.

Step 9: Run the workflow

Because you specified that the workflow runs on demand, you must start it manually in AWS Lake Formation.

  1. On the Lake Formation console, on the Blueprints page, select the workflow lakeformationjdbctest.

  2. Choose Actions, and then choose Start.

  3. As the workflow runs, view its progress in the Last run status column. Choose the refresh button occasionally.

    The status goes from RUNNING to Discovering to Importing to COMPLETED.

    When the workflow is complete:

    • The Data Catalog has new metadata tables.

    • Your data is ingested into the data lake.

    If the workflow fails, do the following:

    1. Select the workflow. Choose Actions, and then choose View graph.

      The workflow opens in the AWS Glue console.

    2. Select the workflow and choose the History tab.

    3. Select the most recent run and choose View run details.

    4. Select a failed job or crawler in the dynamic (runtime) graph, and review the error message. Failed nodes are either red or yellow.
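
A blueprint materializes as an AWS Glue workflow with the name you chose, so you can also start and monitor the run programmatically. A minimal Boto3 polling sketch, assuming the tutorial's workflow name:

    import time
    import boto3

    glue = boto3.client("glue")

    # Start a run of the workflow that the blueprint created.
    run_id = glue.start_workflow_run(Name="lakeformationjdbctest")["RunId"]

    # Poll until the run leaves the RUNNING state.
    while True:
        run = glue.get_workflow_run(
            Name="lakeformationjdbctest", RunId=run_id
        )["Run"]
        if run["Status"] != "RUNNING":
            break
        time.sleep(30)

    print(run["Status"], run.get("Statistics"))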

Step 10: Grant SELECT on the tables

You must grant the SELECT permission on the new Data Catalog tables in AWS Lake Formation so that the data analyst can query the data that the tables point to.

Note

A workflow automatically grants the SELECT permission on the tables that it creates to the user who ran it. Because the data lake administrator ran this workflow, you must grant SELECT to the data analyst.

  1. On the Lake Formation console, in the navigation pane, under Permissions, choose Data lake permissions.

  2. Choose Grant, and in the Grant data permissions dialog box, do the following:

    1. Under Principals, for IAM users and roles, choose datalake_user.

    2. Under LF-Tags or catalog resources, choose Named data catalog resources.

    3. For Databases, choose lakeformation_tutorial.

      The Tables list populates.

    4. For Tables, choose one or more tables from your data source.

    5. Under Table and column permissions, choose Select.

  3. Choose Grant.
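
The grant can also be scripted. The Boto3 sketch below uses a table wildcard, which grants SELECT on every table in the database; this is slightly broader than choosing individual tables in the console. Replace <account-id> with a valid AWS account number.

    import boto3

    lf = boto3.client("lakeformation")

    # Grant SELECT on all tables in the tutorial database to the analyst.
    lf.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier":
                "arn:aws:iam::<account-id>:user/datalake_user"
        },
        Resource={
            "Table": {
                "DatabaseName": "lakeformation_tutorial",
                "TableWildcard": {},
            }
        },
        Permissions=["SELECT"],
    )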

The next step is performed as the data analyst.

Step 11: Query the data lake using Amazon Athena

Use the Amazon Athena console to query the data in your data lake.

  1. Open the Athena console at https://console.aws.amazon.com/athena/, and sign in as the data analyst, user datalake_user.

  2. If necessary, choose Get Started to continue to the Athena query editor.

  3. For Data source, choose AwsDataCatalog.

  4. For Database, choose lakeformation_tutorial.

    The Tables list populates.

  5. In the pop-up menu beside one of the tables, choose Preview table.

    The query runs and displays 10 rows of data.
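
Preview table simply runs a SELECT ... LIMIT 10 query. If you want to run the same query programmatically as datalake_user, the following Boto3 sketch does so; the table name and the results location are placeholders.

    import time
    import boto3

    athena = boto3.client("athena")

    # Start the query; Athena writes results to the S3 location you specify.
    qid = athena.start_query_execution(
        QueryString='SELECT * FROM "lakeformation_tutorial"."your_table" LIMIT 10',
        QueryExecutionContext={"Database": "lakeformation_tutorial"},
        ResultConfiguration={
            "OutputLocation": "s3://jdoe-datalake-tutorial/athena-results/"
        },
    )["QueryExecutionId"]

    # Wait for the query to finish.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    # Print the rows (the first row is the column headers).
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])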

Step 12: Query the data in the data lake using Amazon Redshift Spectrum

You can set up Amazon Redshift Spectrum to query the data that you imported into your Amazon Simple Storage Service (Amazon S3) data lake. First, create an AWS Identity and Access Management (IAM) role that is used to launch the Amazon Redshift cluster and to query the Amazon S3 data. Grant this role Select permission on the tables that you want to query, and grant the data analyst permission to use the Amazon Redshift query editor. Finally, create an Amazon Redshift cluster and run queries.

You create the cluster as an administrator, and query the cluster as a data analyst.

For more information about Amazon Redshift Spectrum, see Using Amazon Redshift Spectrum to Query External Data in the Amazon Redshift Database Developer Guide.

To set up permissions to run Amazon Redshift queries
  1. Open the IAM console at https://console.aws.amazon.com/iam/. Sign in as the administrator user that you created in Create a user with administrative access (user name Administrator) or as a user with the AdministratorAccess AWS managed policy.

  2. In the navigation pane, choose Policies.

    If this is your first time choosing Policies, the Welcome to Managed Policies page appears. Choose Get Started.

  3. Choose Create policy.

  4. Choose the JSON tab.

  5. Paste in the following JSON policy document.

    { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "lakeformation:GetDataAccess", "glue:GetTable", "glue:GetTables", "glue:SearchTables", "glue:GetDatabase", "glue:GetDatabases", "glue:GetPartitions", "lakeformation:GetResourceLFTags", "lakeformation:ListLFTags", "lakeformation:GetLFTag", "lakeformation:SearchTablesByLFTags", "lakeformation:SearchDatabasesByLFTags" ], "Resource": "*" } ] }
  6. When you are finished, choose Review to review the policy. The policy validator reports any syntax errors.

  7. On the Review policy page, enter the Name as RedshiftLakeFormationPolicy for the policy that you are creating. Enter a Description (optional). Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.

  8. In the navigation pane of the IAM console, choose Roles, and then choose Create role.

  9. For Select trusted entity, choose AWS service.

  10. Choose the Amazon Redshift service to assume this role.

  11. Choose the Redshift Customizable use case for your service. Then choose Next: Permissions.

  12. Search for the permissions policy that you created, RedshiftLakeFormationPolicy, and select the check box next to the policy name in the list.

  13. Choose Next: Tags.

  14. Choose Next: Review.

  15. For Role name, enter the name RedshiftLakeFormationRole.

  16. (Optional) For Role description, enter a description for the new role.

  17. Review the role, and then choose Create role.
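
The policy and role can be created programmatically as well. A Boto3 sketch equivalent to the console steps above:

    import json
    import boto3

    iam = boto3.client("iam")

    # The same policy document as in step 5 above.
    policy_doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess", "glue:GetTable", "glue:GetTables",
                "glue:SearchTables", "glue:GetDatabase", "glue:GetDatabases",
                "glue:GetPartitions", "lakeformation:GetResourceLFTags",
                "lakeformation:ListLFTags", "lakeformation:GetLFTag",
                "lakeformation:SearchTablesByLFTags",
                "lakeformation:SearchDatabasesByLFTags",
            ],
            "Resource": "*",
        }],
    }
    policy_arn = iam.create_policy(
        PolicyName="RedshiftLakeFormationPolicy",
        PolicyDocument=json.dumps(policy_doc),
    )["Policy"]["Arn"]

    # Trust policy that lets Amazon Redshift assume the role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    iam.create_role(
        RoleName="RedshiftLakeFormationRole",
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )
    iam.attach_role_policy(
        RoleName="RedshiftLakeFormationRole",
        PolicyArn=policy_arn,
    )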

To grant Select permissions on the table to be queried in the Lake Formation database
  1. Open the Lake Formation console at https://console.aws.amazon.com/lakeformation/. Sign in as the data lake administrator.

  2. In the navigation pane, under Permissions, choose Data lake permissions, and then choose Grant.

  3. Provide the following information:

    • For IAM users and roles, choose the IAM role you created, RedshiftLakeFormationRole. When you run the Amazon Redshift query editor, it uses this IAM role to access the data.

    • For Database, choose lakeformation_tutorial.

      The tables list populates.

    • For Table, choose a table within the data source to query.

    • Choose the Select table permission.

  4. Choose Grant.

To set up Amazon Redshift Spectrum and run queries
  1. Open the Amazon Redshift console at https://console.aws.amazon.com/redshift/. Sign in as the user Administrator.

  2. Choose Create cluster.

  3. On the Create cluster page, enter redshift-lakeformation-demo for the Cluster identifier.

  4. For the Node type, select dc2.large.

  5. Scroll down, and under Database configurations, enter or accept these parameters:

    • Admin user name: awsuser

    • Admin user password: (Choose a password)

  6. Expand Cluster permissions, and for Available IAM roles, choose RedshiftLakeFormationRole. Then choose Add IAM role.

  7. If you must use a different port than the default value of 5439, next to Additional configurations, turn off the Use defaults option. Expand the section for Database configurations, and enter a new Database port number.

  8. Choose Create cluster.

    The Clusters page loads.

  9. Wait until the cluster status becomes Available. Choose the refresh icon periodically.

  10. Grant the data analyst permission to run queries against the cluster. To do so, complete the following steps.

    1. Open the IAM console at https://console.aws.amazon.com/iam/, and sign in as the Administrator user.

    2. In the navigation pane, choose Users, and attach the following managed policies to the user datalake_user.

      • AmazonRedshiftQueryEditor

      • AmazonRedshiftReadOnlyAccess

  11. Sign out of the Amazon Redshift console and sign back in as user datalake_user.

  12. In the left vertical toolbar, choose the EDITOR icon to open the query editor and connect to the cluster. If the Connect to database dialog box appears, choose the cluster name redshift-lakeformation-demo, and enter the database name dev, the user name awsuser, and the password that you created. Then choose Connect to database.

    Note

    If you are not prompted for connection parameters and another cluster is already selected in the query editor, choose Change Connection to open the Connect to database dialog box.

  13. In the New Query 1 text box, enter and run the following statement to map the database lakeformation_tutorial in Lake Formation to the Amazon Redshift schema name redshift_jdbc:

    Important

    Replace <account-id> with a valid AWS account number, and <region> with a valid AWS Region name (for example, us-east-1).

    create external schema if not exists redshift_jdbc
    from DATA CATALOG
    database 'lakeformation_tutorial'
    iam_role 'arn:aws:iam::<account-id>:role/RedshiftLakeFormationRole'
    region '<region>';
  14. In the schema list under Select schema, choose redshift_jdbc.

    The tables list populates. The query editor shows only the tables on which you were granted Lake Formation data lake permissions.

  15. On the pop-up menu next to a table name, choose Preview data.

    Amazon Redshift returns the first 10 rows.

    You can now run queries against the tables and columns for which you have permissions.
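
Cluster creation can also be scripted. The Boto3 sketch below assumes a single-node cluster (the console steps don't fix a node count) and a placeholder password; replace <account-id> with a valid AWS account number.

    import boto3

    redshift = boto3.client("redshift")

    redshift.create_cluster(
        ClusterIdentifier="redshift-lakeformation-demo",
        ClusterType="single-node",  # assumption; one dc2.large node
        NodeType="dc2.large",
        DBName="dev",
        MasterUsername="awsuser",
        MasterUserPassword="REPLACE-with-a-password",  # placeholder
        IamRoles=["arn:aws:iam::<account-id>:role/RedshiftLakeFormationRole"],
    )

    # Block until the cluster status becomes Available.
    redshift.get_waiter("cluster_available").wait(
        ClusterIdentifier="redshift-lakeformation-demo"
    )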

Step 13: Grant or revoke Lake Formation permissions using Amazon Redshift Spectrum

Amazon Redshift supports the ability to grant and revoke Lake Formation permissions on databases and tables using modified SQL statements. These statements are similar to the existing Amazon Redshift statements. For more information, see GRANT and REVOKE in the Amazon Redshift Database Developer Guide.
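
For example, you could issue such a statement through the Amazon Redshift Data API. In the Boto3 sketch below, the table name is a placeholder, and the GRANT syntax follows the Amazon Redshift GRANT reference for external tables; replace <account-id> as usual.

    import boto3

    rsd = boto3.client("redshift-data")

    # Grant SELECT on an external (Lake Formation) table to an IAM role.
    rsd.execute_statement(
        ClusterIdentifier="redshift-lakeformation-demo",
        Database="dev",
        DbUser="awsuser",
        Sql=(
            "grant select on external table redshift_jdbc.your_table "
            "to iam_role 'arn:aws:iam::<account-id>:role/RedshiftLakeFormationRole';"
        ),
    )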