Creating a data lake from a JDBC source in Lake Formation
This tutorial guides you through the steps to take on the AWS Lake Formation console to create and load your first data lake from a JDBC source using Lake Formation.
Topics
- Intended audience
- JDBC tutorial prerequisites
- Step 1: Create a data analyst user
- Step 2: Create a connection in AWS Glue
- Step 3: Create an Amazon S3 bucket for the data lake
- Step 4: Register an Amazon S3 path
- Step 5: Grant data location permissions
- Step 6: Create a database in the Data Catalog
- Step 7: Grant data permissions
- Step 8: Use a blueprint to create a workflow
- Step 9: Run the workflow
- Step 10: Grant SELECT on the tables
- Step 11: Query the data lake using Amazon Athena
- Step 12: Query the data in the data lake using Amazon Redshift Spectrum
- Step 13: Grant or revoke Lake Formation permissions using Amazon Redshift Spectrum
Intended audience
The following table lists the roles that are used in this AWS Lake Formation JDBC tutorial.
| Role | Description |
| --- | --- |
| IAM administrator | A user who can create AWS Identity and Access Management (IAM) users and roles and Amazon Simple Storage Service (Amazon S3) buckets. Has the AdministratorAccess AWS managed policy. |
| Data lake administrator | A user who can access the Data Catalog, create databases, and grant Lake Formation permissions to other users. Has fewer IAM permissions than the IAM administrator, but enough to administer the data lake. |
| Data analyst | A user who can run queries against the data lake. Has only enough permissions to run queries. |
| Workflow role | A role with the required IAM policies to run a workflow. |
For information about prerequisites for completing the tutorial, see JDBC tutorial prerequisites.
JDBC tutorial prerequisites
Before you begin the AWS Lake Formation JDBC tutorial, ensure that you've done the following:
- Complete the tasks in Getting started with Lake Formation.
- Decide on a JDBC-accessible data store that you want to use for the tutorial.
- Gather the information that is required to create an AWS Glue connection of type JDBC. This Data Catalog object includes the URL to the data store, login credentials, and, if the data store was created in an Amazon Virtual Private Cloud (Amazon VPC), additional VPC-specific configuration information. For more information, see Defining Connections in the AWS Glue Data Catalog in the AWS Glue Developer Guide.
The tutorial assumes that you are familiar with AWS Identity and Access Management (IAM). For information about IAM, see the IAM User Guide.
To get started, proceed to Step 1: Create a data analyst user.
Step 1: Create a data analyst user
In this step, you create an AWS Identity and Access Management (IAM) user to be the data analyst for your data lake in AWS Lake Formation.
This user has the minimum set of permissions to query the data lake.
- Open the IAM console at https://console.aws.amazon.com/iam/. Sign in as the administrator user that you created in Create a user with administrative access, or as a user with the AdministratorAccess AWS managed policy.
- Create a user named datalake_user with the following settings:
  - Enable AWS Management Console access.
  - Set a password, and do not require a password reset.
  - Attach the AmazonAthenaFullAccess AWS managed policy.
  - Attach the following inline policy. Name the policy DatalakeUserBasic.

    ```json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "lakeformation:GetDataAccess",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:SearchTables",
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetPartitions",
                    "lakeformation:GetResourceLFTags",
                    "lakeformation:ListLFTags",
                    "lakeformation:GetLFTag",
                    "lakeformation:SearchTablesByLFTags",
                    "lakeformation:SearchDatabasesByLFTags"
                ],
                "Resource": "*"
            }
        ]
    }
    ```
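If you prefer to script this step, the following is a minimal boto3 sketch, assuming that you run it with IAM administrator credentials. The user name and inline policy match the values above; the console password is a placeholder that you should replace.

```python
import json
import boto3

iam = boto3.client("iam")

# Create the data analyst user with console access.
iam.create_user(UserName="datalake_user")
iam.create_login_profile(
    UserName="datalake_user",
    Password="Choose-A-Strong-Password-1",  # placeholder; choose your own
    PasswordResetRequired=False,
)

# Attach the AmazonAthenaFullAccess managed policy.
iam.attach_user_policy(
    UserName="datalake_user",
    PolicyArn="arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
)

# Add the DatalakeUserBasic inline policy shown above.
datalake_user_basic = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "lakeformation:GetDataAccess",
            "glue:GetTable", "glue:GetTables", "glue:SearchTables",
            "glue:GetDatabase", "glue:GetDatabases", "glue:GetPartitions",
            "lakeformation:GetResourceLFTags", "lakeformation:ListLFTags",
            "lakeformation:GetLFTag", "lakeformation:SearchTablesByLFTags",
            "lakeformation:SearchDatabasesByLFTags",
        ],
        "Resource": "*",
    }],
}
iam.put_user_policy(
    UserName="datalake_user",
    PolicyName="DatalakeUserBasic",
    PolicyDocument=json.dumps(datalake_user_basic),
)
```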
Step 2: Create a connection in AWS Glue
Note
Skip this step if you already have an AWS Glue connection to your JDBC data source.
AWS Lake Formation accesses JDBC data sources through an AWS Glue connection. A connection is a Data Catalog object that contains all the information required to connect to the data source. You can create a connection using the AWS Glue console.
To create a connection
- Open the AWS Glue console at https://console.aws.amazon.com/glue/, and sign in as the administrator user that you created in Create a user with administrative access.
- In the navigation pane, under Data catalog, choose Connections.
- On the Connectors page, choose Create custom connector.
- On the Connector properties page, enter datalake-tutorial as the connection name, and choose JDBC as the connection type. Then choose Next.
- Continue through the connection wizard and save the connection.
For information on creating a connection, see AWS Glue JDBC connection properties in the AWS Glue Developer Guide.
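You can also create the connection programmatically. Here is a minimal boto3 sketch; the JDBC URL, credentials, and VPC values are placeholders for your own data store.

```python
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "datalake-tutorial",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://myhost.example.com:3306/mydb",  # placeholder
            "USERNAME": "db_user",      # placeholder
            "PASSWORD": "db_password",  # placeholder
        },
        # Required only if the data store is in a VPC:
        # "PhysicalConnectionRequirements": {
        #     "SubnetId": "subnet-0123456789abcdef0",
        #     "SecurityGroupIdList": ["sg-0123456789abcdef0"],
        #     "AvailabilityZone": "us-east-1a",
        # },
    }
)
```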
Step 3: Create an Amazon S3 bucket for the data lake
In this step, you create the Amazon Simple Storage Service (Amazon S3) bucket that is to be the root location of your data lake.
- Open the Amazon S3 console at https://console.aws.amazon.com/s3/, and sign in as the administrator user that you created in Create a user with administrative access.
- Choose Create bucket, and go through the wizard to create a bucket named <yourName>-datalake-tutorial, where <yourName> is your first initial and last name. For example: jdoe-datalake-tutorial.

For detailed instructions on creating an Amazon S3 bucket, see How Do I Create an S3 Bucket? in the Amazon Simple Storage Service User Guide.
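The same step from code, as a short boto3 sketch; the bucket name is the example value and must be globally unique:

```python
import boto3

s3 = boto3.client("s3")

# Bucket names are global, so substitute your own initial and last name.
s3.create_bucket(
    Bucket="jdoe-datalake-tutorial",
    # Outside us-east-1, a location constraint is required:
    # CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
```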
Step 4: Register an Amazon S3 path
In this step, you register an Amazon Simple Storage Service (Amazon S3) path as the root location of your data lake.
- Open the Lake Formation console at https://console.aws.amazon.com/lakeformation/. Sign in as the data lake administrator.
- In the navigation pane, under Administration, choose Data lake locations.
- Choose Register location, and then choose Browse.
- Select the <yourName>-datalake-tutorial bucket that you created previously, accept the default IAM role AWSServiceRoleForLakeFormationDataAccess, and then choose Register location.

For more information about registering locations, see Adding an Amazon S3 location to your data lake.
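Registration can also be scripted. A minimal boto3 sketch, using the service-linked role to match the console default; the bucket name is a placeholder:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Register the bucket as a data lake location using the
# AWSServiceRoleForLakeFormationDataAccess service-linked role.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::jdoe-datalake-tutorial",
    UseServiceLinkedRole=True,
)
```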
Step 5: Grant data location permissions
Principals must have data location permissions on a data lake location to create Data Catalog tables or databases that point to that location. You must grant data location permissions to the IAM role for workflows so that the workflow can write to the data ingestion destination.
- On the Lake Formation console, in the navigation pane, under Permissions, choose Data locations.
- Choose Grant, and in the Grant permissions dialog box, do the following:
  - For IAM user and roles, choose LakeFormationWorkflowRole.
  - For Storage locations, choose your <yourName>-datalake-tutorial bucket.
- Choose Grant.

For more information about data location permissions, see Underlying data access control.
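The equivalent grant through the API, as a boto3 sketch; the account ID and bucket name are placeholders:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant the workflow role permission to write to the data lake location.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/LakeFormationWorkflowRole"
    },
    Resource={
        "DataLocation": {"ResourceArn": "arn:aws:s3:::jdoe-datalake-tutorial"}
    },
    Permissions=["DATA_LOCATION_ACCESS"],
)
```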
Step 6: Create a database in the Data Catalog
Metadata tables in the Lake Formation Data Catalog are stored within a database.
- On the Lake Formation console, in the navigation pane, under Data catalog, choose Databases.
- Choose Create database, and under Database details, enter the name lakeformation_tutorial.
- Leave the other fields blank, and choose Create database.
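Because Lake Formation databases are Data Catalog databases, a boto3 sketch of this step uses the AWS Glue API:

```python
import boto3

glue = boto3.client("glue")

# Create the Data Catalog database for the tutorial.
glue.create_database(DatabaseInput={"Name": "lakeformation_tutorial"})
```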
Step 7: Grant data permissions
You must grant permissions to create metadata tables in the Data Catalog. Because the workflow runs with the role LakeFormationWorkflowRole, you must grant these permissions to the role.
- On the Lake Formation console, in the navigation pane, under Permissions, choose Data lake permissions.
- Choose Grant, and in the Grant data permissions dialog box, do the following:
  - Under Principals, for IAM user and roles, choose LakeFormationWorkflowRole.
  - Under LF-Tags or catalog resources, choose Named data catalog resources.
  - For Databases, choose the database that you created previously, lakeformation_tutorial.
  - Under Database permissions, select Create table, Alter, and Drop, and clear Super if it is selected.
- Choose Grant.

For more information about granting Lake Formation permissions, see Overview of Lake Formation permissions.
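The same database-level grant as a boto3 sketch; the account ID is a placeholder:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Let the workflow role create and manage tables in the tutorial database.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/LakeFormationWorkflowRole"
    },
    Resource={"Database": {"Name": "lakeformation_tutorial"}},
    Permissions=["CREATE_TABLE", "ALTER", "DROP"],
)
```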
Step 8: Use a blueprint to create a workflow
The AWS Lake Formation workflow generates the AWS Glue jobs, crawlers, and triggers that discover and ingest data into your data lake. You create a workflow based on one of the predefined Lake Formation blueprints.
- On the Lake Formation console, in the navigation pane, choose Blueprints, and then choose Use blueprint.
- On the Use a blueprint page, under Blueprint type, choose Database snapshot.
- Under Import source, for Database connection, choose the connection that you just created, datalake-tutorial, or choose an existing connection for your data source.
- For Source data path, enter the path from which to ingest data, in the form <database>/<schema>/<table>.

  You can substitute the percent (%) wildcard for schema or table. For databases that support schemas, enter <database>/<schema>/% to match all tables in <schema> within <database>. Oracle Database and MySQL don't support schema in the path; instead, enter <database>/%. For Oracle Database, <database> is the system identifier (SID).

  For example, if an Oracle database has orcl as its SID, enter orcl/% to match all tables that the user specified in the JDBC connection has access to.

  Important: This field is case-sensitive.
- Under Import target, specify these parameters:

  | Parameter | Value |
  | --- | --- |
  | Target database | lakeformation_tutorial |
  | Target storage location | s3://<yourName>-datalake-tutorial |
  | Data format | (Choose Parquet or CSV) |

- For Import frequency, choose Run on demand.
- Under Import options, specify these parameters:

  | Parameter | Value |
  | --- | --- |
  | Workflow name | lakeformationjdbctest |
  | IAM role | LakeFormationWorkflowRole |
  | Table prefix | jdbctest (Note: must be lowercase.) |

- Choose Create, and wait for the console to report that the workflow was successfully created.
Tip

Did you get the following error message?

    User: arn:aws:iam::<account-id>:user/<datalake_administrator_user> is not authorized to perform: iam:PassRole on resource:arn:aws:iam::<account-id>:role/LakeFormationWorkflowRole...

If so, check that you replaced <account-id> in the inline policy for the data lake administrator user with a valid AWS account number.
Step 9: Run the workflow
Because you specified that the workflow runs on demand, you must manually start the workflow in AWS Lake Formation.
- On the Lake Formation console, on the Blueprints page, select the workflow lakeformationjdbctest.
- Choose Actions, and then choose Start.
- As the workflow runs, view its progress in the Last run status column. Choose the refresh button occasionally.

  The status goes from RUNNING, to Discovering, to Importing, to COMPLETED.

When the workflow is complete:

- The Data Catalog has new metadata tables.
- Your data is ingested into the data lake.
If the workflow fails, do the following:
- Select the workflow. Choose Actions, and then choose View graph.

  The workflow opens in the AWS Glue console.

- Select the workflow and choose the History tab.
- Select the most recent run and choose View run details.
- Select a failed job or crawler in the dynamic (runtime) graph, and review the error message. Failed nodes are either red or yellow.
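Because the blueprint generates an AWS Glue workflow behind the scenes, you can also start and poll the run from a script. A minimal boto3 sketch, assuming the workflow name lakeformationjdbctest from the previous step:

```python
import time
import boto3

glue = boto3.client("glue")

# Start the workflow that the blueprint generated.
run_id = glue.start_workflow_run(Name="lakeformationjdbctest")["RunId"]

# Poll until the run reaches a terminal state.
while True:
    run = glue.get_workflow_run(Name="lakeformationjdbctest", RunId=run_id)["Run"]
    status = run["Status"]
    print(status)
    if status in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(30)
```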
Step 10: Grant SELECT on the tables
You must grant the SELECT
permission on the new Data Catalog tables in AWS Lake Formation so
that the data analyst can query the data that the tables point to.
Note
A workflow automatically grants the SELECT
permission on the tables that it
creates to the user who ran it. Because the data lake administrator ran this workflow, you
must grant SELECT
to the data analyst.
- On the Lake Formation console, in the navigation pane, under Permissions, choose Data lake permissions.
- Choose Grant, and in the Grant data permissions dialog box, do the following:
  - Under Principals, for IAM user and roles, choose datalake_user.
  - Under LF-Tags or catalog resources, choose Named data catalog resources.
  - For Databases, choose lakeformation_tutorial.

    The Tables list populates.

  - For Tables, choose one or more tables from your data source.
  - Under Table and column permissions, choose Select.
- Choose Grant.
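The same table-level grant as a boto3 sketch, run with data lake administrator credentials. The account ID and table name are placeholders; use the names the workflow created (they start with the jdbctest prefix).

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on one ingested table to the data analyst.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/datalake_user"
    },
    Resource={
        "Table": {
            "DatabaseName": "lakeformation_tutorial",
            "Name": "jdbctest_mydb_mytable",  # placeholder table name
        }
    },
    Permissions=["SELECT"],
)
```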
The next step is performed as the data analyst.
Step 11: Query the data lake using Amazon Athena
Use the Amazon Athena console to query the data in your data lake.
- Open the Athena console at https://console.aws.amazon.com/athena/, and sign in as the data analyst, user datalake_user.
- If necessary, choose Get Started to continue to the Athena query editor.
- For Data source, choose AwsDataCatalog.
- For Database, choose lakeformation_tutorial.

  The Tables list populates.

- In the pop-up menu beside one of the tables, choose Preview table.

  The query runs and displays 10 rows of data.
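You can run the equivalent query from code with the Athena API. A boto3 sketch; the table name and results location are placeholders, and Athena requires an S3 output location:

```python
import boto3

athena = boto3.client("athena")

# Run the same 10-row preview query programmatically.
response = athena.start_query_execution(
    QueryString='SELECT * FROM "lakeformation_tutorial"."jdbctest_mydb_mytable" LIMIT 10',
    QueryExecutionContext={"Database": "lakeformation_tutorial"},
    ResultConfiguration={"OutputLocation": "s3://jdoe-datalake-tutorial/athena-results/"},
)
print(response["QueryExecutionId"])
```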
Step 12: Query the data in the data lake using Amazon Redshift Spectrum
You can set up Amazon Redshift Spectrum to query the data that you imported into your Amazon Simple Storage Service (Amazon S3) data lake. First, create an AWS Identity and Access Management (IAM) role that is used to launch the Amazon Redshift cluster and to query the Amazon S3 data. Then, grant this role the Select permission on the tables that you want to query. Next, grant the user permissions to use the Amazon Redshift query editor. Finally, create an Amazon Redshift cluster and run queries.

You create the cluster as an administrator, and query the cluster as a data analyst.
For more information about Amazon Redshift Spectrum, see Using Amazon Redshift Spectrum to Query External Data in the Amazon Redshift Database Developer Guide.
To set up permissions to run Amazon Redshift queries
- Open the IAM console at https://console.aws.amazon.com/iam/. Sign in as the administrator user that you created in Create a user with administrative access (user name Administrator) or as a user with the AdministratorAccess AWS managed policy.
- In the navigation pane, choose Policies.

  If this is your first time choosing Policies, the Welcome to Managed Policies page appears. Choose Get Started.

- Choose Create policy.
- Choose the JSON tab.
- Paste in the following JSON policy document.

  ```json
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "lakeformation:GetDataAccess",
                  "glue:GetTable",
                  "glue:GetTables",
                  "glue:SearchTables",
                  "glue:GetDatabase",
                  "glue:GetDatabases",
                  "glue:GetPartitions",
                  "lakeformation:GetResourceLFTags",
                  "lakeformation:ListLFTags",
                  "lakeformation:GetLFTag",
                  "lakeformation:SearchTablesByLFTags",
                  "lakeformation:SearchDatabasesByLFTags"
              ],
              "Resource": "*"
          }
      ]
  }
  ```
- When you are finished, choose Review to review the policy. The policy validator reports any syntax errors.
- On the Review policy page, enter the Name as RedshiftLakeFormationPolicy for the policy that you are creating. Enter a Description (optional). Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.
- In the navigation pane of the IAM console, choose Roles, and then choose Create role.
- For Select trusted entity, choose AWS service.
- Choose the Amazon Redshift service to assume this role.
- Choose the Redshift Customizable use case for your service. Then choose Next: Permissions.
- Search for the permissions policy that you created, RedshiftLakeFormationPolicy, and select the check box next to the policy name in the list.
- Choose Next: Tags.
- Choose Next: Review.
- For Role name, enter the name RedshiftLakeFormationRole.
- (Optional) For Role description, enter a description for the new role.
- Review the role, and then choose Create role.
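As a scripted alternative, here is a boto3 sketch of the same role creation. The trust policy lets Amazon Redshift assume the role, and the permissions document is the JSON policy shown above, attached here as an inline policy:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets Amazon Redshift assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="RedshiftLakeFormationRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# The RedshiftLakeFormationPolicy permissions, from the JSON document above.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "lakeformation:GetDataAccess",
            "glue:GetTable", "glue:GetTables", "glue:SearchTables",
            "glue:GetDatabase", "glue:GetDatabases", "glue:GetPartitions",
            "lakeformation:GetResourceLFTags", "lakeformation:ListLFTags",
            "lakeformation:GetLFTag", "lakeformation:SearchTablesByLFTags",
            "lakeformation:SearchDatabasesByLFTags",
        ],
        "Resource": "*",
    }],
}
iam.put_role_policy(
    RoleName="RedshiftLakeFormationRole",
    PolicyName="RedshiftLakeFormationPolicy",
    PolicyDocument=json.dumps(policy_document),
)
```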
To grant Select permission on the table to be queried in the Lake Formation database
- Open the Lake Formation console at https://console.aws.amazon.com/lakeformation/. Sign in as the data lake administrator.
- In the navigation pane, under Permissions, choose Data lake permissions, and then choose Grant.
- Provide the following information:
  - For IAM users and roles, choose the IAM role that you created, RedshiftLakeFormationRole. When you run the Amazon Redshift query editor, it uses this IAM role for permission to the data.
  - For Database, choose lakeformation_tutorial.

    The Tables list populates.

  - For Table, choose a table within the data source to query.
  - Choose the Select table permission.
- Choose Grant.
To set up Amazon Redshift Spectrum and run queries
- Open the Amazon Redshift console at https://console.aws.amazon.com/redshift/. Sign in as the user Administrator.
- Choose Create cluster.
- On the Create cluster page, enter redshift-lakeformation-demo for the Cluster identifier.
- For the Node type, select dc2.large.
- Scroll down, and under Database configurations, enter or accept these parameters:
  - Admin user name: awsuser
  - Admin user password: (Choose a password)
- Expand Cluster permissions, and for Available IAM roles, choose RedshiftLakeFormationRole. Then choose Add IAM role.
- If you must use a different port than the default value of 5439, next to Additional configurations, turn off the Use defaults option. Expand the section for Database configurations, and enter a new Database port number.
- Choose Create cluster.

  The Clusters page loads.

- Wait until the cluster status becomes Available. Choose the refresh icon periodically.
- Grant the data analyst permission to run queries against the cluster. To do so, complete the following steps:
  - Open the IAM console at https://console.aws.amazon.com/iam/, and sign in as the Administrator user.
  - In the navigation pane, choose Users, and attach the following managed policies to the user datalake_user:
    - AmazonRedshiftQueryEditor
    - AmazonRedshiftReadOnlyAccess
- Sign out of the Amazon Redshift console and sign back in as user datalake_user.
- In the left vertical toolbar, choose the EDITOR icon to open the query editor and connect to the cluster. If the Connect to database dialog box appears, choose the cluster name redshift-lakeformation-demo, and enter the database name dev, the user name awsuser, and the password that you created. Then choose Connect to database.

  Note: If you are not prompted for connection parameters and another cluster is already selected in the query editor, choose Change Connection to open the Connect to database dialog box.

- In the New Query 1 text box, enter and run the following statement to map the database lakeformation_tutorial in Lake Formation to the Amazon Redshift schema name redshift_jdbc:

  Important: Replace <account-id> with a valid AWS account number, and <region> with a valid AWS Region name (for example, us-east-1).

  ```sql
  create external schema if not exists redshift_jdbc
  from DATA CATALOG
  database 'lakeformation_tutorial'
  iam_role 'arn:aws:iam::<account-id>:role/RedshiftLakeFormationRole'
  region '<region>';
  ```

- In the schema list under Select schema, choose redshift_jdbc.

  The Tables list populates. The query editor shows only the tables on which you were granted Lake Formation data lake permissions.

- On the pop-up menu next to a table name, choose Preview data.

  Amazon Redshift returns the first 10 rows.
You can now run queries against the tables and columns for which you have permissions.
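If you prefer to provision the cluster from code rather than the console, here is a minimal boto3 sketch. The cluster identifier, admin user name, and role match the values in this step; the password and account ID are placeholders:

```python
import boto3

redshift = boto3.client("redshift")

# Single-node dc2.large cluster with the Lake Formation role attached.
redshift.create_cluster(
    ClusterIdentifier="redshift-lakeformation-demo",
    NodeType="dc2.large",
    ClusterType="single-node",
    MasterUsername="awsuser",
    MasterUserPassword="Choose-A-Strong-Password-1",  # placeholder
    IamRoles=["arn:aws:iam::111122223333:role/RedshiftLakeFormationRole"],
)
```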
Step 13: Grant or revoke Lake Formation permissions using Amazon Redshift Spectrum
Amazon Redshift supports the ability to grant and revoke Lake Formation permissions on databases and tables using modified SQL statements. These statements are similar to the existing Amazon Redshift statements. For more information, see GRANT and REVOKE in the Amazon Redshift Database Developer Guide.
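For example, run from a script with the Amazon Redshift Data API, a grant might look like the following sketch. This is only an illustration: the GRANT ... TO IAM_ROLE form is the Lake Formation-aware variant described in the Amazon Redshift GRANT documentation, and the table name, account ID, and role ARN are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Grant Lake Formation SELECT on an external table to an IAM role.
redshift_data.execute_statement(
    ClusterIdentifier="redshift-lakeformation-demo",
    Database="dev",
    DbUser="awsuser",
    Sql=(
        "GRANT SELECT ON EXTERNAL TABLE redshift_jdbc.jdbctest_mydb_mytable "  # placeholder table
        "TO IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftLakeFormationRole';"
    ),
)
```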