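AWS CloudFormation for AWS Glue
AWS CloudFormation is a service that can create many AWS resources. AWS Glue provides API operations to create objects in the AWS Glue Data Catalog. However, it might be more convenient to define and create AWS Glue objects and other related AWS resource objects in an AWS CloudFormation template file. You can then automate the process of creating the objects.
AWS CloudFormation provides a simplified syntax, either JSON (JavaScript Object Notation) or YAML (YAML Ain't Markup Language), to express the creation of AWS resources. You can use AWS CloudFormation templates to define Data Catalog objects such as databases, tables, partitions, crawlers, classifiers, and connections. You can also define ETL objects such as jobs, triggers, and development endpoints. You create a template that describes all the AWS resources you want, and AWS CloudFormation provisions and configures those resources for you.
For more information, see What is AWS CloudFormation? and Working with AWS CloudFormation templates in the AWS CloudFormation User Guide.
If you plan to use AWS CloudFormation templates that are compatible with AWS Glue, you must, as an administrator, grant access to AWS CloudFormation and to the AWS services and actions on which it depends. To grant permissions to create AWS CloudFormation resources, attach the following policy to the users who work with AWS CloudFormation: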
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "cloudformation:*"
          ],
          "Resource": "*"
        }
      ]
    }
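The following table lists the actions that an AWS CloudFormation template can perform on your behalf. It includes links to information about the AWS resource types and their property types that you can add to an AWS CloudFormation template.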
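AWS Glue resource | AWS CloudFormation resource type | AWS Glue samples |
---|---|---|
Classifier | AWS::Glue::Classifier | Grok classifier, JSON classifier, XML classifier |
Connection | AWS::Glue::Connection | MySQL connection |
Crawler | AWS::Glue::Crawler | Amazon S3 crawler, MySQL crawler |
Database | AWS::Glue::Database | Empty database, database with tables |
Development endpoint | AWS::Glue::DevEndpoint | Development endpoint |
Job | AWS::Glue::Job | Amazon S3 job, JDBC job |
Machine learning transform | AWS::Glue::MLTransform | Machine learning transform |
Data quality ruleset | AWS::Glue::DataQualityRuleset | Data quality ruleset, data quality ruleset with EventBridge scheduler |
Partition | AWS::Glue::Partition | Partitions of a table |
Table | AWS::Glue::Table | Table in a database |
Trigger | AWS::Glue::Trigger | On-demand trigger, scheduled trigger, conditional trigger |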
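To get started, use the following sample templates and customize them with your own metadata. Then use the AWS CloudFormation console to create an AWS CloudFormation stack that adds the objects to AWS Glue and any associated services. Many fields in an AWS Glue object are optional. These templates illustrate the fields that are required or that are necessary for a working AWS Glue object.
An AWS CloudFormation template can be in either JSON or YAML format. These samples use YAML because it is easier to read. The samples contain comments (#) that describe the values defined in the templates.
AWS CloudFormation templates can include a Parameters section. You can change this section in the sample text or when you submit the YAML file to the AWS CloudFormation console to create a stack. The Resources section of the template contains the definitions of AWS Glue and related objects. AWS CloudFormation template syntax definitions might contain properties that include more detailed property syntax. Not all properties are necessarily required to create an AWS Glue object. These samples show example values for common properties used to create an AWS Glue object.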
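Sample AWS CloudFormation template for an AWS Glue database
An AWS Glue database in the Data Catalog contains metadata tables. The database has very few properties and can be created in the Data Catalog with an AWS CloudFormation template. The following sample template is provided to get you started and to illustrate the use of AWS CloudFormation stacks with AWS Glue. The only resource that the sample template creates is a database named cfn-mysampledatabase. You can change the name by editing the text of the sample or by changing the value on the AWS CloudFormation console when you submit the YAML.
The following shows example values for common properties used to create an AWS Glue database. For more information about the AWS CloudFormation database template for AWS Glue, see AWS::Glue::Database.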
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CloudFormation template in YAML to demonstrate creating a database named cfn-mysampledatabase
    # The metadata created in the Data Catalog points to the flights public S3 bucket
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
      CFNDatabaseName:
        Type: String
        Default: cfn-mysampledatabase

    # Resources section defines metadata for the Data Catalog
    Resources:
    # Create an AWS Glue database
      CFNDatabaseFlights:
        Type: AWS::Glue::Database
        Properties:
          # The database is created in the Data Catalog for your account
          CatalogId: !Ref AWS::AccountId
          DatabaseInput:
            # The name of the database is defined in the Parameters section above
            Name: !Ref CFNDatabaseName
            Description: Database to hold tables for flights data
            LocationUri: s3://crawler-public-us-east-1/flight/2016/csv/
            #Parameters: Leave AWS database parameters blank
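Sample AWS CloudFormation template for an AWS Glue database, table, and partition
An AWS Glue table contains the metadata that defines the structure and location of the data that you want to process with your ETL scripts. Within a table, you can define partitions to parallelize the processing of your data. A partition is a chunk of data that you define with a key. For example, using month as a key, all the data for January is contained in the same partition. In AWS Glue, databases can contain tables, and tables can contain partitions.
The following sample shows how to populate a database, a table, and partitions using an AWS CloudFormation template. The base data format is csv, delimited by a comma (,). Because a database must exist before it can contain a table, and a table must exist before partitions can be created, the template uses the DependsOn statement to define the dependency between these objects when they are created.
The values in this sample define a table that contains flight data from a publicly available Amazon S3 bucket. For illustration, only a few columns of the data and one partitioning key are defined. Four partitions are also defined in the Data Catalog. Some fields that describe the storage of the base data are shown in the StorageDescriptor fields.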
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CloudFormation template in YAML to demonstrate creating a database, a table, and partitions
    # The metadata created in the Data Catalog points to the flights public S3 bucket
    #
    # Parameters substituted in the Resources section
    # These parameters are names of the resources created in the Data Catalog
    Parameters:
      CFNDatabaseName:
        Type: String
        Default: cfn-database-flights-1
      CFNTableName1:
        Type: String
        Default: cfn-manual-table-flights-1
    # Resources to create metadata in the Data Catalog
    Resources:
    ###
    # Create an AWS Glue database
      CFNDatabaseFlights:
        Type: AWS::Glue::Database
        Properties:
          CatalogId: !Ref AWS::AccountId
          DatabaseInput:
            Name: !Ref CFNDatabaseName
            Description: Database to hold tables for flights data
    ###
    # Create an AWS Glue table
      CFNTableFlights:
        # Creating the table waits for the database to be created
        DependsOn: CFNDatabaseFlights
        Type: AWS::Glue::Table
        Properties:
          CatalogId: !Ref AWS::AccountId
          DatabaseName: !Ref CFNDatabaseName
          TableInput:
            Name: !Ref CFNTableName1
            Description: Define the first few columns of the flights table
            TableType: EXTERNAL_TABLE
            Parameters: { "classification": "csv" }
            # ViewExpandedText: String
            PartitionKeys:
            # Data is partitioned by month
            - Name: mon
              Type: bigint
            StorageDescriptor:
              OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              Columns:
              - Name: year
                Type: bigint
              - Name: quarter
                Type: bigint
              - Name: month
                Type: bigint
              - Name: day_of_month
                Type: bigint
              InputFormat: org.apache.hadoop.mapred.TextInputFormat
              Location: s3://crawler-public-us-east-1/flight/2016/csv/
              SerdeInfo:
                Parameters:
                  field.delim: ","
                SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    # Partition 1
    # Create an AWS Glue partition
      CFNPartitionMon1:
        DependsOn: CFNTableFlights
        Type: AWS::Glue::Partition
        Properties:
          CatalogId: !Ref AWS::AccountId
          DatabaseName: !Ref CFNDatabaseName
          TableName: !Ref CFNTableName1
          PartitionInput:
            Values:
            - 1
            StorageDescriptor:
              OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              Columns:
              - Name: mon
                Type: bigint
              InputFormat: org.apache.hadoop.mapred.TextInputFormat
              Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=1/
              SerdeInfo:
                Parameters:
                  field.delim: ","
                SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    # Partition 2
    # Create an AWS Glue partition
      CFNPartitionMon2:
        DependsOn: CFNTableFlights
        Type: AWS::Glue::Partition
        Properties:
          CatalogId: !Ref AWS::AccountId
          DatabaseName: !Ref CFNDatabaseName
          TableName: !Ref CFNTableName1
          PartitionInput:
            Values:
            - 2
            StorageDescriptor:
              OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              Columns:
              - Name: mon
                Type: bigint
              InputFormat: org.apache.hadoop.mapred.TextInputFormat
              Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=2/
              SerdeInfo:
                Parameters:
                  field.delim: ","
                SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    # Partition 3
    # Create an AWS Glue partition
      CFNPartitionMon3:
        DependsOn: CFNTableFlights
        Type: AWS::Glue::Partition
        Properties:
          CatalogId: !Ref AWS::AccountId
          DatabaseName: !Ref CFNDatabaseName
          TableName: !Ref CFNTableName1
          PartitionInput:
            Values:
            - 3
            StorageDescriptor:
              OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              Columns:
              - Name: mon
                Type: bigint
              InputFormat: org.apache.hadoop.mapred.TextInputFormat
              Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=3/
              SerdeInfo:
                Parameters:
                  field.delim: ","
                SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    # Partition 4
    # Create an AWS Glue partition
      CFNPartitionMon4:
        DependsOn: CFNTableFlights
        Type: AWS::Glue::Partition
        Properties:
          CatalogId: !Ref AWS::AccountId
          DatabaseName: !Ref CFNDatabaseName
          TableName: !Ref CFNTableName1
          PartitionInput:
            Values:
            - 4
            StorageDescriptor:
              OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              Columns:
              - Name: mon
                Type: bigint
              InputFormat: org.apache.hadoop.mapred.TextInputFormat
              Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=4/
              SerdeInfo:
                Parameters:
                  field.delim: ","
                SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
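The sample orders creation with explicit DependsOn statements. As a hedged alternative sketch: when a property refers to another resource with !Ref, AWS CloudFormation infers the dependency itself. This relies on Ref for an AWS::Glue::Database resource returning the database name, as the AWS::Glue::Database reference describes. The logical ID CFNTableFlightsImplicit and the table name are hypothetical, and the fragment assumes it sits under Resources next to a CFNDatabaseFlights database like the one in the sample.

```yaml
# Fragment of a Resources section (assumption: CFNDatabaseFlights is defined
# as in the sample template).
CFNTableFlightsImplicit:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    # Referencing the database resource (not the name parameter) lets
    # CloudFormation create the database first, without DependsOn.
    DatabaseName: !Ref CFNDatabaseFlights
    TableInput:
      Name: cfn-implicit-table-flights-1   # hypothetical table name
      TableType: EXTERNAL_TABLE
      StorageDescriptor:
        Columns:
          - Name: year
            Type: bigint
        Location: s3://crawler-public-us-east-1/flight/2016/csv/
```

Explicit DependsOn is still needed when the only link between two resources is a shared parameter value, which is the situation in the sample.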
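Sample AWS CloudFormation template for an AWS Glue grok classifier
An AWS Glue classifier determines the schema of your data. One type of custom classifier uses a grok pattern to match your data. If the pattern matches, the custom classifier is used to create your table's schema and to set classification to the value set in the classifier definition.
This sample creates a classifier that produces a schema with one column named message and sets classification to greedy.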
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a classifier
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The name of the classifier to be created
      CFNClassifierName:
        Type: String
        Default: cfn-classifier-grok-one-column-1
    #
    #
    # Resources section defines metadata for the Data Catalog
    Resources:
    # Create classifier that uses grok pattern to put all data in one column and classifies it as "greedy".
      CFNClassifierFlights:
        Type: AWS::Glue::Classifier
        Properties:
          GrokClassifier:
            #Grok classifier that puts all data in one column
            Name: !Ref CFNClassifierName
            Classification: greedy
            GrokPattern: "%{GREEDYDATA:message}"
            #CustomPatterns: none
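The sample leaves CustomPatterns unset. As a rough sketch of how that property could be used, the fragment below defines one named pattern and refers to it from GrokPattern. The classifier name, the classification value, and the FLIGHTCODE pattern are all hypothetical, and the sketch assumes the usual grok convention of one "NAME regex" definition per line.

```yaml
Resources:
  CFNClassifierCustomGrok:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        Name: cfn-classifier-grok-custom-1   # hypothetical name
        Classification: flightlog            # hypothetical classification value
        # Assumed format: a pattern name, a space, then the regular expression
        CustomPatterns: "FLIGHTCODE [A-Z]{2}[0-9]{1,4}"
        # Puts the matched code in one column and the rest of the line in another
        GrokPattern: "%{FLIGHTCODE:flight_code} %{GREEDYDATA:message}"
```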
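Sample AWS CloudFormation template for an AWS Glue JSON classifier
An AWS Glue classifier determines the schema of your data. One type of custom classifier uses a JsonPath string that defines the JSON data for the classifier to classify. AWS Glue supports a subset of the operators for JsonPath, as described in Writing JsonPath custom classifiers.
If the pattern matches, the custom classifier is used to create your table's schema.
This sample creates a classifier that produces a schema with each record taken from the Records3 array in an object.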
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a JSON classifier
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The name of the classifier to be created
      CFNClassifierName:
        Type: String
        Default: cfn-classifier-json-one-column-1
    #
    #
    # Resources section defines metadata for the Data Catalog
    Resources:
    # Create classifier that uses a JSON pattern.
      CFNClassifierFlights:
        Type: AWS::Glue::Classifier
        Properties:
          JSONClassifier:
            #JSON classifier
            Name: !Ref CFNClassifierName
            JsonPath: $.Records3[*]
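Sample AWS CloudFormation template for an AWS Glue XML classifier
An AWS Glue classifier determines the schema of your data. One type of custom classifier specifies an XML tag to designate the element that contains each record in the XML document being parsed. If the pattern matches, the custom classifier is used to create your table's schema and to set classification to the value set in the classifier definition.
This sample creates a classifier that produces a schema with each record taken from the Records tag (the RowTag value in the template) and sets the classification to XML.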
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating an XML classifier
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The name of the classifier to be created
      CFNClassifierName:
        Type: String
        Default: cfn-classifier-xml-one-column-1
    #
    #
    # Resources section defines metadata for the Data Catalog
    Resources:
    # Create classifier that uses the XML pattern and classifies it as "XML".
      CFNClassifierFlights:
        Type: AWS::Glue::Classifier
        Properties:
          XMLClassifier:
            #XML classifier
            Name: !Ref CFNClassifierName
            Classification: XML
            RowTag: <Records>
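Sample AWS CloudFormation template for an AWS Glue crawler for Amazon S3
An AWS Glue crawler creates metadata tables in your Data Catalog that correspond to your data. You can then use these table definitions as sources and targets in your ETL jobs.
This sample creates a crawler, the required IAM role, and an AWS Glue database in the Data Catalog. When the crawler runs, it assumes the IAM role and creates a table in the database for the public flights data. The table is created with the prefix "cfn_sample_1_". The IAM role created by this template allows global permissions; you might want to create a custom role. This crawler defines no custom classifiers, so AWS Glue built-in classifiers are used by default.
When you submit this sample to the AWS CloudFormation console, you must confirm that you want to create the IAM role.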
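The role in this sample grants every action on every resource, which is convenient for a demonstration only. A narrower role might look like the following sketch. It assumes that the AWS managed policy AWSGlueServiceRole plus read access to the public flights bucket is enough for this crawler, which has not been verified here; the logical ID CFNRoleFlightsScoped and the policy name are hypothetical.

```yaml
# Fragment of a Resources section; a narrower stand-in for CFNRoleFlights.
CFNRoleFlightsScoped:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: "Allow"
          Principal:
            Service:
              - "glue.amazonaws.com"
          Action:
            - "sts:AssumeRole"
    Path: "/"
    # AWS managed policy with the baseline permissions for AWS Glue
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
    Policies:
      - PolicyName: "read-flights-data"   # hypothetical policy name
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            # Read only the objects under the crawled prefix
            - Effect: "Allow"
              Action:
                - "s3:GetObject"
              Resource: "arn:aws:s3:::crawler-public-us-east-1/flight/2016/csv/*"
            # Allow listing the bucket so the crawler can find those objects
            - Effect: "Allow"
              Action:
                - "s3:ListBucket"
              Resource: "arn:aws:s3:::crawler-public-us-east-1"
```

To try it, the crawler's Role property would point at this resource with !GetAtt CFNRoleFlightsScoped.Arn.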
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a crawler
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The name of the crawler to be created
      CFNCrawlerName:
        Type: String
        Default: cfn-crawler-flights-1
      CFNDatabaseName:
        Type: String
        Default: cfn-database-flights-1
      CFNTablePrefixName:
        Type: String
        Default: cfn_sample_1_
    #
    #
    # Resources section defines metadata for the Data Catalog
    Resources:
    #Create IAM Role assumed by the crawler. For demonstration, this role is given all permissions.
      CFNRoleFlights:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Principal:
                  Service:
                    - "glue.amazonaws.com"
                Action:
                  - "sts:AssumeRole"
          Path: "/"
          Policies:
            - PolicyName: "root"
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Effect: "Allow"
                    Action: "*"
                    Resource: "*"
    # Create a database to contain tables created by the crawler
      CFNDatabaseFlights:
        Type: AWS::Glue::Database
        Properties:
          CatalogId: !Ref AWS::AccountId
          DatabaseInput:
            Name: !Ref CFNDatabaseName
            Description: "AWS Glue container to hold metadata tables for the flights crawler"
    #Create a crawler to crawl the flights data on a public S3 bucket
      CFNCrawlerFlights:
        Type: AWS::Glue::Crawler
        Properties:
          Name: !Ref CFNCrawlerName
          Role: !GetAtt CFNRoleFlights.Arn
          #Classifiers: none, use the default classifier
          Description: AWS Glue crawler to crawl flights data
          #Schedule: none, use default run-on-demand
          DatabaseName: !Ref CFNDatabaseName
          Targets:
            S3Targets:
              # Public S3 bucket with the flights data
              - Path: "s3://crawler-public-us-east-1/flight/2016/csv"
          TablePrefix: !Ref CFNTablePrefixName
          SchemaChangePolicy:
            UpdateBehavior: "UPDATE_IN_DATABASE"
            DeleteBehavior: "LOG"
          Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
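Sample AWS CloudFormation template for an AWS Glue connection
An AWS Glue connection in the Data Catalog contains the JDBC and network information that is required to connect to a JDBC database. This information is used when you connect to a JDBC database to crawl it or to run ETL jobs.
This sample creates a connection to an Amazon RDS MySQL database named devdb. When this connection is used, an IAM role, database credentials, and network connection values must also be supplied. See the details of the required fields in the template.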
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a connection
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The name of the connection to be created
      CFNConnectionName:
        Type: String
        Default: cfn-connection-mysql-flights-1
      CFNJDBCString:
        Type: String
        Default: "jdbc:mysql://xxx-mysql.yyyyyyyyyyyyyy.us-east-1.rds.amazonaws.com:3306/devdb"
      CFNJDBCUser:
        Type: String
        Default: "master"
      CFNJDBCPassword:
        Type: String
        Default: "12345678"
        NoEcho: true
    #
    #
    # Resources section defines metadata for the Data Catalog
    Resources:
      CFNConnectionMySQL:
        Type: AWS::Glue::Connection
        Properties:
          CatalogId: !Ref AWS::AccountId
          ConnectionInput:
            Description: "Connect to MySQL database."
            ConnectionType: "JDBC"
            #MatchCriteria: none
            PhysicalConnectionRequirements:
              AvailabilityZone: "us-east-1d"
              SecurityGroupIdList:
                - "sg-7d52b812"
              SubnetId: "subnet-84f326ee"
            ConnectionProperties:
              JDBC_CONNECTION_URL: !Ref CFNJDBCString
              USERNAME: !Ref CFNJDBCUser
              PASSWORD: !Ref CFNJDBCPassword
            Name: !Ref CFNConnectionName
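The sample passes the database password as a template parameter with a default value. One possible alternative, sketched below, is a CloudFormation dynamic reference that resolves the credentials from AWS Secrets Manager when the stack is created. The secret name MyGlueDbSecret and its username and password keys are hypothetical, the secret is assumed to exist already, and the behavior of dynamic references inside ConnectionProperties has not been tested here.

```yaml
# Fragment of a Resources section (assumes the CFNJDBCString and
# CFNConnectionName parameters from the sample template).
CFNConnectionMySQLWithSecret:
  Type: AWS::Glue::Connection
  Properties:
    CatalogId: !Ref AWS::AccountId
    ConnectionInput:
      Name: !Ref CFNConnectionName
      ConnectionType: "JDBC"
      ConnectionProperties:
        JDBC_CONNECTION_URL: !Ref CFNJDBCString
        # MyGlueDbSecret is a hypothetical secret holding a JSON document
        # with "username" and "password" keys.
        USERNAME: '{{resolve:secretsmanager:MyGlueDbSecret:SecretString:username}}'
        PASSWORD: '{{resolve:secretsmanager:MyGlueDbSecret:SecretString:password}}'
```

With this approach the credentials no longer appear as parameter defaults in the template text.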
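Sample AWS CloudFormation template for an AWS Glue crawler for JDBC
An AWS Glue crawler creates metadata tables in your Data Catalog that correspond to your data. You can then use these table definitions as sources and targets in your ETL jobs.
This sample creates a crawler, the required IAM role, and an AWS Glue database in the Data Catalog. When the crawler runs, it assumes the IAM role and creates a table in the database for the public flights data that has been stored in a MySQL database. The table is created with the prefix "cfn_jdbc_1_". The IAM role created by this template allows global permissions; you might want to create a custom role. Custom classifiers can't be defined for JDBC data, so AWS Glue built-in classifiers are used by default.
When you submit this sample to the AWS CloudFormation console, you must confirm that you want to create the IAM role.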
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a crawler
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The name of the crawler to be created
      CFNCrawlerName:
        Type: String
        Default: cfn-crawler-jdbc-flights-1
    # The name of the database to be created to contain tables
      CFNDatabaseName:
        Type: String
        Default: cfn-database-jdbc-flights-1
    # The prefix for all tables crawled and created
      CFNTablePrefixName:
        Type: String
        Default: cfn_jdbc_1_
    # The name of the existing connection to the MySQL database
      CFNConnectionName:
        Type: String
        Default: cfn-connection-mysql-flights-1
    # The name of the JDBC path (database/schema/table) with wildcard (%) to crawl
      CFNJDBCPath:
        Type: String
        Default: saldev/%
    #
    #
    # Resources section defines metadata for the Data Catalog
    Resources:
    #Create IAM Role assumed by the crawler. For demonstration, this role is given all permissions.
      CFNRoleFlights:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Principal:
                  Service:
                    - "glue.amazonaws.com"
                Action:
                  - "sts:AssumeRole"
          Path: "/"
          Policies:
            - PolicyName: "root"
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Effect: "Allow"
                    Action: "*"
                    Resource: "*"
    # Create a database to contain tables created by the crawler
      CFNDatabaseFlights:
        Type: AWS::Glue::Database
        Properties:
          CatalogId: !Ref AWS::AccountId
          DatabaseInput:
            Name: !Ref CFNDatabaseName
            Description: "AWS Glue container to hold metadata tables for the flights crawler"
    #Create a crawler to crawl the flights data in MySQL database
      CFNCrawlerFlights:
        Type: AWS::Glue::Crawler
        Properties:
          Name: !Ref CFNCrawlerName
          Role: !GetAtt CFNRoleFlights.Arn
          #Classifiers: none, use the default classifier
          Description: AWS Glue crawler to crawl flights data
          #Schedule: none, use default run-on-demand
          DatabaseName: !Ref CFNDatabaseName
          Targets:
            JdbcTargets:
              # JDBC MySQL database with the flights data
              - ConnectionName: !Ref CFNConnectionName
                Path: !Ref CFNJDBCPath
                #Exclusions: none
          TablePrefix: !Ref CFNTablePrefixName
          SchemaChangePolicy:
            UpdateBehavior: "UPDATE_IN_DATABASE"
            DeleteBehavior: "LOG"
          Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
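Sample AWS CloudFormation template for an AWS Glue job for Amazon S3 to Amazon S3
An AWS Glue job in the Data Catalog contains the parameter values that are required to run a script in AWS Glue.
This sample creates a job that reads flight data in csv format from an Amazon S3 bucket and writes it to an Amazon S3 Parquet file. The script that this job runs must already exist. You can generate an ETL script for your environment with the AWS Glue console. When the job runs, an IAM role with the correct permissions must also be provided.
Common parameter values are shown in the template. For example, AllocatedCapacity (DPUs) is set to 5.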
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a job using the public flights S3 table in a public bucket
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The name of the job to be created
      CFNJobName:
        Type: String
        Default: cfn-job-S3-to-S3-2
    # The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
      CFNIAMRoleName:
        Type: String
        Default: AWSGlueServiceRoleGA
    # The S3 path where the script for this job is located
      CFNScriptLocation:
        Type: String
        Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-test2
    #
    #
    # Resources section defines metadata for the Data Catalog
    Resources:
    # Create job to run script which accesses flightscsv table and write to S3 file as parquet.
    # The script already exists and is called by this job
      CFNJobFlights:
        Type: AWS::Glue::Job
        Properties:
          Role: !Ref CFNIAMRoleName
          #DefaultArguments: JSON object
          # If script written in Scala, then set DefaultArguments={'--job-language': 'scala', '--class': 'your scala class'}
          #Connections: No connection needed for S3 to S3 job
          #  ConnectionsList
          #MaxRetries: Double
          Description: Job created with CloudFormation
          #LogUri: String
          Command:
            Name: glueetl
            ScriptLocation: !Ref CFNScriptLocation
            # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"
            # script uses temp directory from job definition if required (temp directory not used S3 to S3)
            # script defines target for output as s3://aws-glue-target/sal
          AllocatedCapacity: 5
          ExecutionProperty:
            MaxConcurrentRuns: 1
          Name: !Ref CFNJobName
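Sample AWS CloudFormation template for an AWS Glue job for JDBC to Amazon S3
An AWS Glue job in the Data Catalog contains the parameter values that are required to run a script in AWS Glue.
This sample creates a job that reads flight data from a MySQL JDBC database, as defined by the connection named cfn-connection-mysql-flights-1, and writes it to an Amazon S3 Parquet file. The script that this job runs must already exist. You can generate an ETL script for your environment with the AWS Glue console. When the job runs, an IAM role with the correct permissions must also be provided.
Common parameter values are shown in the template. For example, AllocatedCapacity (DPUs) is set to 5.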
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a job using a MySQL JDBC DB with the flights data to an S3 file
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The name of the job to be created
      CFNJobName:
        Type: String
        Default: cfn-job-JDBC-to-S3-1
    # The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
      CFNIAMRoleName:
        Type: String
        Default: AWSGlueServiceRoleGA
    # The S3 path where the script for this job is located
      CFNScriptLocation:
        Type: String
        Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-dec4a
    # The name of the connection used for JDBC data source
      CFNConnectionName:
        Type: String
        Default: cfn-connection-mysql-flights-1
    #
    #
    # Resources section defines metadata for the Data Catalog
    Resources:
    # Create job to run script which accesses JDBC flights table via a connection and write to S3 file as parquet.
    # The script already exists and is called by this job
      CFNJobFlights:
        Type: AWS::Glue::Job
        Properties:
          Role: !Ref CFNIAMRoleName
          #DefaultArguments: JSON object
          # For example, if required by script, set temporary directory as DefaultArguments={'--TempDir': 's3://aws-glue-temporary-xyc/sal'}
          Connections:
            Connections:
              - !Ref CFNConnectionName
          #MaxRetries: Double
          Description: Job created with CloudFormation using existing script
          #LogUri: String
          Command:
            Name: glueetl
            ScriptLocation: !Ref CFNScriptLocation
            # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"
            # if required, script defines temp directory as argument TempDir and used in script like redshift_tmp_dir = args["TempDir"]
            # script defines target for output as s3://aws-glue-target/sal
          AllocatedCapacity: 5
          ExecutionProperty:
            MaxConcurrentRuns: 1
          Name: !Ref CFNJobName
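The DefaultArguments property is only mentioned in comments in the job samples. A minimal sketch of setting it, assuming the script reads a --TempDir argument as the sample comment describes, could look like this. The logical ID and the job name are hypothetical; the temporary directory path is the one quoted in the sample comment.

```yaml
# Fragment of a Resources section (assumes the CFNIAMRoleName,
# CFNScriptLocation, and CFNConnectionName parameters from the sample).
CFNJobFlightsWithArgs:
  Type: AWS::Glue::Job
  Properties:
    Name: cfn-job-JDBC-to-S3-args-1   # hypothetical job name
    Role: !Ref CFNIAMRoleName
    Command:
      Name: glueetl
      ScriptLocation: !Ref CFNScriptLocation
    # Key-value pairs passed to every run of the job; keys keep their
    # leading dashes, so they are quoted.
    DefaultArguments:
      "--TempDir": "s3://aws-glue-temporary-xyc/sal"
    Connections:
      Connections:
        - !Ref CFNConnectionName
```

For a Scala script, the same mapping would carry the "--job-language" and "--class" keys that the Amazon S3 job sample mentions.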
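Sample AWS CloudFormation template for an AWS Glue on-demand trigger
An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. An on-demand trigger fires when you enable it.
This sample creates an on-demand trigger that starts one job named cfn-job-S3-to-S3-1.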
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating an on-demand trigger
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The existing job to be started by this trigger
      CFNJobName:
        Type: String
        Default: cfn-job-S3-to-S3-1
    # The name of the trigger to be created
      CFNTriggerName:
        Type: String
        Default: cfn-trigger-ondemand-flights-1
    #
    # Resources section defines metadata for the Data Catalog
    # Sample CFN YAML to demonstrate creating an on-demand trigger for a job
    Resources:
    # Create trigger to run an existing job (CFNJobName) on an on-demand schedule.
      CFNTriggerSample:
        Type: AWS::Glue::Trigger
        Properties:
          Name:
            Ref: CFNTriggerName
          Description: Trigger created with CloudFormation
          Type: ON_DEMAND
          Actions:
            - JobName: !Ref CFNJobName
            # Arguments: JSON object
          #Schedule:
          #Predicate:
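Sample AWS CloudFormation template for an AWS Glue scheduled trigger
An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. A scheduled trigger fires when it is enabled and its cron schedule comes due.
This sample creates a scheduled trigger that starts one job named cfn-job-S3-to-S3-1. The schedule is a cron expression that runs the job every 10 minutes on weekdays.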
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a scheduled trigger
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The existing job to be started by this trigger
      CFNJobName:
        Type: String
        Default: cfn-job-S3-to-S3-1
    # The name of the trigger to be created
      CFNTriggerName:
        Type: String
        Default: cfn-trigger-scheduled-flights-1
    #
    # Resources section defines metadata for the Data Catalog
    # Sample CFN YAML to demonstrate creating a scheduled trigger for a job
    #
    Resources:
    # Create trigger to run an existing job (CFNJobName) on a cron schedule.
      TriggerSample1CFN:
        Type: AWS::Glue::Trigger
        Properties:
          Name:
            Ref: CFNTriggerName
          Description: Trigger created with CloudFormation
          Type: SCHEDULED
          Actions:
            - JobName: !Ref CFNJobName
            # Arguments: JSON object
          # # Run the trigger every 10 minutes on Monday to Friday
          Schedule: cron(0/10 * ? * MON-FRI *)
          #Predicate:
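The cron expression in the sample has six fields. The following sketch annotates them and adds the StartOnCreation property of AWS::Glue::Trigger, on the assumption that a scheduled trigger created through AWS CloudFormation is otherwise created but not activated. The logical ID is hypothetical.

```yaml
# Fragment of a Resources section (assumes the CFNJobName and CFNTriggerName
# parameters from the sample template).
CFNTriggerScheduledStarted:
  Type: AWS::Glue::Trigger
  Properties:
    Name: !Ref CFNTriggerName
    Description: Scheduled trigger that is activated when it is created
    Type: SCHEDULED
    # cron(minutes hours day-of-month month day-of-week year)
    # 0/10    = every 10 minutes, starting at minute 0
    # ?       = no specific day of the month
    # MON-FRI = Monday through Friday
    Schedule: cron(0/10 * ? * MON-FRI *)
    # Assumption: without this property the trigger has to be activated
    # separately after the stack is created.
    StartOnCreation: true
    Actions:
      - JobName: !Ref CFNJobName
```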
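Sample AWS CloudFormation template for an AWS Glue conditional trigger
An AWS Glue trigger in the Data Catalog contains the parameter values that are required to start a job run when the trigger fires. A conditional trigger fires when it is enabled and its conditions are met, such as a job completing successfully.
This sample creates a conditional trigger that starts one job named cfn-job-S3-to-S3-1. This job starts when the job named cfn-job-S3-to-S3-2 completes successfully.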
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a conditional trigger for a job, which starts when another job completes
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The existing job to be started by this trigger
      CFNJobName:
        Type: String
        Default: cfn-job-S3-to-S3-1
    # The existing job that when it finishes causes trigger to fire
      CFNJobName2:
        Type: String
        Default: cfn-job-S3-to-S3-2
    # The name of the trigger to be created
      CFNTriggerName:
        Type: String
        Default: cfn-trigger-conditional-1
    #
    Resources:
    # Create trigger to run an existing job (CFNJobName) when another job completes (CFNJobName2).
      CFNTriggerSample:
        Type: AWS::Glue::Trigger
        Properties:
          Name:
            Ref: CFNTriggerName
          Description: Trigger created with CloudFormation
          Type: CONDITIONAL
          Actions:
            - JobName: !Ref CFNJobName
            # Arguments: JSON object
          #Schedule: none
          Predicate:
            #Value for Logical is required if more than 1 job listed in Conditions
            Logical: AND
            Conditions:
              - LogicalOperator: EQUALS
                JobName: !Ref CFNJobName2
                State: SUCCEEDED
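The comment in the sample notes that Logical is required when Conditions lists more than one job. A sketch of that case follows, where the trigger waits for two jobs to succeed. CFNJobName3 is a hypothetical additional parameter, and the logical ID and trigger name are hypothetical as well.

```yaml
# Fragment of a Resources section (assumes the CFNJobName and CFNJobName2
# parameters from the sample, plus a hypothetical CFNJobName3 parameter).
CFNTriggerTwoConditions:
  Type: AWS::Glue::Trigger
  Properties:
    Name: cfn-trigger-conditional-2   # hypothetical trigger name
    Type: CONDITIONAL
    Actions:
      - JobName: !Ref CFNJobName
    Predicate:
      # AND: fire only after every listed condition is met
      Logical: AND
      Conditions:
        - LogicalOperator: EQUALS
          JobName: !Ref CFNJobName2
          State: SUCCEEDED
        - LogicalOperator: EQUALS
          JobName: !Ref CFNJobName3
          State: SUCCEEDED
```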
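Sample AWS CloudFormation template for an AWS Glue machine learning transform
An AWS Glue machine learning transform is a custom transform that you use to cleanse your data. There is currently one available transform, named FindMatches. The FindMatches transform lets you identify duplicate or matching records in your dataset, even when the records have no common unique identifier and no fields match exactly.
This sample creates a machine learning transform. For more information about the parameters that you need to create a machine learning transform, see Matching records with AWS Lake Formation FindMatches.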
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a machine learning transform
    #
    # Resources section defines metadata for the machine learning transform
    Resources:
      MyMLTransform:
        Type: "AWS::Glue::MLTransform"
        Condition: "isGlueMLGARegion"
        Properties:
          Name: !Sub "MyTransform"
          Description: "The bestest transform ever"
          Role: !ImportValue MyMLTransformUserRole
          GlueVersion: "1.0"
          WorkerType: "Standard"
          NumberOfWorkers: 5
          Timeout: 120
          MaxRetries: 1
          InputRecordTables:
            GlueTables:
              - DatabaseName: !ImportValue MyMLTransformDatabase
                TableName: !ImportValue MyMLTransformTable
          TransformParameters:
            TransformType: "FIND_MATCHES"
            FindMatchesParameters:
              PrimaryKeyColumnName: "testcolumn"
              PrecisionRecallTradeoff: 0.5
              AccuracyCostTradeoff: 0.5
              EnforceProvidedLabels: True
          Tags:
            key1: "value1"
            key2: "value2"
          TransformEncryption:
            TaskRunSecurityConfigurationName: !ImportValue MyMLTransformSecurityConfiguration
            MLUserDataEncryption:
              MLUserDataEncryptionMode: "SSE-KMS"
              KmsKeyId: !ImportValue MyMLTransformEncryptionKey
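The transform sample refers to a condition named isGlueMLGARegion but does not define it, so the template needs a Conditions section before it can be submitted. The sketch below shows one way such a condition might be written; the list of Regions is purely illustrative and is not taken from the source.

```yaml
# Top-level section of the template, alongside Resources.
Conditions:
  # True when the stack is created in one of the listed Regions.
  # The Regions below are placeholders; substitute the Regions where
  # you intend to create the transform.
  isGlueMLGARegion: !Or
    - !Equals [!Ref "AWS::Region", us-east-1]
    - !Equals [!Ref "AWS::Region", us-west-2]
```

The sample also imports several values (for example MyMLTransformUserRole) with !ImportValue, so those exports must already exist in another stack.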
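Sample AWS CloudFormation template for an AWS Glue Data Quality ruleset
An AWS Glue Data Quality ruleset contains rules that can be evaluated on a table in the Data Catalog. After the ruleset is placed on your target table, you can go into the Data Catalog and run an evaluation that checks your data against the rules in the ruleset. These rules can range from evaluating row counts to evaluating the referential integrity of your data.
The following sample is a CloudFormation template that creates a ruleset with a variety of rules on the specified target table.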
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a DataQualityRuleset
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
      # The name of the ruleset to be created
      RulesetName:
        Type: String
        Default: "CFNRulesetName"
      RulesetDescription:
        Type: String
        Default: "CFN DataQualityRuleset"
      # Rules that will be associated with this ruleset
      Rules:
        Type: String
        Default: 'Rules = [ RowCount > 100, IsUnique "id", IsComplete "nametype" ]'
      # Name of database and table within Data Catalog which the ruleset will
      # be applied to
      DatabaseName:
        Type: String
        Default: "ExampleDatabaseName"
      TableName:
        Type: String
        Default: "ExampleTableName"

    # Resources section defines metadata for the Data Catalog
    Resources:
      # Creates a Data Quality ruleset under specified rules
      DQRuleset:
        Type: AWS::Glue::DataQualityRuleset
        Properties:
          Name: !Ref RulesetName
          Description: !Ref RulesetDescription
          # The String within rules must be formatted in DQDL, a language
          # used specifically to make rules
          Ruleset: !Ref Rules
          # The targeted table must exist within Data Catalog alongside
          # the correct database
          TargetTable:
            DatabaseName: !Ref DatabaseName
            TableName: !Ref TableName
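Sample AWS CloudFormation template for an AWS Glue Data Quality ruleset with EventBridge scheduler
An AWS Glue Data Quality ruleset contains rules that can be evaluated on a table in the Data Catalog. After the ruleset is placed on your target table, you can go into the Data Catalog and run an evaluation that checks your data against the rules in the ruleset. Instead of going into the Data Catalog manually to evaluate the ruleset, you can add an EventBridge scheduler to the CloudFormation template to schedule these ruleset evaluations for you at a fixed interval.
The following sample is a CloudFormation template that creates a Data Quality ruleset and an EventBridge scheduler that evaluates that ruleset every five minutes.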
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a DataQualityRuleset
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
      # The name of the ruleset to be created
      RulesetName:
        Type: String
        Default: "CFNRulesetName"
      # Rules that will be associated with this Ruleset
      Rules:
        Type: String
        Default: 'Rules = [ RowCount > 100, IsUnique "id", IsComplete "nametype" ]'
      # The name of the Schedule to be created
      ScheduleName:
        Type: String
        Default: "ScheduleDQRulsetEvaluation"
      # This expression determines the rate at which the Schedule will evaluate
      # your data using the above ruleset
      ScheduleRate:
        Type: String
        Default: "rate(5 minutes)"
      # The request that is sent must match the details of the Data Quality ruleset
      ScheduleRequest:
        Type: String
        Default: '{ "DataSource": { "GlueTable": { "DatabaseName": "ExampleDatabaseName", "TableName": "ExampleTableName" } }, "Role": "role/AWSGlueServiceRoleDefault", "RulesetNames": [ "CFNRulesetName" ] }'

    # Resources section defines metadata for the Data Catalog
    Resources:
      # Creates a Data Quality ruleset under specified rules
      DQRuleset:
        Type: AWS::Glue::DataQualityRuleset
        Properties:
          Name: !Ref RulesetName
          Description: "CFN DataQualityRuleset"
          # The String within rules must be formatted in DQDL, a language
          # used specifically to make rules
          Ruleset: !Ref Rules
          # The targeted table must exist within Data Catalog alongside
          # the correct database
          TargetTable:
            DatabaseName: "ExampleDatabaseName"
            TableName: "ExampleTableName"
      # Create a Scheduler to schedule evaluation runs on the above ruleset
      ScheduleDQEval:
        Type: AWS::Scheduler::Schedule
        Properties:
          Name: !Ref ScheduleName
          Description: "Schedule DataQualityRuleset Evaluations"
          FlexibleTimeWindow:
            Mode: "OFF"
          ScheduleExpression: !Ref ScheduleRate
          ScheduleExpressionTimezone: "America/New_York"
          State: "ENABLED"
          Target:
            # The ARN is the API that will be run, since we want to evaluate our ruleset
            # we want this specific ARN
            Arn: "arn:aws:scheduler:::aws-sdk:glue:startDataQualityRulesetEvaluationRun"
            # Your RoleArn must have approval to schedule
            RoleArn: "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDefault"
            # This is the request that is sent to the Arn. It comes from the
            # ScheduleRequest parameter so that it names the ruleset and table above
            Input: !Ref ScheduleRequest
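Sample AWS CloudFormation template for an AWS Glue development endpoint
An AWS Glue development endpoint is an environment that you can use to develop and test your AWS Glue scripts.
This sample creates a development endpoint with the minimal network parameter values required to create it successfully. For more information about the parameters that you need to set up a development endpoint, see Setting up networking for development for AWS Glue.
To create a development endpoint, you provide an existing IAM role ARN (Amazon Resource Name). If you plan to create a notebook server on the development endpoint, supply a valid RSA public key and keep the corresponding private key available.
Note
You manage any notebook server that you create and associate with a development endpoint. Therefore, if you delete the development endpoint, you must also delete the AWS CloudFormation stack on the AWS CloudFormation console to delete the notebook server.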
    ---
    AWSTemplateFormatVersion: '2010-09-09'
    # Sample CFN YAML to demonstrate creating a development endpoint
    #
    # Parameters section contains names that are substituted in the Resources section
    # These parameters are the names of the resources created in the Data Catalog
    Parameters:
    # The name of the development endpoint to be created
      CFNEndpointName:
        Type: String
        Default: cfn-devendpoint-1
    # The ARN of the existing IAM role that the development endpoint uses
      CFNIAMRoleArn:
        Type: String
        Default: arn:aws:iam::123456789012:role/AWSGlueServiceRoleGA
    #
    #
    # Resources section defines metadata for the Data Catalog
    Resources:
      CFNDevEndpoint:
        Type: AWS::Glue::DevEndpoint
        Properties:
          EndpointName: !Ref CFNEndpointName
          #ExtraJarsS3Path: String
          #ExtraPythonLibsS3Path: String
          NumberOfNodes: 5
          PublicKey: ssh-rsa public.....key myuserid-key
          RoleArn: !Ref CFNIAMRoleArn
          SecurityGroupIds:
            - sg-64986c0b
          SubnetId: subnet-c67cccac