AWS::Glue::Crawler - AWS CloudFormation

AWS::Glue::Crawler

The AWS::Glue::Crawler resource specifies an AWS Glue crawler. For more information, see Cataloging Tables with a Crawler and Crawler Structure in the AWS Glue Developer Guide.

Syntax

To declare this entity in your AWS CloudFormation template, use the following syntax:

JSON

{
  "Type" : "AWS::Glue::Crawler",
  "Properties" : {
      "Classifiers" : [ String, ... ],
      "Configuration" : String,
      "CrawlerSecurityConfiguration" : String,
      "DatabaseName" : String,
      "Description" : String,
      "LakeFormationConfiguration" : LakeFormationConfiguration,
      "Name" : String,
      "RecrawlPolicy" : RecrawlPolicy,
      "Role" : String,
      "Schedule" : Schedule,
      "SchemaChangePolicy" : SchemaChangePolicy,
      "TablePrefix" : String,
      "Tags" : [ Tag, ... ],
      "Targets" : Targets
    }
}

Properties

Classifiers

A list of UTF-8 strings that specify the names of custom classifiers that are associated with the crawler.

Required: No

Type: Array of String

Update requires: No interruption

Configuration

Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.

Required: No

Type: String

Update requires: No interruption
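Because Configuration is a JSON document carried inside a string, hand-written escaping is easy to get wrong. The following is a minimal sketch in Python, assuming the template is generated programmatically. The helper name build_crawler_configuration is hypothetical, and the keys are copied from the Crawler Configuration example in this topic rather than from a complete list of supported settings.

```python
import json


def build_crawler_configuration(partition_behavior="InheritFromTable",
                                table_behavior="MergeNewColumns"):
    """Return the crawler Configuration as a JSON-encoded string.

    Hypothetical helper: the keys mirror the Crawler Configuration example
    in this topic; other settings are not covered here.
    """
    config = {
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": partition_behavior},
            "Tables": {"AddOrUpdateBehavior": table_behavior},
        },
    }
    # json.dumps handles quoting, so no manual backslash escaping is needed.
    return json.dumps(config, separators=(",", ":"))


# The string goes into the Configuration property unchanged.
properties = {"Configuration": build_crawler_configuration()}
print(json.dumps(properties, indent=2))
```

Serializing the enclosing template with json.dumps a second time produces the backslash-escaped form shown in the JSON example.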

CrawlerSecurityConfiguration

The name of the SecurityConfiguration structure to be used by this crawler.

Required: No

Type: String

Minimum: 0

Maximum: 128

Update requires: No interruption

DatabaseName

The name of the database in which the crawler's output is stored.

Required: No

Type: String

Update requires: No interruption

Description

A description of the crawler.

Required: No

Type: String

Pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]*

Minimum: 0

Maximum: 2048

Update requires: No interruption

LakeFormationConfiguration

Specifies whether the crawler should use AWS Lake Formation credentials instead of the IAM role credentials.

Required: No

Type: LakeFormationConfiguration

Update requires: No interruption

Name

The name of the crawler.

Required: No

Type: String

Pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*

Minimum: 1

Maximum: 255

Update requires: Replacement

RecrawlPolicy

A policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run.

Required: No

Type: RecrawlPolicy

Update requires: No interruption

Role

The Amazon Resource Name (ARN) of an IAM role that's used to access customer resources, such as Amazon Simple Storage Service (Amazon S3) data.

Required: Yes

Type: String

Update requires: No interruption

Schedule

For scheduled crawlers, the schedule on which the crawler runs.

Required: No

Type: Schedule

Update requires: No interruption
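As an illustration, the Schedule property can be assembled from a cron expression like the one used in the Create a crawler example. This is a sketch only: build_schedule is a hypothetical helper, and the six-field check reflects the cron format that AWS Glue uses for time-based schedules (minutes, hours, day of month, month, day of week, year). It does not validate the individual field values.

```python
def build_schedule(cron_fields):
    """Return a Schedule property from a space-separated cron field string.

    Hypothetical helper: only the number of fields is checked, not the
    contents of each field.
    """
    fields = cron_fields.split()
    if len(fields) != 6:
        raise ValueError(
            "expected 6 cron fields, got {}".format(len(fields))
        )
    return {"ScheduleExpression": "cron({})".format(" ".join(fields))}


# Same expression as the example in this topic: every 10 minutes, Mon-Fri.
schedule = build_schedule("0/10 * ? * MON-FRI *")
print(schedule)
```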

SchemaChangePolicy

The policy that specifies update and delete behaviors for the crawler. The policy tells the crawler what to do when it detects a change in a table that already exists in the customer's database at the time of the crawl. The SchemaChangePolicy does not affect whether or how new tables and partitions are added. New tables and partitions are always created, regardless of the SchemaChangePolicy on the crawler.

The SchemaChangePolicy consists of two components, UpdateBehavior and DeleteBehavior.

Required: No

Type: SchemaChangePolicy

Update requires: No interruption
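The two components can be checked before they are placed in a template. The following sketch assumes the value sets that the AWS Glue API defines for each behavior; build_schema_change_policy is a hypothetical helper, not part of any AWS library.

```python
# Allowed values, as defined for the AWS Glue SchemaChangePolicy structure.
UPDATE_BEHAVIORS = {"LOG", "UPDATE_IN_DATABASE"}
DELETE_BEHAVIORS = {"LOG", "DELETE_FROM_DATABASE", "DEPRECATE_IN_DATABASE"}


def build_schema_change_policy(update_behavior, delete_behavior):
    """Return a SchemaChangePolicy property after validating both values."""
    if update_behavior not in UPDATE_BEHAVIORS:
        raise ValueError("unknown UpdateBehavior: " + update_behavior)
    if delete_behavior not in DELETE_BEHAVIORS:
        raise ValueError("unknown DeleteBehavior: " + delete_behavior)
    return {
        "UpdateBehavior": update_behavior,
        "DeleteBehavior": delete_behavior,
    }


# Same combination as the Create a crawler example in this topic.
policy = build_schema_change_policy("UPDATE_IN_DATABASE", "LOG")
print(policy)
```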

TablePrefix

The prefix added to the names of tables that are created.

Required: No

Type: String

Minimum: 0

Maximum: 128

Update requires: No interruption
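The length limits listed for CrawlerSecurityConfiguration, Description, Name, and TablePrefix can be checked before a template is deployed. This is a rough sketch with a hypothetical helper, check_lengths. It assumes that Python's len, which counts Unicode code points, is a close enough approximation of how the service measures length, and it does not check the Pattern constraints.

```python
# (minimum, maximum) lengths as listed in the property descriptions above.
LENGTH_LIMITS = {
    "CrawlerSecurityConfiguration": (0, 128),
    "Description": (0, 2048),
    "Name": (1, 255),
    "TablePrefix": (0, 128),
}


def check_lengths(properties):
    """Return messages for string properties whose length is out of range."""
    problems = []
    for key, (minimum, maximum) in LENGTH_LIMITS.items():
        value = properties.get(key)
        if value is None:
            continue  # property not set, nothing to check
        if not minimum <= len(value) <= maximum:
            problems.append(
                "{}: length {} not in {}-{}".format(
                    key, len(value), minimum, maximum
                )
            )
    return problems


# Values taken from the Crawler Configuration example in this topic.
print(check_lengths({"Name": "my-crawler", "TablePrefix": "test-"}))
```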

Tags

The tags to use with this crawler.

Required: No

Type: Array of Tag

Update requires: No interruption

Targets

A collection of targets to crawl.

Required: Yes

Type: Targets

Update requires: No interruption
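For Amazon S3 targets, the collection is a list of entries under S3Targets, each with a Path. The sketch below builds that structure. build_s3_targets is a hypothetical helper, the bucket path is a placeholder, and Exclusions is the optional S3Target property that holds glob patterns to skip. Other target types are not covered.

```python
def build_s3_targets(paths, exclusions=None):
    """Return a Targets property containing one S3 target per path.

    Hypothetical helper: the same exclusion patterns are applied to
    every path.
    """
    targets = []
    for path in paths:
        target = {"Path": path}
        if exclusions:
            # Copy the list so each target owns its own exclusions.
            target["Exclusions"] = list(exclusions)
        targets.append(target)
    return {"S3Targets": targets}


# Placeholder bucket name and pattern, for illustration only.
targets = build_s3_targets(["s3://amzn-s3-demo-bucket/logs/"], ["**.tmp"])
print(targets)
```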

Return values

Ref

When you pass the logical ID of this resource to the intrinsic Ref function, Ref returns the crawler name.

For more information about using the Ref function, see Ref.
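One common use of this return value is to expose the crawler name as a stack output. The following sketch generates such an Outputs entry. The helper crawler_name_output and the output key CrawlerName are hypothetical, and MyCrawler2 is the logical ID used in the Create a crawler example.

```python
import json


def crawler_name_output(logical_id):
    """Return an Outputs section that exposes the crawler name via Ref."""
    return {
        "CrawlerName": {
            "Description": "Name of the AWS Glue crawler",
            # Ref on an AWS::Glue::Crawler resource returns the crawler name.
            "Value": {"Ref": logical_id},
        }
    }


template_fragment = {"Outputs": crawler_name_output("MyCrawler2")}
print(json.dumps(template_fragment, indent=2))
```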

Examples

Create a crawler

The following example creates a crawler for an Amazon S3 target.

JSON

{
    "Description": "AWS Glue crawler test",
    "Resources": {
        "MyRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                "AssumeRolePolicyDocument": {
                    "Version": "2012-10-17",
                    "Statement": [
                        {
                            "Effect": "Allow",
                            "Principal": {
                                "Service": [
                                    "glue.amazonaws.com"
                                ]
                            },
                            "Action": [
                                "sts:AssumeRole"
                            ]
                        }
                    ]
                },
                "Path": "/",
                "ManagedPolicyArns": ["arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"],
                "Policies": [
                    {
                        "PolicyName": "S3BucketAccessPolicy",
                        "PolicyDocument": {
                            "Version": "2012-10-17",
                            "Statement": [
                                {
                                    "Effect": "Allow",
                                    "Action": [
                                        "s3:GetObject",
                                        "s3:PutObject"
                                    ],
                                    "Resource": {
                                        "Fn::Join": [
                                            "",
                                            [
                                                { "Fn::GetAtt": ["MyS3Bucket", "Arn"] },
                                                "*"
                                            ]
                                        ]
                                    }
                                }
                            ]
                        }
                    }
                ]
            }
        },
        "MyDatabase": {
            "Type": "AWS::Glue::Database",
            "Properties": {
                "CatalogId": {
                    "Ref": "AWS::AccountId"
                },
                "DatabaseInput": {
                    "Name": "dbcrawler",
                    "Description": "TestDatabaseDescription",
                    "LocationUri": "TestLocationUri",
                    "Parameters": {
                        "key1": "value1",
                        "key2": "value2"
                    }
                }
            }
        },
        "MyClassifier": {
            "Type": "AWS::Glue::Classifier",
            "Properties": {
                "GrokClassifier": {
                    "Name": "CrawlerClassifier",
                    "Classification": "wikiData",
                    "GrokPattern": "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
                }
            }
        },
        "MyS3Bucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "BucketName": "crawlertesttarget",
                "AccessControl": "BucketOwnerFullControl"
            }
        },
        "MyCrawler2": {
            "Type": "AWS::Glue::Crawler",
            "Properties": {
                "Name": "testcrawler1",
                "Role": {
                    "Fn::GetAtt": [
                        "MyRole",
                        "Arn"
                    ]
                },
                "DatabaseName": {
                    "Ref": "MyDatabase"
                },
                "Classifiers": [
                    {
                        "Ref": "MyClassifier"
                    }
                ],
                "Targets": {
                    "S3Targets": [
                        {
                            "Path": {
                                "Ref": "MyS3Bucket"
                            }
                        }
                    ]
                },
                "SchemaChangePolicy": {
                    "UpdateBehavior": "UPDATE_IN_DATABASE",
                    "DeleteBehavior": "LOG"
                },
                "Tags": {
                    "key1": "value1"
                },
                "Schedule": {
                    "ScheduleExpression": "cron(0/10 * ? * MON-FRI *)"
                }
            }
        }
    }
}

YAML

Resources:
  MyRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      ManagedPolicyArns:
        ['arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole']
      Policies:
        - PolicyName: "S3BucketAccessPolicy"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action:
                  - "s3:GetObject"
                  - "s3:PutObject"
                Resource: !Join
                  - ''
                  - - !GetAtt MyS3Bucket.Arn
                    - "*"
  MyDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: "dbcrawler"
        Description: "TestDatabaseDescription"
        LocationUri: "TestLocationUri"
        Parameters:
          key1: "value1"
          key2: "value2"
  MyClassifier:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        Name: "CrawlerClassifier"
        Classification: "wikiData"
        GrokPattern: "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
  MyS3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: "crawlertesttarget"
      AccessControl: "BucketOwnerFullControl"
  MyCrawler2:
    Type: AWS::Glue::Crawler
    Properties:
      Name: "testcrawler1"
      Role: !GetAtt MyRole.Arn
      DatabaseName: !Ref MyDatabase
      Classifiers:
        - !Ref MyClassifier
      Targets:
        S3Targets:
          - Path: !Ref MyS3Bucket
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Tags:
        "Key1": "Value1"
      Schedule:
        ScheduleExpression: "cron(0/10 * ? * MON-FRI *)"

Crawler Configuration

The following example specifies a configuration that controls a crawler's behavior.

JSON

{
    "Type": "AWS::Glue::Crawler",
    "Properties": {
        "Role": "role1",
        "Classifiers": [],
        "Description": "example classifier",
        "SchemaChangePolicy": "",
        "Schedule": "Schedule",
        "DatabaseName": "test",
        "Targets": [],
        "TablePrefix": "test-",
        "Name": "my-crawler",
        "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
    }
}

YAML

Type: AWS::Glue::Crawler
Properties:
  Role: role1
  Classifiers:
    - ''
  Description: example classifier
  SchemaChangePolicy: ''
  Schedule: Schedule
  DatabaseName: test
  Targets:
    - ''
  TablePrefix: test-
  Name: my-crawler
  Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"