AWS::Glue::Crawler - AWS CloudFormation


The AWS::Glue::Crawler resource specifies an AWS Glue crawler. For more information, see Cataloging Tables with a Crawler and Crawler Structure in the AWS Glue Developer Guide.


To declare this entity in your AWS CloudFormation template, use the following syntax:


    {
      "Type" : "AWS::Glue::Crawler",
      "Properties" : {
        "Classifiers" : [ String, ... ],
        "Configuration" : String,
        "CrawlerSecurityConfiguration" : String,
        "DatabaseName" : String,
        "Description" : String,
        "LakeFormationConfiguration" : LakeFormationConfiguration,
        "Name" : String,
        "RecrawlPolicy" : RecrawlPolicy,
        "Role" : String,
        "Schedule" : Schedule,
        "SchemaChangePolicy" : SchemaChangePolicy,
        "TablePrefix" : String,
        "Tags" : [ Tag, ... ],
        "Targets" : Targets
      }
    }



Properties

Classifiers

A list of UTF-8 strings that specify the names of custom classifiers that are associated with the crawler.

Required: No

Type: Array of String

Update requires: No interruption


Configuration

Crawler configuration information. This versioned JSON string lets you specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.

Required: No

Type: String

Update requires: No interruption


CrawlerSecurityConfiguration

The name of the SecurityConfiguration structure to be used by this crawler.

Required: No

Type: String

Minimum: 0

Maximum: 128

Update requires: No interruption


DatabaseName

The name of the database in which the crawler's output is stored.

Required: No

Type: String

Update requires: No interruption


Description

A description of the crawler.

Required: No

Type: String

Pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]*

Minimum: 0

Maximum: 2048

Update requires: No interruption


LakeFormationConfiguration

Specifies whether the crawler should use AWS Lake Formation credentials instead of the IAM role credentials.

Required: No

Type: LakeFormationConfiguration

Update requires: No interruption
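
As a rough illustration of the LakeFormationConfiguration property, the fragment below sketches how it might appear inside a crawler's Properties section. It is not a complete template, and the account ID is a placeholder chosen for this sketch rather than a value from this page.

```yaml
# Fragment of an AWS::Glue::Crawler Properties section (sketch only).
LakeFormationConfiguration:
  # Ask the crawler to use Lake Formation credentials
  # instead of the IAM role credentials.
  UseLakeFormationCredentials: true
  # Only relevant for cross-account crawls; placeholder account ID.
  AccountId: "111122223333"
```

For a crawl within the same account as the target data, AccountId can likely be omitted.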


Name

The name of the crawler.

Required: No

Type: String

Pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*

Minimum: 1

Maximum: 255

Update requires: Replacement


RecrawlPolicy

A policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run.

Required: No

Type: RecrawlPolicy

Update requires: No interruption
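
To make the RecrawlPolicy choice concrete, here is a minimal sketch of a Properties fragment that crawls only newly added folders. Pairing it with a SchemaChangePolicy of LOG for both behaviors is an assumption based on how incremental crawls are generally described for AWS Glue; verify it against the AWS Glue Developer Guide before relying on it.

```yaml
# Fragment of an AWS::Glue::Crawler Properties section (sketch only).
RecrawlPolicy:
  # Crawl only folders added since the last run.
  # CRAWL_EVERYTHING would crawl the entire dataset again.
  RecrawlBehavior: CRAWL_NEW_FOLDERS_ONLY
# Assumption: incremental crawls expect both behaviors to be LOG.
SchemaChangePolicy:
  UpdateBehavior: LOG
  DeleteBehavior: LOG
```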


Role

The Amazon Resource Name (ARN) of an IAM role that's used to access customer resources, such as Amazon Simple Storage Service (Amazon S3) data.

Required: Yes

Type: String

Update requires: No interruption


Schedule

For scheduled crawlers, the schedule on which the crawler runs.

Required: No

Type: Schedule

Update requires: No interruption
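
As a small sketch of the Schedule property, the fragment below runs a crawler once a day. The specific time is an arbitrary choice for illustration; the expression uses the six-field cron form that AWS Glue accepts, evaluated in UTC.

```yaml
# Fragment of an AWS::Glue::Crawler Properties section (sketch only).
Schedule:
  # Fields: minutes hours day-of-month month day-of-week year.
  # This expression runs every day at 12:15 UTC.
  ScheduleExpression: "cron(15 12 * * ? *)"
```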


SchemaChangePolicy

The policy that specifies update and delete behaviors for the crawler. It tells the crawler what to do when it detects a change in a table that already exists in the customer's database at the time of the crawl. The SchemaChangePolicy does not affect whether or how new tables and partitions are added; they are always created regardless of the SchemaChangePolicy on a crawler.

The SchemaChangePolicy consists of two components, UpdateBehavior and DeleteBehavior.

Required: No

Type: SchemaChangePolicy

Update requires: No interruption
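
The two components of SchemaChangePolicy can be sketched as follows. The chosen values are only one possible combination, picked for illustration.

```yaml
# Fragment of an AWS::Glue::Crawler Properties section (sketch only).
SchemaChangePolicy:
  # What to do when an existing table's schema changes:
  # LOG or UPDATE_IN_DATABASE.
  UpdateBehavior: UPDATE_IN_DATABASE
  # What to do when a previously crawled object is gone:
  # LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE.
  DeleteBehavior: DEPRECATE_IN_DATABASE
```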


TablePrefix

The prefix added to the names of tables that are created.

Required: No

Type: String

Minimum: 0

Maximum: 128

Update requires: No interruption


Tags

The tags to use with this crawler.

Required: No

Type: Array of Tag

Update requires: No interruption


Targets

A collection of targets to crawl.

Required: Yes

Type: Targets

Update requires: No interruption
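
Because Targets is a structure rather than a plain list, a short sketch may help. The fragment below combines three target kinds; the bucket, connection, database, and table names are hypothetical placeholders, and a real crawler would normally need only the target kinds that apply to it.

```yaml
# Fragment of an AWS::Glue::Crawler Properties section (sketch only).
Targets:
  S3Targets:
    # Hypothetical bucket and prefix.
    - Path: "s3://amzn-s3-demo-bucket/logs/"
      # Glob patterns for objects the crawler should skip.
      Exclusions:
        - "**.tmp"
  JdbcTargets:
    # Hypothetical AWS Glue connection name.
    - ConnectionName: "my-jdbc-connection"
      # Hypothetical database; % matches every table in it.
      Path: "salesdb/%"
  DynamoDBTargets:
    # Hypothetical DynamoDB table name.
    - Path: "my-table"
```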

Return values


When you pass the logical ID of this resource to the intrinsic Ref function, Ref returns the crawler name.

For more information about using the Ref function, see Ref.
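
As a minimal sketch of the Ref behavior, the Outputs fragment below exposes the crawler name from a stack. It assumes the template declares a crawler with the logical ID MyCrawler2; the output name CrawlerName is a hypothetical choice.

```yaml
# Template-level Outputs section (sketch only).
Outputs:
  CrawlerName:
    Description: Name of the AWS Glue crawler
    # Ref on an AWS::Glue::Crawler resource returns the crawler name.
    Value: !Ref MyCrawler2
```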


Create a crawler

The following example creates a crawler for an Amazon S3 target.


JSON

    {
      "Description": "AWS Glue crawler test",
      "Resources": {
        "MyRole": {
          "Type": "AWS::IAM::Role",
          "Properties": {
            "AssumeRolePolicyDocument": {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Principal": { "Service": [ "glue.amazonaws.com" ] },
                  "Action": [ "sts:AssumeRole" ]
                }
              ]
            },
            "Path": "/",
            "ManagedPolicyArns": ["arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"],
            "Policies": [
              {
                "PolicyName": "S3BucketAccessPolicy",
                "PolicyDocument": {
                  "Version": "2012-10-17",
                  "Statement": [
                    {
                      "Effect": "Allow",
                      "Action": [ "s3:GetObject", "s3:PutObject" ],
                      "Resource": {
                        "Fn::Join": [ "", [ { "Fn::GetAtt": ["MyS3Bucket", "Arn"] }, "*" ] ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        },
        "MyDatabase": {
          "Type": "AWS::Glue::Database",
          "Properties": {
            "CatalogId": { "Ref": "AWS::AccountId" },
            "DatabaseInput": {
              "Name": "dbcrawler",
              "Description": "TestDatabaseDescription",
              "LocationUri": "TestLocationUri",
              "Parameters": { "key1": "value1", "key2": "value2" }
            }
          }
        },
        "MyClassifier": {
          "Type": "AWS::Glue::Classifier",
          "Properties": {
            "GrokClassifier": {
              "Name": "CrawlerClassifier",
              "Classification": "wikiData",
              "GrokPattern": "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
            }
          }
        },
        "MyS3Bucket": {
          "Type": "AWS::S3::Bucket",
          "Properties": {
            "BucketName": "crawlertesttarget",
            "AccessControl": "BucketOwnerFullControl"
          }
        },
        "MyCrawler2": {
          "Type": "AWS::Glue::Crawler",
          "Properties": {
            "Name": "testcrawler1",
            "Role": { "Fn::GetAtt": [ "MyRole", "Arn" ] },
            "DatabaseName": { "Ref": "MyDatabase" },
            "Classifiers": [ { "Ref": "MyClassifier" } ],
            "Targets": {
              "S3Targets": [ { "Path": { "Ref": "MyS3Bucket" } } ]
            },
            "SchemaChangePolicy": {
              "UpdateBehavior": "UPDATE_IN_DATABASE",
              "DeleteBehavior": "LOG"
            },
            "Tags": { "key1": "value1" },
            "Schedule": { "ScheduleExpression": "cron(0/10 * ? * MON-FRI *)" }
          }
        }
      }
    }

YAML

    Description: "AWS Glue crawler test"
    Resources:
      MyRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Principal:
                  Service:
                    - "glue.amazonaws.com"
                Action:
                  - "sts:AssumeRole"
          Path: "/"
          ManagedPolicyArns: ['arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole']
          Policies:
            - PolicyName: "S3BucketAccessPolicy"
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Effect: "Allow"
                    Action:
                      - "s3:GetObject"
                      - "s3:PutObject"
                    Resource: !Join
                      - ''
                      - - !GetAtt MyS3Bucket.Arn
                        - "*"
      MyDatabase:
        Type: AWS::Glue::Database
        Properties:
          CatalogId: !Ref AWS::AccountId
          DatabaseInput:
            Name: "dbcrawler"
            Description: "TestDatabaseDescription"
            LocationUri: "TestLocationUri"
            Parameters:
              key1: "value1"
              key2: "value2"
      MyClassifier:
        Type: AWS::Glue::Classifier
        Properties:
          GrokClassifier:
            Name: "CrawlerClassifier"
            Classification: "wikiData"
            GrokPattern: "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
      MyS3Bucket:
        Type: AWS::S3::Bucket
        Properties:
          BucketName: "crawlertesttarget"
          AccessControl: "BucketOwnerFullControl"
      MyCrawler2:
        Type: AWS::Glue::Crawler
        Properties:
          Name: "testcrawler1"
          Role: !GetAtt MyRole.Arn
          DatabaseName: !Ref MyDatabase
          Classifiers:
            - !Ref MyClassifier
          Targets:
            S3Targets:
              - Path: !Ref MyS3Bucket
          SchemaChangePolicy:
            UpdateBehavior: "UPDATE_IN_DATABASE"
            DeleteBehavior: "LOG"
          Tags:
            "key1": "value1"
          Schedule:
            ScheduleExpression: "cron(0/10 * ? * MON-FRI *)"

Crawler Configuration

The following example specifies a configuration that controls a crawler's behavior. The Configuration value is a JSON string, so its inner quotation marks are escaped. The values shown for Role, SchemaChangePolicy, Schedule, and Targets are placeholders; replace them with values of the types listed for those properties.

JSON

    {
      "Type": "AWS::Glue::Crawler",
      "Properties": {
        "Role": "role1",
        "Classifiers": [],
        "Description": "example classifier",
        "SchemaChangePolicy": "",
        "Schedule": "Schedule",
        "DatabaseName": "test",
        "Targets": [],
        "TablePrefix": "test-",
        "Name": "my-crawler",
        "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
      }
    }

YAML

    Type: AWS::Glue::Crawler
    Properties:
      Role: role1
      Classifiers: []
      Description: example classifier
      SchemaChangePolicy: ''
      Schedule: Schedule
      DatabaseName: test
      Targets: []
      TablePrefix: test-
      Name: my-crawler
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"