AWS::Glue::Crawler
The AWS::Glue::Crawler resource specifies an AWS Glue crawler. For more information, see Cataloging Tables with a Crawler and Crawler Structure in the AWS Glue Developer Guide.
Syntax
To declare this entity in your AWS CloudFormation template, use the following syntax:
JSON
{
  "Type" : "AWS::Glue::Crawler",
  "Properties" : {
      "Classifiers" : [ String, ... ],
      "Configuration" : String,
      "CrawlerSecurityConfiguration" : String,
      "DatabaseName" : String,
      "Description" : String,
      "LakeFormationConfiguration" : LakeFormationConfiguration,
      "Name" : String,
      "RecrawlPolicy" : RecrawlPolicy,
      "Role" : String,
      "Schedule" : Schedule,
      "SchemaChangePolicy" : SchemaChangePolicy,
      "TablePrefix" : String,
      "Tags" : [ Tag, ... ],
      "Targets" : Targets
    }
}
YAML
Type: AWS::Glue::Crawler
Properties:
  Classifiers:
    - String
  Configuration: String
  CrawlerSecurityConfiguration: String
  DatabaseName: String
  Description: String
  LakeFormationConfiguration:
    LakeFormationConfiguration
  Name: String
  RecrawlPolicy:
    RecrawlPolicy
  Role: String
  Schedule:
    Schedule
  SchemaChangePolicy:
    SchemaChangePolicy
  TablePrefix: String
  Tags:
    - Tag
  Targets:
    Targets
Properties
Classifiers
A list of UTF-8 strings that specify the names of custom classifiers that are associated with the crawler.
Required: No
Type: Array of String
Update requires: No interruption
Configuration
Crawler configuration information. This versioned JSON string allows users to specify aspects of a crawler's behavior. For more information, see Configuring a Crawler.
Required: No
Type: String
Update requires: No interruption
CrawlerSecurityConfiguration
The name of the SecurityConfiguration structure to be used by this crawler.
Required: No
Type: String
Minimum: 0
Maximum: 128
Update requires: No interruption
DatabaseName
The name of the database in which the crawler's output is stored.
Required: No
Type: String
Update requires: No interruption
Description
A description of the crawler.
Required: No
Type: String
Pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]*
Minimum: 0
Maximum: 2048
Update requires: No interruption
LakeFormationConfiguration
Specifies whether the crawler should use AWS Lake Formation credentials instead of the IAM role credentials.
Required: No
Type: LakeFormationConfiguration
Update requires: No interruption
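As a minimal sketch of how this property might appear in a template, the fragment below turns on Lake Formation credentials. The subproperty names UseLakeFormationCredentials and AccountId are taken from the LakeFormationConfiguration property type rather than from this page, and the account ID is a placeholder that is assumed to matter only for cross-account crawls.

```yaml
LakeFormationConfiguration:
  # Use Lake Formation credentials instead of the IAM role credentials.
  UseLakeFormationCredentials: true
  # Placeholder account ID (assumption: relevant only for cross-account crawls).
  AccountId: "111122223333"
```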
Name
The name of the crawler.
Required: No
Type: String
Pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*
Minimum: 1
Maximum: 255
Update requires: Replacement
RecrawlPolicy
A policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run.
Required: No
Type: RecrawlPolicy
Update requires: No interruption
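A minimal sketch of an incremental crawl, assuming the RecrawlBehavior subproperty and the values defined by the RecrawlPolicy property type (they are not listed on this page):

```yaml
RecrawlPolicy:
  # Assumed allowed values: CRAWL_EVERYTHING, CRAWL_NEW_FOLDERS_ONLY, CRAWL_EVENT_MODE.
  # CRAWL_NEW_FOLDERS_ONLY crawls only folders added since the last crawler run.
  RecrawlBehavior: CRAWL_NEW_FOLDERS_ONLY
```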
Role
The Amazon Resource Name (ARN) of an IAM role that's used to access customer resources, such as Amazon Simple Storage Service (Amazon S3) data.
Required: Yes
Type: String
Update requires: No interruption
Schedule
For scheduled crawlers, the schedule when the crawler runs.
Required: No
Type: Schedule
Update requires: No interruption
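For illustration, the fragment below reuses the cron expression from the example on this page. The reading of the six fields (minutes, hours, day of month, month, day of week, year) follows the usual AWS cron syntax and should be treated as an assumption here.

```yaml
Schedule:
  # Every 10 minutes, starting at minute 0, Monday through Friday.
  ScheduleExpression: "cron(0/10 * ? * MON-FRI *)"
```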
SchemaChangePolicy
The policy that specifies update and delete behaviors for the crawler. The policy tells the crawler what to do when it detects a change in a table that already exists in the customer's database at the time of the crawl. The SchemaChangePolicy does not affect whether or how new tables and partitions are added. New tables and partitions are always created regardless of the SchemaChangePolicy on a crawler.
The SchemaChangePolicy consists of two components, UpdateBehavior and DeleteBehavior.
Required: No
Type: SchemaChangePolicy
Update requires: No interruption
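The two components can be sketched as follows, using the values from the example on this page. The comments describe the commonly documented meaning of each value and are an assumption rather than something this page states.

```yaml
SchemaChangePolicy:
  # Apply detected schema changes to the existing table definition.
  UpdateBehavior: UPDATE_IN_DATABASE
  # Only log when a previously cataloged object is no longer found.
  DeleteBehavior: LOG
```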
TablePrefix
The prefix added to the names of tables that are created.
Required: No
Type: String
Minimum: 0
Maximum: 128
Update requires: No interruption
Tags
The tags to use with this crawler.
Required: No
Type: Array of Tag
Update requires: No interruption
Targets
A collection of targets to crawl.
Required: Yes
Type: Targets
Update requires: No interruption
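A minimal sketch of a single Amazon S3 target. S3Targets and Path appear in the example on this page. The Exclusions subproperty is taken from the S3Target property type, and the bucket name, prefix, and glob pattern are hypothetical.

```yaml
Targets:
  S3Targets:
    # Hypothetical bucket and prefix to crawl.
    - Path: "s3://amzn-s3-demo-bucket/raw/"
      # Hypothetical glob pattern; assumed to skip everything under archive/.
      Exclusions:
        - "archive/**"
```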
Return values
Ref
When you pass the logical ID of this resource to the intrinsic Ref function, Ref returns the crawler name.
For more information about using the Ref function, see Ref.
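As a short sketch, a template could expose the crawler name through an output. MyCrawler2 is the logical ID used in the example on this page, and the output name CrawlerName is hypothetical.

```yaml
Outputs:
  CrawlerName:
    Description: Name of the crawler, as returned by Ref
    # Ref on an AWS::Glue::Crawler resource returns the crawler name.
    Value: !Ref MyCrawler2
```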
Examples
Create a crawler
The following example creates a crawler for an Amazon S3 target.
JSON
{
  "Description": "AWS Glue crawler test",
  "Resources": {
    "MyRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "glue.amazonaws.com"
                ]
              },
              "Action": [
                "sts:AssumeRole"
              ]
            }
          ]
        },
        "Path": "/",
        "ManagedPolicyArns": [
          "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
        ],
        "Policies": [
          {
            "PolicyName": "S3BucketAccessPolicy",
            "PolicyDocument": {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": [
                    "s3:GetObject",
                    "s3:PutObject"
                  ],
                  "Resource": {
                    "Fn::Join": [
                      "",
                      [
                        {
                          "Fn::GetAtt": ["MyS3Bucket", "Arn"]
                        },
                        "*"
                      ]
                    ]
                  }
                }
              ]
            }
          }
        ]
      }
    },
    "MyDatabase": {
      "Type": "AWS::Glue::Database",
      "Properties": {
        "CatalogId": {
          "Ref": "AWS::AccountId"
        },
        "DatabaseInput": {
          "Name": "dbcrawler",
          "Description": "TestDatabaseDescription",
          "LocationUri": "TestLocationUri",
          "Parameters": {
            "key1": "value1",
            "key2": "value2"
          }
        }
      }
    },
    "MyClassifier": {
      "Type": "AWS::Glue::Classifier",
      "Properties": {
        "GrokClassifier": {
          "Name": "CrawlerClassifier",
          "Classification": "wikiData",
          "GrokPattern": "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
        }
      }
    },
    "MyS3Bucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": {
        "BucketName": "crawlertesttarget",
        "AccessControl": "BucketOwnerFullControl"
      }
    },
    "MyCrawler2": {
      "Type": "AWS::Glue::Crawler",
      "Properties": {
        "Name": "testcrawler1",
        "Role": {
          "Fn::GetAtt": [
            "MyRole",
            "Arn"
          ]
        },
        "DatabaseName": {
          "Ref": "MyDatabase"
        },
        "Classifiers": [
          {
            "Ref": "MyClassifier"
          }
        ],
        "Targets": {
          "S3Targets": [
            {
              "Path": {
                "Ref": "MyS3Bucket"
              }
            }
          ]
        },
        "SchemaChangePolicy": {
          "UpdateBehavior": "UPDATE_IN_DATABASE",
          "DeleteBehavior": "LOG"
        },
        "Tags": {
          "key1": "value1"
        },
        "Schedule": {
          "ScheduleExpression": "cron(0/10 * ? * MON-FRI *)"
        }
      }
    }
  }
}
YAML
Resources:
  MyRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      ManagedPolicyArns:
        ['arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole']
      Policies:
        - PolicyName: "S3BucketAccessPolicy"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action:
                  - "s3:GetObject"
                  - "s3:PutObject"
                Resource: !Join
                  - ''
                  - - !GetAtt MyS3Bucket.Arn
                    - "*"
  MyDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: "dbcrawler"
        Description: "TestDatabaseDescription"
        LocationUri: "TestLocationUri"
        Parameters:
          key1: "value1"
          key2: "value2"
  MyClassifier:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        Name: "CrawlerClassifier"
        Classification: "wikiData"
        GrokPattern: "%{NOTSPACE:language} %{NOTSPACE:page_title} %{NUMBER:hits:long} %{NUMBER:retrieved_size:long}"
  MyS3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: "crawlertesttarget"
      AccessControl: "BucketOwnerFullControl"
  MyCrawler2:
    Type: AWS::Glue::Crawler
    Properties:
      Name: "testcrawler1"
      Role: !GetAtt MyRole.Arn
      DatabaseName: !Ref MyDatabase
      Classifiers:
        - !Ref MyClassifier
      Targets:
        S3Targets:
          - Path: !Ref MyS3Bucket
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Tags:
        "Key1": "Value1"
      Schedule:
        ScheduleExpression: "cron(0/10 * ? * MON-FRI *)"
Crawler Configuration
The following example specifies a configuration that controls a crawler's behavior.
JSON
{
  "Type": "AWS::Glue::Crawler",
  "Properties": {
    "Role": "role1",
    "Classifiers": [],
    "Description": "example classifier",
    "SchemaChangePolicy": "",
    "Schedule": "Schedule",
    "DatabaseName": "test",
    "Targets": [],
    "TablePrefix": "test-",
    "Name": "my-crawler",
    "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
  }
}
YAML
Type: AWS::Glue::Crawler
Properties:
  Role: role1
  Classifiers:
    - ''
  Description: example classifier
  SchemaChangePolicy: ''
  Schedule: Schedule
  DatabaseName: test
  Targets:
    - ''
  TablePrefix: test-
  Name: my-crawler
  Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"