Configuring VPC access for EMR Serverless applications to connect to data
You can configure EMR Serverless applications to connect to your data stores within your VPC, such as Amazon Redshift clusters, Amazon RDS databases or Amazon S3 buckets with VPC endpoints. Your EMR Serverless application has outbound connectivity to the data stores within your VPC. By default, EMR Serverless blocks inbound access to your applications to improve security.
Note
You must configure VPC access if you want to use an external Hive metastore database for your application. For information about how to configure an external Hive metastore, see Metastore configuration.
Create application
On the Create application page, you can choose custom settings and specify the VPC, subnets and security groups that EMR Serverless applications can use.
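The same network settings can also be supplied programmatically. The following is a minimal sketch, assuming the boto3 `emr-serverless` client and its `create_application` call with a `networkConfiguration` argument. The helper name `build_create_application_request`, the application name, the release label, and all resource IDs are hypothetical placeholders, and the subnet check is a choice of this sketch rather than a documented API rule.

```python
from typing import Any


def build_create_application_request(
    name: str,
    release_label: str,
    subnet_ids: list[str],
    security_group_ids: list[str],
    app_type: str = "SPARK",
) -> dict[str, Any]:
    """Assemble keyword arguments for an EMR Serverless create_application call."""
    # Sketch-level validation: VPC access makes no sense without a subnet.
    if not subnet_ids:
        raise ValueError("at least one subnet ID is required for VPC access")
    return {
        "name": name,
        "releaseLabel": release_label,
        "type": app_type,
        "networkConfiguration": {
            "subnetIds": list(subnet_ids),
            "securityGroupIds": list(security_group_ids),
        },
    }


# Hypothetical placeholder values; replace them with IDs from your own VPC.
request = build_create_application_request(
    name="my-vpc-application",
    release_label="emr-6.15.0",
    subnet_ids=["subnet-0aaa1111", "subnet-0bbb2222"],
    security_group_ids=["sg-0ccc3333"],
)
print(request["networkConfiguration"])

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("emr-serverless").create_application(**request)
```

Building the request separately from the call keeps the network settings easy to inspect before anything is created.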
VPCs
Choose the name of the virtual private cloud (VPC) that contains your data stores. The Create application page lists all VPCs for your chosen AWS Region.
Subnets
Choose the subnets within the VPC that contains your data store. The Create application page lists all subnets for the data stores in your VPC.
The selected subnets must be private subnets. This means that the route tables associated with the subnets must not contain routes to an internet gateway.
For outbound connectivity to the internet, the subnets must have outbound routes using a NAT Gateway. To configure a NAT Gateway, see Work with NAT gateways.
For Amazon S3 connectivity, the subnets must have a NAT Gateway or a VPC endpoint configured. To configure an S3 VPC endpoint, see Create a gateway endpoint.
For connectivity to other AWS services outside the VPC, such as Amazon DynamoDB, you must configure either VPC endpoints or a NAT gateway. To configure VPC endpoints for AWS services, see Work with VPC endpoints.
Workers can connect to the data stores within your VPC through outbound traffic. By default, EMR Serverless blocks inbound access to workers to improve security.
When you use AWS Config, EMR Serverless creates an elastic network interface item record for every worker. To avoid costs related to this resource, consider turning off AWS::EC2::NetworkInterface in AWS Config.
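The private-subnet requirement in the list above can be checked from route table data. The following is a rough sketch that assumes route tables in the shape returned by the EC2 DescribeRouteTables API (a `Routes` list whose entries may carry a `GatewayId`), and that internet gateway IDs start with `igw-`. The function names and the sample data are hypothetical.

```python
def routes_to_internet_gateway(route_table: dict) -> bool:
    """Return True if any route in the table targets an internet gateway."""
    return any(
        (route.get("GatewayId") or "").startswith("igw-")
        for route in route_table.get("Routes", [])
    )


def is_private_subnet(route_tables: list[dict]) -> bool:
    """Treat a subnet as private when none of its route tables has an igw- route."""
    return not any(routes_to_internet_gateway(rt) for rt in route_tables)


# Hand-written sample data, not real resources.
private_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123456789abcdef0"},
]}
public_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0123456789abcdef0"},
]}

print(is_private_subnet([private_table]))  # True
print(is_private_subnet([public_table]))   # False
```

A subnet without an explicit route table association uses the main route table of the VPC, so a real check would need to include that table as well.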
Note
We recommend that you select multiple subnets across multiple Availability Zones. This is because the subnets that you choose determine the Availability Zones that are available for an EMR Serverless application to launch. Each worker will consume an IP address on the subnet where it is launched. Please ensure that the specified subnets have sufficient IP addresses for the number of workers you plan to launch. For more information on subnet planning, see Best practices for subnet planning.
Security groups
Choose one or more security groups that can communicate with your data stores. The Create application page lists all security groups in your VPC. EMR Serverless associates these security groups with elastic network interfaces that are attached to your VPC subnets.
Note
We recommend that you create a separate security group for EMR Serverless applications. This makes isolating and managing network rules more efficient. For example, to communicate with Amazon Redshift clusters, you can define the traffic rules between the Redshift and EMR Serverless security groups, as demonstrated in the example below.
Example — Communication with Amazon Redshift clusters
- Add a rule for inbound traffic to the Amazon Redshift security group from one of the EMR Serverless security groups.

Type | Protocol | Port range | Source |
---|---|---|---|
All TCP | TCP | 5439 | emr-serverless-security-group |
- Add a rule for outbound traffic from one of the EMR Serverless security groups. You can do this in one of two ways. First, you can open outbound traffic to all ports.

Type | Protocol | Port range | Destination |
---|---|---|---|
All traffic | TCP | ALL | 0.0.0.0/0 |
Alternatively, you can restrict outbound traffic to Amazon Redshift clusters. This is useful only when the application must communicate with Amazon Redshift clusters and nothing else.

Type | Protocol | Port range | Destination |
---|---|---|---|
All TCP | TCP | 5439 | redshift-security-group |
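The restricted rules from the Redshift example above can be expressed as rule structures for the EC2 API. This is a minimal sketch, assuming the `IpPermissions` format accepted by the boto3 EC2 client calls `authorize_security_group_ingress` and `authorize_security_group_egress`. The helper names and the security group IDs are hypothetical placeholders.

```python
REDSHIFT_PORT = 5439  # port used in the example rules above


def port_rule_for_group(peer_security_group_id: str, port: int) -> dict:
    """Build a TCP rule that allows one port to or from another security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": peer_security_group_id}],
    }


# Hypothetical placeholder IDs for the two security groups.
emr_sg_id = "sg-0emr0000000000000"
redshift_sg_id = "sg-0redshift00000000"

# Inbound rule on the Redshift group: source is the EMR Serverless group.
ingress_rule = port_rule_for_group(emr_sg_id, REDSHIFT_PORT)
# Outbound rule on the EMR Serverless group: destination is the Redshift group.
egress_rule = port_rule_for_group(redshift_sg_id, REDSHIFT_PORT)
print(ingress_rule)
print(egress_rule)

# With boto3 installed and credentials configured, the rules could be applied as:
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(GroupId=redshift_sg_id, IpPermissions=[ingress_rule])
#   ec2.authorize_security_group_egress(GroupId=emr_sg_id, IpPermissions=[egress_rule])
```

The restricted outbound rule only has an effect if the EMR Serverless security group does not also keep a rule that allows all outbound traffic.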
Configure application
You can change the network configuration for an existing EMR Serverless application from the Configure application page.
View job run details
On the Job run detail page, you can view the subnet used by your job for a specific run. Note that a job runs only in one subnet selected from the specified subnets.
Best practices for subnet planning
AWS resources are created in a subnet which is a subset of available IP addresses in an Amazon VPC. For example, a VPC with a /16 netmask has up to 65,536 available IP addresses which can be broken into multiple smaller networks using subnet masks. As an example, you can split this range into two subnets with each using /17 mask and 32,768 available IP addresses. A subnet resides within an Availability Zone and cannot span across zones.
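The arithmetic in this paragraph can be reproduced with Python's standard `ipaddress` module. The `10.0.0.0/16` range is only an illustrative assumption; any /16 block behaves the same way.

```python
import ipaddress

# Illustrative VPC range; not taken from a real account.
vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)  # 65536

# Splitting by one extra prefix bit yields two /17 subnets.
halves = list(vpc.subnets(prefixlen_diff=1))
for subnet in halves:
    print(subnet, subnet.num_addresses)
# 10.0.0.0/17 32768
# 10.0.128.0/17 32768
```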
Design your subnets with the scaling limits of your EMR Serverless application in mind. For example, if you have an application that requests 4 vCPU workers and can scale up to 4,000 vCPU, then your application will require at most 1,000 workers for a total of 1,000 network interfaces. We recommend that you create subnets across multiple Availability Zones. This allows EMR Serverless to retry your job or provision pre-initialized capacity in a different Availability Zone in the unlikely event that an Availability Zone fails. Therefore, each subnet in at least two Availability Zones should have more than 1,000 available IP addresses.
You need subnets with a mask size less than or equal to 22 to provision 1,000 network interfaces. Any mask greater than 22 will not meet the requirement. For example, a subnet mask of /23 provides 512 IP addresses, while a mask of /22 provides 1,024 and a mask of /21 provides 2,048 IP addresses. Below is an example of 4 subnets with a /22 mask in a VPC with a /16 netmask that can be allocated to different Availability Zones. There is a difference of five between available and usable IP addresses because the first four IP addresses and the last IP address in each subnet are reserved by AWS.
Subnet ID | Subnet Address | Subnet Mask | IP Address Range | Available IP Addresses | Usable IP Addresses |
---|---|---|---|---|---|
1 | 10.0.0.0 | 255.255.252.0/22 | 10.0.0.0 - 10.0.3.255 | 1,024 | 1,019 |
2 | 10.0.4.0 | 255.255.252.0/22 | 10.0.4.0 - 10.0.7.255 | 1,024 | 1,019 |
3 | 10.0.8.0 | 255.255.252.0/22 | 10.0.8.0 - 10.0.11.255 | 1,024 | 1,019 |
4 | 10.0.12.0 | 255.255.252.0/22 | 10.0.12.0 - 10.0.15.255 | 1,024 | 1,019 |
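The sizing steps above can be sketched with the standard library. This example assumes the figures from the text (4,000 vCPU limit, 4 vCPU workers, five reserved addresses per subnet, a `10.0.0.0/16` VPC) and the /28 to /16 range allowed for IPv4 subnets in a VPC. The function names are hypothetical.

```python
import ipaddress
import math
from itertools import islice

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def workers_needed(max_vcpu: int, vcpu_per_worker: int) -> int:
    """Upper bound on workers, and so on network interfaces, at full scale."""
    return math.ceil(max_vcpu / vcpu_per_worker)


def largest_sufficient_prefix(interfaces: int) -> int:
    """Longest prefix (smallest subnet) whose usable addresses cover the demand."""
    for prefix in range(28, 15, -1):  # try /28 first, then progressively larger subnets
        usable = 2 ** (32 - prefix) - AWS_RESERVED_PER_SUBNET
        if usable >= interfaces:
            return prefix
    raise ValueError("demand exceeds the usable addresses of a /16 subnet")


workers = workers_needed(max_vcpu=4000, vcpu_per_worker=4)
prefix = largest_sufficient_prefix(workers)
print(workers, prefix)  # 1000 22

# List the first four subnets of that size, with first and last address.
vpc = ipaddress.ip_network("10.0.0.0/16")
for subnet in islice(vpc.subnets(new_prefix=prefix), 4):
    usable = subnet.num_addresses - AWS_RESERVED_PER_SUBNET
    print(subnet, subnet[0], subnet[-1], subnet.num_addresses, usable)
# 10.0.0.0/22 10.0.0.0 10.0.3.255 1024 1019
# 10.0.4.0/22 10.0.4.0 10.0.7.255 1024 1019
# 10.0.8.0/22 10.0.8.0 10.0.11.255 1024 1019
# 10.0.12.0/22 10.0.12.0 10.0.15.255 1024 1019
```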
Evaluate whether your workload is better suited to larger worker sizes. Larger workers require fewer network interfaces. For example, using 16 vCPU workers with an application scaling limit of 4,000 vCPU requires at most 250 workers, and therefore at most 250 available IP addresses to provision network interfaces. You need subnets in multiple Availability Zones with a mask size less than or equal to 24 to provision 250 network interfaces. Any mask size greater than 24 offers fewer than 250 IP addresses.
If you share subnets across multiple applications, design each subnet with the collective scaling limits of all your applications in mind. For example, if you have 3 applications that request 4 vCPU workers and each can scale up to 4,000 vCPU, within a 12,000 vCPU account-level service quota, each subnet will require 3,000 available IP addresses. If the VPC that you want to use doesn't have a sufficient number of IP addresses, try to increase the number of available IP addresses. You can do this by associating additional Classless Inter-Domain Routing (CIDR) blocks with your VPC. For more information, see Associate additional IPv4 CIDR blocks with your VPC in the Amazon VPC User Guide.
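The shared-subnet calculation can be sketched as a sum over applications. The application names and the dictionary layout below are hypothetical; the scaling figures and the five reserved addresses per subnet come from the text.

```python
import math

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def interfaces_required(apps: list[dict]) -> int:
    """Total network interfaces if every application scales to its limit at once."""
    return sum(math.ceil(app["max_vcpu"] / app["vcpu_per_worker"]) for app in apps)


def usable_addresses(prefix: int) -> int:
    """Usable IPv4 addresses in a VPC subnet with the given prefix length."""
    return 2 ** (32 - prefix) - AWS_RESERVED_PER_SUBNET


# Three hypothetical applications that share the same subnets.
apps = [
    {"name": "app-1", "max_vcpu": 4000, "vcpu_per_worker": 4},
    {"name": "app-2", "max_vcpu": 4000, "vcpu_per_worker": 4},
    {"name": "app-3", "max_vcpu": 4000, "vcpu_per_worker": 4},
]
total = interfaces_required(apps)
print(total)  # 3000

for prefix in (22, 21, 20):
    print(prefix, usable_addresses(prefix), usable_addresses(prefix) >= total)
# 22 1019 False
# 21 2043 False
# 20 4091 True

# The same applications with 16 vCPU workers need far fewer interfaces.
larger_workers = [{**app, "vcpu_per_worker": 16} for app in apps]
print(interfaces_required(larger_workers))  # 750
```

Under these assumptions a shared subnet would need a /20 mask or larger address range, while switching to 16 vCPU workers brings the requirement within a /22.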
You can use one of the many tools available online to quickly generate subnet definitions and review their available range of IP addresses.