Requirements, differences in release versions, and security for EMR Notebooks

Note

EMR Notebooks are available as EMR Studio Workspaces in the console. The Create Workspace button in the console lets you create new notebooks. To access or create Workspaces, EMR Notebooks users need additional IAM role permissions. For more information, see Amazon EMR Notebooks are Amazon EMR Studio Workspaces in the console and Amazon EMR console.

Consider the following requirements, differences in release versions, security information, and other considerations when you create clusters and develop solutions using EMR Notebooks.

Cluster requirements

  • Enable Amazon EMR Block Public Access – Inbound access to a cluster enables cluster users to execute notebook kernels. Ensure that only authorized users can access the cluster. We strongly recommend that you leave block public access enabled, and that you limit inbound SSH traffic to only trusted sources. For more information, see Using Amazon EMR block public access and Control network traffic with security groups for your Amazon EMR cluster.

  • Use a Compatible Cluster – A cluster attached to a notebook must meet the following requirements:

    • Only clusters created using Amazon EMR are supported. You can create a cluster independently within Amazon EMR and then attach an EMR notebook, or you can create a compatible cluster when you create an EMR notebook.

    • Only clusters created using Amazon EMR release version 5.18.0 and later are supported. See Differences in capabilities by cluster release version.

    • Clusters created using Amazon EC2 instances with AMD EPYC processors—for example, m5a.* and r5a.* instance types—are not supported.

    • EMR Notebooks works only with clusters created with VisibleToAllUsers set to true. VisibleToAllUsers is true by default.

    • The cluster must be launched within an EC2-VPC. Public and private subnets are supported. The EC2-Classic platform is not supported.

    • The cluster must be launched with Hadoop, Spark, and Livy installed. Other applications may be installed, but EMR Notebooks currently supports Spark clusters only.

      Important

      For Amazon EMR release versions 5.32.0 and later, or 6.2.0 and later, your cluster must also be running the Jupyter Enterprise Gateway application in order to work with EMR Notebooks.

    • Clusters using Kerberos authentication are not supported.

    • Clusters integrated with AWS Lake Formation support the installation of notebook-scoped libraries only. Installing kernels and libraries on the cluster is not supported.

    • Clusters with multiple primary nodes are not supported.

    • Clusters using Amazon EC2 instances based on AWS Graviton2 are not supported.
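The requirements above can be sketched as a simple pre-flight check. The following Python example is a minimal sketch, not an AWS API: the cluster description is a hypothetical dictionary that you populate yourself, the instance family sets are illustrative and not exhaustive, and the application name JupyterEnterpriseGateway is an assumption about how the application is identified.

```python
# Illustrative only: these family sets are NOT exhaustive.
AMD_EPYC_FAMILIES = {"m5a", "r5a"}          # examples named in the requirements list
GRAVITON2_FAMILIES = {"m6g", "c6g", "r6g"}  # assumed examples of Graviton2 families
REQUIRED_APPLICATIONS = {"Hadoop", "Spark", "Livy"}
GATEWAY_APPLICATION = "JupyterEnterpriseGateway"  # assumed application name


def parse_release(release):
    """Turn '5.32.0' or 'emr-5.32.0' into the tuple (5, 32, 0)."""
    if release.startswith("emr-"):
        release = release[len("emr-"):]
    return tuple(int(part) for part in release.split("."))


def compatibility_problems(cluster):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    version = parse_release(cluster["release"])
    applications = set(cluster["applications"])

    if version < (5, 18, 0):
        problems.append("release version is earlier than 5.18.0")

    missing = REQUIRED_APPLICATIONS - applications
    if missing:
        problems.append("missing applications: " + ", ".join(sorted(missing)))

    # Jupyter Enterprise Gateway is required for 5.32.0+ (5.x) and 6.2.0+.
    needs_gateway = (5, 32, 0) <= version < (6, 0, 0) or version >= (6, 2, 0)
    if needs_gateway and GATEWAY_APPLICATION not in applications:
        problems.append("Jupyter Enterprise Gateway is not installed")

    if not cluster["visible_to_all_users"]:
        problems.append("VisibleToAllUsers is not true")
    if not cluster["in_vpc"]:
        problems.append("cluster is not launched within an EC2-VPC")
    if cluster["kerberos"]:
        problems.append("Kerberos authentication is enabled")
    if cluster["primary_node_count"] > 1:
        problems.append("cluster has multiple primary nodes")

    for instance_type in cluster["instance_types"]:
        family = instance_type.split(".")[0]
        if family in AMD_EPYC_FAMILIES:
            problems.append(instance_type + " uses AMD EPYC processors")
        elif family in GRAVITON2_FAMILIES:
            problems.append(instance_type + " is based on AWS Graviton2")

    return problems


# Hypothetical cluster description used only for illustration.
example_cluster = {
    "release": "emr-6.2.0",
    "applications": ["Hadoop", "Spark", "Livy", "JupyterEnterpriseGateway"],
    "visible_to_all_users": True,
    "in_vpc": True,
    "kerberos": False,
    "primary_node_count": 1,
    "instance_types": ["m5.xlarge"],
}
print(compatibility_problems(example_cluster))
```

The check returns a list instead of a single true or false value so that you can see every unmet requirement at once. It does not cover Lake Formation integration, which restricts what you can install and does not block attaching a notebook.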

Differences in capabilities by cluster release version

We strongly recommend that you use EMR Notebooks with clusters created using Amazon EMR release versions 5.30.0, 5.32.0 or later, or 6.2.0 or later. With these versions, EMR Notebooks runs kernels on the attached Amazon EMR cluster. Kernels and libraries can be installed directly on the cluster primary node. Using EMR Notebooks with these cluster versions has the following benefits:

  • Improved performance – Notebook kernels run on clusters with EC2 instance types that you select. Earlier versions run kernels on a specialized instance that cannot be resized, accessed, or customized.

  • Ability to add and customize kernels – You can connect to the cluster to install kernel packages using conda and pip. In addition, pip installation is supported using terminal commands within notebook cells. In earlier versions, only pre-installed kernels were available (Python, PySpark, Spark, and SparkR). For more information, see Installing kernels and Python libraries on a cluster primary node.

  • Ability to install Python libraries – You can install Python libraries on the cluster primary node using conda and pip. We recommend using conda. With earlier versions, only notebook-scoped libraries for PySpark are supported.

Supported EMR Notebooks features by cluster release

Cluster release version                 Notebook-scoped libraries for PySpark   Kernel installation on cluster   Python library installation on primary node
Earlier than 5.18.0                     EMR Notebooks not supported
5.18.0–5.25.0                           No                                      No                               No
5.26.0–5.29.0                           Yes                                     No                               No
5.30.0                                  Yes                                     Yes                              Yes
6.0.0                                   No                                      No                               No
5.32.0 and later, and 6.2.0 and later   Yes                                     Yes                              Yes
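If you select cluster releases programmatically, the feature table can be encoded as a lookup. The following Python example is a minimal sketch: the Features type and the notebook_features function are hypothetical names. Release versions that the table does not list, such as 6.1.0, raise an error instead of returning a guess.

```python
from typing import NamedTuple, Optional


class Features(NamedTuple):
    notebook_scoped_libraries: bool  # Notebook-scoped libraries for PySpark
    kernel_installation: bool        # Kernel installation on cluster
    library_installation: bool       # Python library installation on primary node


def notebook_features(release: str) -> Optional[Features]:
    """Return the features for a release, or None if EMR Notebooks is not supported."""
    version = tuple(int(part) for part in release.split("."))

    if version >= (6, 2, 0) or (5, 32, 0) <= version < (6, 0, 0):
        return Features(True, True, True)
    if version == (6, 0, 0):
        return Features(False, False, False)
    if version == (5, 30, 0):
        return Features(True, True, True)
    if (5, 26, 0) <= version <= (5, 29, 0):
        return Features(True, False, False)
    if (5, 18, 0) <= version <= (5, 25, 0):
        return Features(False, False, False)
    if version < (5, 18, 0):
        return None  # EMR Notebooks not supported
    # The table has no row for this release, so do not guess.
    raise ValueError("release " + release + " is not listed in the feature table")


print(notebook_features("5.28.0"))
```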

Limits for concurrently attached EMR Notebooks

When you create a cluster that supports notebooks, consider the EC2 instance type of the cluster primary node. The memory constraints of this instance type determine how many notebooks can be ready simultaneously to run code and queries on the cluster.

Primary node EC2 instance type   Number of EMR Notebooks
*.medium                         2
*.large                          4
*.xlarge                         8
*.2xlarge                        16
*.4xlarge                        24
*.8xlarge                        24
*.16xlarge                       24
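When you size the primary node, you can look up the limit by the size suffix of the instance type. The following Python example is a minimal sketch of that lookup. The function name max_attached_notebooks is hypothetical, and sizes that the table does not list return None instead of an estimate.

```python
# Limits copied from the table, keyed by instance size (the part after the dot).
NOTEBOOK_LIMITS = {
    "medium": 2,
    "large": 4,
    "xlarge": 8,
    "2xlarge": 16,
    "4xlarge": 24,
    "8xlarge": 24,
    "16xlarge": 24,
}


def max_attached_notebooks(instance_type):
    """Return the notebook limit for a primary node type such as 'm5.xlarge'."""
    size = instance_type.split(".")[-1]
    return NOTEBOOK_LIMITS.get(size)  # None when the size is not in the table


print(max_attached_notebooks("m5.xlarge"))
```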

Jupyter Notebook and Python versions

EMR Notebooks runs Jupyter Notebook version 6.0.2 and Python 3.6.5 regardless of the Amazon EMR release version of the attached cluster.

Security-related considerations

Using encrypted S3 locations

If you specify an encrypted location in Amazon S3 to store notebook files, you must set up the service role for EMR Notebooks as a key user. The default service role is EMR_Notebooks_DefaultRole. If you are using an AWS KMS key for encryption, see Using key policies in AWS KMS in the AWS Key Management Service Developer Guide and the support article for adding key users.
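As an illustration of what adding a key user can look like, the following Python example builds a key policy statement for the service role. This is a minimal sketch under assumptions: the account ID is a placeholder, the role is assumed to have no IAM path, and the action list follows the key user statement in the AWS KMS default key policy. Check the AWS KMS guide before you apply a statement like this to a real key.

```python
import json


def key_user_statement(account_id, role_name="EMR_Notebooks_DefaultRole"):
    """Build a key policy statement that lets the role use the KMS key."""
    return {
        "Sid": "Allow use of the key",
        "Effect": "Allow",
        "Principal": {
            # Assumes the role was created without an IAM path.
            "AWS": "arn:aws:iam::" + account_id + ":role/" + role_name
        },
        "Action": [
            "kms:Encrypt",
            "kms:Decrypt",
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:DescribeKey",
        ],
        "Resource": "*",
    }


# 111122223333 is a placeholder account ID.
print(json.dumps(key_user_statement("111122223333"), indent=2))
```

You add the resulting statement to the existing key policy of the KMS key. It does not replace the statements that are already in the policy.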

Using cookies with hosting domains

To augment the security of the off-console applications that you might use with Amazon EMR, the application hosting domains are registered in the Public Suffix List (PSL). Examples of these hosting domains include the following: emrstudio-prod.us-east-1.amazonaws.com, emrnotebooks-prod.us-east-1.amazonaws.com, emrappui-prod.us-east-1.amazonaws.com. For further security, if you ever need to set sensitive cookies in the default domain name, we recommend that you use cookies with a __Host- prefix. This helps to defend your domain against cross-site request forgery (CSRF) attempts. For more information, see the Set-Cookie page in the Mozilla Developer Network.
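To show what a __Host- prefixed cookie looks like, the following Python example builds a Set-Cookie header value with the standard library. This is a minimal sketch: the function name and the cookie name session are hypothetical, and the HttpOnly and SameSite attributes are extra hardening that the prefix itself does not require. Browsers accept a __Host- cookie only if it is set with Secure, with Path=/, and without a Domain attribute.

```python
from http.cookies import SimpleCookie


def host_prefixed_cookie(name, value):
    """Return a Set-Cookie header value for a __Host- prefixed cookie."""
    cookie = SimpleCookie()
    key = "__Host-" + name
    cookie[key] = value
    cookie[key]["secure"] = True        # required by the __Host- prefix
    cookie[key]["path"] = "/"           # required by the __Host- prefix
    cookie[key]["httponly"] = True      # optional hardening
    cookie[key]["samesite"] = "Strict"  # optional hardening (Python 3.8 or later)
    # No Domain attribute is set, which the __Host- prefix also requires.
    return cookie[key].OutputString()


print(host_prefixed_cookie("session", "abc123"))
```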