Configuring persistence for notebooks in Amazon S3
You can configure a JupyterHub cluster in Amazon EMR so that notebooks saved by a user persist in Amazon S3, outside of ephemeral storage on cluster EC2 instances.
You specify Amazon S3 persistence using the jupyter-s3-conf configuration classification when you create a cluster. For more information, see Configure applications.
In addition to enabling Amazon S3 persistence using the s3.persistence.enabled property, you specify a bucket in Amazon S3 where notebooks are saved using the s3.persistence.bucket property. Notebooks for each user are saved to a jupyter/jupyterhub-user-name folder in the specified bucket. The bucket must already exist in Amazon S3, and the role for the EC2 instance profile that you specify when you create the cluster must have permissions to the bucket (by default, the role is EMR_EC2_DefaultRole). For more information, see Configure IAM roles for Amazon EMR permissions to AWS services.
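As a rough illustration of the bucket permissions described above, the following Python sketch builds an IAM policy document that could be attached to the instance profile role. The helper name build_bucket_policy and the specific list of S3 actions are assumptions made for this example; this section does not state which actions JupyterHub requires, so treat the list as a starting point to verify rather than a confirmed minimum.

```python
import json


def build_bucket_policy(bucket: str, prefix: str = "jupyter/") -> dict:
    """Return an IAM policy document (as a dict) for the notebook bucket.

    Hypothetical helper. The action list below is an assumption, not
    something taken from the Amazon EMR documentation.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level access is limited to the notebook folder.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_bucket_policy("MyJupyterBackups"), indent=2))
```

Scoping the object-level statement to the jupyter/ prefix keeps the role from reading or writing unrelated objects in the same bucket.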
When you launch a new cluster using the same configuration classification properties, users can open notebooks with the content from the saved location.
Note that when Amazon S3 persistence is enabled, files that you import as modules in a notebook are uploaded to Amazon S3. When Amazon S3 persistence is not enabled, imported files are uploaded to your JupyterHub container.
The following example enables Amazon S3 persistence. Notebooks that users save are stored in the s3://MyJupyterBackups/jupyter/jupyterhub-user-name folder for each user, where jupyterhub-user-name is a user name, such as diego.
[
    {
        "Classification": "jupyter-s3-conf",
        "Properties": {
            "s3.persistence.enabled": "true",
            "s3.persistence.bucket": "MyJupyterBackups"
        }
    }
]
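If you generate cluster configurations from a script, the jupyter-s3-conf classification can be built programmatically. The following Python sketch is one possible way to do that; the function names build_jupyter_s3_conf and notebook_folder are hypothetical helpers introduced for this example, and the folder layout follows the jupyter/jupyterhub-user-name convention described in this section.

```python
import json


def build_jupyter_s3_conf(bucket: str, enabled: bool = True) -> list:
    """Return the jupyter-s3-conf classification as a list of dicts.

    Hypothetical helper. Property values are strings, matching the
    JSON example in this section.
    """
    return [
        {
            "Classification": "jupyter-s3-conf",
            "Properties": {
                "s3.persistence.enabled": "true" if enabled else "false",
                "s3.persistence.bucket": bucket,
            },
        }
    ]


def notebook_folder(bucket: str, user: str) -> str:
    """Return the S3 folder where a given user's notebooks are saved."""
    return f"s3://{bucket}/jupyter/{user}"


if __name__ == "__main__":
    print(json.dumps(build_jupyter_s3_conf("MyJupyterBackups"), indent=4))
    print(notebook_folder("MyJupyterBackups", "diego"))
```

The printed JSON can be saved to a file and supplied when you create the cluster, for example through the --configurations option of the aws emr create-cluster command.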