Data preparation at scale using Amazon EMR Serverless applications or Amazon EMR clusters in Studio
Amazon SageMaker Studio and its legacy version, Studio Classic, provide data scientists and machine learning (ML) engineers with tools to perform data analytics and data preparation at scale. Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. Both Studio and Studio Classic come with built-in integration with Amazon EMR, allowing users to manage large-scale, interactive data preparation and machine learning workflows within their JupyterLab notebooks.
Amazon EMR is a managed big data platform with resources to help you run petabyte-scale distributed data processing jobs on AWS using open-source analytics frameworks such as Apache Spark.
You should consider Amazon EMR clusters for your data preparation workloads if you have large-scale, long-running, or complex data processing requirements that involve massive amounts of data, require extensive customization and integration with other services, need to run custom applications, or plan to run a diverse range of distributed data processing frameworks beyond just Apache Spark.
With SageMaker Distribution image version 1.10 or later, you can alternatively connect to interactive EMR Serverless applications directly from your JupyterLab notebooks in Studio. The integration of Studio with EMR Serverless allows you to run open-source big data analytics frameworks such as Apache Spark.
You should consider EMR Serverless for your interactive data preparation workloads if your workloads are short-lived or intermittent and don't require a persistent cluster, if you prefer a serverless experience with automatic resource provisioning and termination that avoids the overhead of managing infrastructure, or if your interactive data preparation tasks primarily revolve around Apache Spark.
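The selection criteria above can be sketched as a small decision helper. This is only an illustration of the guidance in this section, not an AWS API: the `Workload` class, its field names, and the `suggest_emr_option` function are hypothetical, and real decisions will usually weigh cost, data volume, and other factors that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a data preparation workload."""
    long_running: bool = False                   # large-scale, long-running, or complex jobs
    needs_custom_applications: bool = False      # must run custom applications
    needs_extensive_customization: bool = False  # heavy customization or service integration
    frameworks: frozenset = frozenset({"spark"}) # distributed frameworks the workload uses


def suggest_emr_option(workload: Workload) -> str:
    """Return 'EMR cluster' or 'EMR Serverless' based on the criteria in this section."""
    # Any framework other than Spark counts as "beyond just Apache Spark".
    beyond_spark = bool(workload.frameworks - {"spark"})
    if (
        workload.long_running
        or workload.needs_custom_applications
        or workload.needs_extensive_customization
        or beyond_spark
    ):
        return "EMR cluster"
    # Short-lived or intermittent, Spark-centric work with no need for a persistent cluster.
    return "EMR Serverless"


if __name__ == "__main__":
    print(suggest_emr_option(Workload()))
    print(suggest_emr_option(Workload(frameworks=frozenset({"spark", "flink"}))))
```

A Spark-only, intermittent workload maps to EMR Serverless in this sketch, while any of the cluster-oriented criteria tips the suggestion toward an Amazon EMR cluster.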