Launching the Spark history server
You can use a Spark history server to visualize Spark logs on your own infrastructure. You can see the same visualizations in the AWS Glue console for AWS Glue job runs on AWS Glue 4.0 or later versions with logs generated in the Standard (rather than legacy) format. For more information, see Monitoring jobs using the Apache Spark web UI.
You can launch the Spark history server using an AWS CloudFormation template that hosts the server on an EC2 instance, or launch it locally using Docker.
Launching the Spark history server and viewing the Spark UI using AWS CloudFormation
You can use an AWS CloudFormation template to start the Apache Spark history server and view the Spark web UI. These templates are samples that you should modify to meet your requirements.
To start the Spark history server and view the Spark UI using AWS CloudFormation
-
Choose one of the Launch Stack buttons in the following table. This launches the stack on the AWS CloudFormation console.
-
On the Specify template page, choose Next.
-
On the Specify stack details page, enter the Stack name. Enter additional information under Parameters.
-
Spark UI configuration
Provide the following information:
-
IP address range — The IP address range that can be used to view the Spark UI. To restrict access to a specific IP address range, use a custom value.
-
History server port — The port for the Spark UI. You can use the default value.
-
Event log directory — Choose the location where Spark event logs from the AWS Glue job or development endpoints are stored. You must use s3a:// for the event logs path scheme.
-
Spark package location — You can use the default value.
-
Keystore path — SSL/TLS keystore path for HTTPS. If you want to use a custom keystore file, you can specify the S3 path s3://path_to_your_keystore_file here. If you leave this parameter empty, a self-signed certificate based keystore is generated and used.
-
Keystore password — Enter an SSL/TLS keystore password for HTTPS.
-
EC2 instance configuration
Provide the following information:
-
Instance type — The type of Amazon EC2 instance that hosts the Spark history server. Because this template launches an Amazon EC2 instance in your account, Amazon EC2 costs are charged to your account separately.
-
Latest AMI ID — The AMI ID of Amazon Linux 2 for the Spark history server instance. You can use the default value.
-
VPC ID — The virtual private cloud (VPC) ID for the Spark history server instance. You can use any of the VPCs available in your account. Using a default VPC with a default network ACL is not recommended. For more information, see Default VPC and Default Subnets and Creating a VPC in the Amazon VPC User Guide.
-
Subnet ID — The subnet ID for the Spark history server instance. You can use any of the subnets in your VPC. Your client must be able to reach the subnet over the network. To access the server over the internet, use a public subnet that has an internet gateway in its route table.
-
Choose Next.
-
On the Configure stack options page, to use the current user credentials for determining how CloudFormation can create, modify, or delete resources in the stack, choose Next. You can also specify a role in the Permissions section to use instead of the current user permissions, and then choose Next.
-
On the Review page, review the template.
Select I acknowledge that AWS CloudFormation might create IAM resources, and then choose Create stack.
-
Wait for the stack to be created.
-
Open the Outputs tab.
-
Copy the URL of SparkUiPublicUrl if you are using a public subnet.
-
Copy the URL of SparkUiPrivateUrl if you are using a private subnet.
-
Open a web browser, and paste in the URL. This lets you access the server using HTTPS on the specified port. Your browser may not recognize the server's certificate, in which case you have to override its protection and proceed anyway.
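If you prefer the command line to the Outputs tab, the same URL can be read with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured for the account and Region that hold the stack. The function name stack_output is hypothetical and not part of the template. SparkUiPublicUrl and SparkUiPrivateUrl are the output keys named in the steps above.

```shell
# Print one output value of a CloudFormation stack.
# $1: stack name (the name you entered on the Specify stack details page)
# $2: output key, for example SparkUiPublicUrl or SparkUiPrivateUrl
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" \
    --output text
}

# Example usage (stack name is a placeholder):
#   aws cloudformation wait stack-create-complete --stack-name my-spark-ui-stack
#   stack_output my-spark-ui-stack SparkUiPublicUrl
```

The wait subcommand in the usage comment blocks until stack creation finishes, which corresponds to the "Wait for the stack to be created" step.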
Launching the Spark history server and viewing the Spark UI using Docker
If you prefer local access (without running an EC2 instance for the Apache Spark history server), you can also use Docker to start the Apache Spark history server and view the Spark UI locally. This Dockerfile is a sample that you should modify to meet your requirements.
Prerequisites
For information about how to install Docker on your laptop, see the Docker Engine community.
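Before continuing, you may want to confirm that the Docker client is available in your shell. The following is a small sketch. The helper name have_command is illustrative only.

```shell
# Return 0 if the named command is on PATH, non-zero otherwise.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

if have_command docker; then
  # Prints the installed Docker client version.
  docker --version
else
  echo "docker was not found on PATH; install Docker before continuing" >&2
fi
```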
To start the Spark history server and view the Spark UI locally using Docker
-
Download files from GitHub.
Download the Dockerfile and pom.xml from AWS Glue code samples.
-
Determine if you want to use your user credentials or federated user credentials to access AWS.
-
To use the current user credentials for accessing AWS, get the values to use for AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the docker run command. For more information, see Managing access keys for IAM users in the IAM User Guide.
-
To use SAML 2.0 federated users for accessing AWS, get the values for AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. For more information, see Requesting temporary security credentials.
-
Determine the location of your event log directory, to use in the docker run command.
-
Build the Docker image using the files in the local directory, using the name glue/sparkui, and the tag latest.
$ docker build -t glue/sparkui:latest .
-
Create and start the Docker container.
In the following commands, use the values obtained previously in steps 2 and 3.
-
To create the Docker container using your user credentials, use a command similar to the following:
docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
-
To create the Docker container using temporary credentials, use org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as the provider, and provide the credential values obtained in step 2. For more information, see Using Session Credentials with TemporaryAWSCredentialsProvider in the Hadoop: Integration with Amazon Web Services documentation.
docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY -Dspark.hadoop.fs.s3a.session.token=AWS_SESSION_TOKEN -Dspark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
Note
These configuration parameters come from the Hadoop-AWS Module. You may need to add specific configuration based on your use case. For example, users in isolated regions will need to configure spark.hadoop.fs.s3a.endpoint.
-
Open http://localhost:18080 in your browser to view the Spark UI locally.
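The two docker run variants in step 5 differ only in whether a session token and the temporary credentials provider are included. As a rough sketch, the option string can be assembled from environment variables so that one command covers both cases. The helper names to_s3a and build_history_opts are hypothetical. The -D property names are the ones used in the commands above, and the sketch assumes the credentials are already set in AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and (optionally) AWS_SESSION_TOKEN.

```shell
# Rewrite an s3:// URI to the s3a:// scheme required for the event log
# directory; any other input is printed unchanged.
to_s3a() {
  case "$1" in
    s3://*) printf 's3a://%s\n' "${1#s3://}" ;;
    *)      printf '%s\n' "$1" ;;
  esac
}

# Print the -D options for the history server.
# $1: event log directory (s3:// or s3a://)
# Runs in a subshell so its variables do not leak into the caller.
build_history_opts() (
  log_dir="$(to_s3a "$1")"
  opts="-Dspark.history.fs.logDirectory=${log_dir}"
  opts="$opts -Dspark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID:?must be set}"
  opts="$opts -Dspark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY:?must be set}"
  # Add the session token and provider only for temporary credentials.
  if [ -n "${AWS_SESSION_TOKEN:-}" ]; then
    opts="$opts -Dspark.hadoop.fs.s3a.session.token=${AWS_SESSION_TOKEN}"
    opts="$opts -Dspark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
  fi
  printf '%s\n' "$opts"
)

# Example usage (path_to_eventlog is the placeholder from step 3):
#   docker run -itd \
#     -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS $(build_history_opts s3://path_to_eventlog)" \
#     -p 18080:18080 glue/sparkui:latest \
#     "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
```

Additional settings such as spark.hadoop.fs.s3a.endpoint could be appended to the string in the same way if your environment needs them.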