Using job parameters in Ray jobs

You set arguments for AWS Glue Ray jobs the same way you set arguments for AWS Glue for Spark jobs. For more information about the AWS Glue API, see Jobs. You can configure AWS Glue Ray jobs with different arguments, which are listed in this reference. You can also provide your own arguments.

You can configure a job through the console, on the Job details tab, under the Job Parameters heading. You can also configure a job through the AWS CLI by setting DefaultArguments on a job, or setting Arguments on a job run. Default arguments and job parameters stay with the job through multiple runs.

For example, the following is the syntax for running a job using --arguments to set a special parameter.

$ aws glue start-job-run --job-name "CSV to CSV" --arguments='--scriptLocation="s3://my_glue/libraries/test_lib.py",--test-environment="true"'

After you set the arguments, you can access job parameters from within your Ray job through environment variables. This gives you a way to configure your job for each run. The name of the environment variable will be the job argument name without the -- prefix.

For instance, in the previous example, the variable names would be scriptLocation and test-environment. You would then retrieve the argument through methods available in the standard library: test_environment = os.environ.get('test-environment'). For more information about accessing environment variables with Python, see os module in the Python documentation.
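Putting this together, here is a minimal sketch of reading job parameters from inside a Ray job script. The helper `get_job_param` is hypothetical (not part of any AWS Glue API), and the argument names match the example above:

```python
import os

def get_job_param(name, default=None):
    """Read an AWS Glue job argument from the environment.

    AWS Glue for Ray exposes each job argument as an environment
    variable named after the argument without its "--" prefix.
    """
    return os.environ.get(name, default)

# Matches the arguments from the start-job-run example above.
test_environment = get_job_param('test-environment', 'false')
script_location = get_job_param('scriptLocation')
```

Because unset arguments simply produce missing environment variables, supplying a default (as with `'false'` above) keeps the script usable on runs where the argument was omitted.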

Configure how Ray jobs generate logs

By default, Ray jobs generate logs and metrics that are sent to CloudWatch and Amazon S3. You can use the --logging_configuration parameter to alter how logs are generated. Currently, you can use it to stop Ray jobs from generating various types of logs. This parameter takes a JSON object whose keys correspond to the logs and behaviors you would like to alter. It supports the following keys:

  • CLOUDWATCH_METRICS – Configures CloudWatch metrics series that can be used to visualize job health. For more information about metrics, see Monitoring Ray jobs with metrics.

  • CLOUDWATCH_LOGS – Configures CloudWatch logs that provide Ray application-level details about the status of the job run. For more information about logs, see Troubleshooting AWS Glue for Ray errors from logs.

  • S3 – Configures what AWS Glue writes to Amazon S3, primarily the same information as CloudWatch logs, but stored as files rather than log streams.

To disable a Ray logging behavior, provide the value {\"IS_ENABLED\": \"False\"}. For example, to disable CloudWatch metrics and CloudWatch logs, provide the following configuration:

"--logging_configuration": "{\"CLOUDWATCH_METRICS\": {\"IS_ENABLED\": \"False\"}, \"CLOUDWATCH_LOGS\": {\"IS_ENABLED\": \"False\"}}"
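Rather than escaping the quotes by hand, you can build this value programmatically. The following sketch uses `json.dumps` from the Python standard library to produce the inner JSON; the keys are the ones this section lists:

```python
import json

# Build the --logging_configuration value with json.dumps instead of
# hand-escaping quotes. These keys disable CloudWatch metrics and logs.
logging_config = {
    "CLOUDWATCH_METRICS": {"IS_ENABLED": "False"},
    "CLOUDWATCH_LOGS": {"IS_ENABLED": "False"},
}

logging_configuration_value = json.dumps(logging_config)
```

Passing `logging_configuration_value` through an SDK argument map avoids the manual backslash escaping that the raw CLI string requires.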

Reference

Ray jobs recognize the following argument names that you can use to set up the script environment for your Ray jobs and job runs:

  • --logging_configuration – Used to stop the generation of various logs created by Ray jobs. These logs are generated by default on all Ray jobs. Format: String-escaped JSON object. For more information, see Configure how Ray jobs generate logs.

  • --min-workers – The minimum number of worker nodes that are allocated to a Ray job. A worker node can run multiple replicas, one per virtual CPU. Format: integer. Minimum: 0. Maximum: value specified in --number-of-workers (NumberOfWorkers) on the job definition. For more information about accounting for worker nodes, see Accounting for workers in Ray jobs.

  • --object_spilling_config – AWS Glue for Ray supports using Amazon S3 as a way of extending the space available to Ray's object store. To enable this behavior, you can provide Ray an object spilling JSON configuration object with this parameter. For more information about Ray object spilling configuration, see Object Spilling in the Ray documentation. Format: JSON object.

    AWS Glue for Ray supports either spilling to disk or spilling to Amazon S3, but not both at the same time. You can provide multiple locations for spilling, as long as they respect this limitation. When spilling to Amazon S3, you also need to grant your job IAM permissions for that bucket.

    When providing a JSON object as configuration with the CLI, you must provide it as a string, with the JSON object string-escaped. For example, a string value for spilling to one Amazon S3 path would look like: "{\"type\": \"smart_open\", \"params\": {\"uri\":\"s3path\"}}". In AWS Glue Studio, provide this parameter as a JSON object with no extra formatting.
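To illustrate the escaping step, the following sketch builds both forms from one Python dictionary. The S3 path is hypothetical; replace it with a bucket your job's IAM role can access:

```python
import json

# Hypothetical bucket path; the structure follows Ray's smart_open
# object-spilling configuration shown above.
spill_config = {
    "type": "smart_open",
    "params": {"uri": "s3://amzn-s3-demo-bucket/ray-spill/"},
}

# json.dumps once gives the plain JSON object (the form AWS Glue Studio
# expects); dumping that string again yields the string-escaped form
# that a CLI default-arguments value requires.
plain_json = json.dumps(spill_config)
cli_escaped = json.dumps(plain_json)
```

Generating the escaped string this way avoids subtle quoting mistakes that would otherwise surface only at job runtime.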

  • --object_store_memory_head – The memory allocated to the Plasma object store on the Ray head node. This instance runs cluster management services, as well as worker replicas. The value represents a percentage of free memory on the instance after a warm start. You can use this parameter to tune memory-intensive workloads; defaults are acceptable for most use cases. Format: positive integer. Minimum: 1. Maximum: 100.

    For more information about Plasma, see The Plasma In-Memory Object Store in the Ray documentation.

  • --object_store_memory_worker – The memory allocated to the Plasma object store on the Ray worker nodes. These instances only run worker replicas. The value represents a percentage of free memory on the instance after a warm start. You can use this parameter to tune memory-intensive workloads; defaults are acceptable for most use cases. Format: positive integer. Minimum: 1. Maximum: 100.

    For more information about Plasma, see The Plasma In-Memory Object Store in the Ray documentation.

  • --pip-install – A set of Python packages to be installed. You can install packages from PyPI using this argument. Format: comma-delimited list.

    A PyPI package entry is in the format package==version, with the PyPI name and version of your target package. Entries use Python version matching, such as ==, not a single equals sign (=). There are other version-matching operators. For more information, see PEP 440 on the Python website. You can also provide custom modules with --s3-py-modules.
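As an illustration of the format, the following sketch decomposes a comma-delimited --pip-install value into its PEP 440 entries. The package names and pins are examples only:

```python
# Illustrative only: how a comma-delimited --pip-install value breaks
# down into package==version entries.
pip_install_value = "pandas==2.0.3,pyarrow==12.0.1"

# Each comma-separated entry pins one package with the "==" operator.
requirements = pip_install_value.split(",")
pinned = {}
for requirement in requirements:
    name, _, version = requirement.partition("==")
    pinned[name] = version
```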

  • --s3-py-modules – A set of Amazon S3 paths that host Python module distributions. Format: comma-delimited list.

    You can use this to distribute your own modules to your Ray job. You can also provide modules from PyPI with --pip-install. Unlike with AWS Glue ETL, custom modules are not set up through pip, but are passed to Ray for distribution. For more information, see Additional Python modules for Ray jobs.

  • --working-dir – A path to a .zip file hosted in Amazon S3 that contains files to be distributed to all nodes running your Ray job. Format: string. For more information, see Providing files to your Ray job.