Optimizing query performance using column statistics
You can compute column-level statistics for AWS Glue Data Catalog tables in data formats such as Parquet, ORC, JSON, ION, CSV, and XML without setting up additional data pipelines. Column statistics help you to understand data profiles by getting insights about values within a column.
Data Catalog supports generating statistics for column values such as minimum value, maximum value, total null values, total distinct values, average length of values, and total occurrences of true values. AWS analytical services such as Amazon Redshift and Amazon Athena can use these column statistics to generate query execution plans, and choose the optimal plan that improves query performance.
There are two scenarios for generating column statistics:
- Scheduled
AWS Glue supports scheduling column statistics generation so that it can be run automatically on a recurring schedule.
With scheduled statistics computation, the column statistics task updates the overall table-level statistics, such as min, max, and avg with the new statistics, providing query engines with accurate and up-to-date statistics to optimize query execution.
- On-demand
Use this option to generate column statistics on-demand whenever needed. This is useful for ad-hoc analysis or when statistics need to be computed immediately.
You can configure to run column statistics generation task using AWS Glue console, AWS CLI, and AWS Glue API operations. When you initiate the process, AWS Glue starts a Spark job in the background and updates the AWS Glue table metadata in the Data Catalog. You can view column statistics using AWS Glue console or AWS CLI or by calling the GetColumnStatisticsForTable API operation.
Note
If you're using Lake Formation permissions to control access to the table, the role assumed by the column statistics task requires full table access to generate statistics.
The following video demonstrates how to enhance query performance using column statistics.