
Query troubleshooting in Amazon Redshift Spectrum

This topic is a reference for common issues you might encounter with Amazon Redshift Spectrum queries.

To view errors generated by Redshift Spectrum queries, query the SVL_S3LOG system table.
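
For example, the following query returns the most recent Spectrum error messages. This is a minimal sketch; narrow the filter to a specific query ID if you know it.

select query, segment, slice, message
from svl_s3log
order by query desc
limit 20;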

Retries exceeded

If an Amazon Redshift Spectrum request times out, the request is canceled and resubmitted. After five failed retries, the query fails with the following error.

error:  Spectrum Scan Error: Retries exceeded

Possible causes include the following:

  • Large file sizes (greater than 1 GB). Check your file sizes in Amazon S3 and look for large files and file size skew; the example query after this list shows one way to check scan metrics. Break up large files into smaller files, between 100 MB and 1 GB, and try to make the files about the same size.

  • Slow network throughput. Try your query later.
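
To check for retries and uneven scans from within Amazon Redshift, you can query the SVL_S3QUERY_SUMMARY system view. The following is a sketch, assuming your cluster version exposes the total_retries and max_retries columns in that view.

select query, segment, files, s3_scanned_bytes, total_retries, max_retries
from svl_s3query_summary
where total_retries > 0
order by query desc;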

Access throttled

Amazon Redshift Spectrum is subject to the service quotas of other AWS services. Under high usage, Redshift Spectrum requests might be throttled, resulting in the following error.

error:  Spectrum Scan Error: Access throttled

Two types of throttling can happen:

  • Access throttled by Amazon S3.

  • Access throttled by AWS KMS.

The error context provides more details about the type of throttling. Following, you can find causes and possible resolutions for this throttling.

Access throttled by Amazon S3

Amazon S3 might throttle a Redshift Spectrum request if the read request rate on a prefix is too high. For information about the GET/HEAD request rate that you can achieve in Amazon S3, see Optimizing Amazon S3 Performance in the Amazon Simple Storage Service User Guide. The Amazon S3 GET/HEAD request rate takes into account all GET/HEAD requests on a prefix, so different applications accessing the same prefix share the total request rate.

If your Redshift Spectrum requests frequently get throttled by Amazon S3, reduce the number of Amazon S3 GET/HEAD requests that Redshift Spectrum makes to Amazon S3. To do this, try merging small files into larger files. We recommend using file sizes of 64 MB or larger.
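
If your data is also available in a local Amazon Redshift table, one way to produce fewer, larger files is to UNLOAD it back to Amazon S3 as Parquet with a size cap. The following is a sketch; the table name, S3 path, and IAM role are hypothetical.

unload ('select * from local_sales')
to 's3://mybucket/sales-merged/'
iam_role 'arn:aws:iam::123456789012:role/myRedshiftRole'
format as parquet
maxfilesize 128 mb;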

Also consider partitioning your Redshift Spectrum tables to benefit from early filtering and to reduce the number of files accessed in Amazon S3. For more information, see Partitioning Redshift Spectrum external tables.
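
For example, the following DDL (with hypothetical schema, table, and bucket names) defines an external table partitioned by sale date, so queries that filter on saledate read only the matching prefixes.

create external table myspectrum_schema.sales (
  salesid integer,
  price decimal(8,2)
)
partitioned by (saledate date)
stored as parquet
location 's3://mybucket/sales/';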

Access throttled by AWS KMS

If you store your data in Amazon S3 using server-side encryption (SSE-S3 or SSE-KMS), Amazon S3 calls an API operation to AWS KMS for each file that Redshift Spectrum accesses. These requests count toward your cryptographic operations quota; for more information, see AWS KMS Request Quotas. For more information on SSE-S3 and SSE-KMS, see Protecting Data Using Server-Side Encryption and Protecting Data Using Server-Side Encryption with KMS keys Stored in AWS KMS in the Amazon Simple Storage Service User Guide.

A first step to reduce the number of requests that Redshift Spectrum makes to AWS KMS is to reduce the number of files accessed. To do this, try merging small files into larger files. We recommend using file sizes of 64 MB or larger.

If your Redshift Spectrum requests frequently get throttled by AWS KMS, consider requesting a quota increase for your AWS KMS request rate for cryptographic operations. To request a quota increase, see AWS Service Limits in the Amazon Web Services General Reference.

Resource limit exceeded

Redshift Spectrum enforces an upper bound on the amount of memory a request can use. A Redshift Spectrum request that requires more memory fails, resulting in the following error.

error:  Spectrum Scan Error: Resource limit exceeded

There are two common reasons that a Redshift Spectrum request can overrun its memory allowance:

  • Redshift Spectrum processes a large chunk of data that can't be split into smaller chunks.

  • Redshift Spectrum processes a large aggregation step.

We recommend using a file format that supports parallel reads with split sizes of 128 MB or less. See Data files for queries in Amazon Redshift Spectrum for supported file formats and generic guidelines for data file creation. When using file formats or compression algorithms that don't support parallel reads, we recommend keeping file sizes between 64 MB and 128 MB.

No rows returned for a partitioned table

If your query returns zero rows from a partitioned external table, check whether a partition has been added for this external table. Redshift Spectrum only scans files in an Amazon S3 location that has been explicitly added using ALTER TABLE … ADD PARTITION. Query the SVV_EXTERNAL_PARTITIONS view to find existing partitions. Run ALTER TABLE … ADD PARTITION for each missing partition.
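
For example, the following statements (with hypothetical names) first list the partitions that are registered for a table and then add a missing one.

select schemaname, tablename, values, location
from svv_external_partitions
where tablename = 'sales';

alter table myspectrum_schema.sales
add if not exists partition (saledate='2008-01-01')
location 's3://mybucket/sales/saledate=2008-01-01/';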

Not authorized error

Verify that the IAM role for the cluster allows access to the Amazon S3 file objects. If your external database is on Amazon Athena, verify that the IAM role allows access to Athena resources. For more information, see IAM policies for Amazon Redshift Spectrum.

Incompatible data formats

For a columnar file format, such as Apache Parquet, the column type is embedded with the data. The column type in the CREATE EXTERNAL TABLE definition must match the column type of the data file. If there is a mismatch, you receive an error similar to the following:

File 'https://s3bucket/location/file has an incompatible Parquet schema for column 's3://s3bucket/location.col1'. Column type: VARCHAR, Par

The error message might be truncated due to the limit on message length. To retrieve the complete error message, including column name and column type, query the SVL_S3LOG system view.

The following example queries SVL_S3LOG for the last query completed.

select message from svl_s3log where query = pg_last_query_id() order by query, segment, slice;

The following is an example of a result that shows the full error message.

                            message
-------------------------------------------------------------------------------------------
Spectrum Scan Error. File 'https://s3bucket/location/file has an incompatible
Parquet schema for column 's3://s3bucket/location.col1'.
Column type: VARCHAR, Parquet schema:\noptional int64 l_orderkey [i:0 d:1 r:0]\n

To correct the error, alter the external table to match the column type of the Parquet file.
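
External tables generally don't support altering a column's type in place, so one approach is to drop and re-create the table definition. The following is a sketch with hypothetical names; dropping an external table removes only the table definition, not the files in Amazon S3. Here l_orderkey is re-declared as bigint to match the Parquet int64 from the error message.

drop table myspectrum_schema.lineitem;

create external table myspectrum_schema.lineitem (
  l_orderkey bigint, -- bigint matches the Parquet int64; varchar did not
  l_extendedprice decimal(12,2)
)
stored as parquet
location 's3://s3bucket/location/';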

Syntax error when using Hive DDL in Amazon Redshift

Amazon Redshift supports data definition language (DDL) for CREATE EXTERNAL TABLE that is similar to Hive DDL. However, the two types of DDL aren't always exactly the same. If you copy Hive DDL to create or alter Amazon Redshift external tables, you might encounter syntax errors. The following are examples of differences between Amazon Redshift and Hive DDL, with sample DDL following the list:

  • Amazon Redshift requires single quotation marks (') where Hive DDL supports double quotation marks (").

  • Amazon Redshift doesn't support the STRING data type. Use VARCHAR instead.
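
For example, the following Hive DDL fails in Amazon Redshift because of the double quotation marks and the STRING column type.

create external table sales (salesid int, listing string)
row format delimited fields terminated by "\t"
location "s3://mybucket/sales/";

A version that Amazon Redshift accepts (the schema and bucket names are hypothetical) uses single quotation marks, VARCHAR, and the STORED AS clause that Amazon Redshift requires.

create external table myspectrum_schema.sales (salesid int, listing varchar(256))
row format delimited fields terminated by '\t'
stored as textfile
location 's3://mybucket/sales/';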

Permission to create temporary tables

To run Redshift Spectrum queries, the database user must have permission to create temporary tables in the database. The following example grants temporary permission on the database spectrumdb to the spectrumusers user group.

grant temp on database spectrumdb to group spectrumusers;

For more information, see GRANT.

Invalid range

Redshift Spectrum expects that files in Amazon S3 that belong to an external table aren't overwritten during a query. If a file is overwritten while Redshift Spectrum is reading it, the query can fail with the following error.

Error: HTTP response error code: 416 Message: InvalidRange The requested range is not satisfiable

To avoid the error, make sure Amazon S3 files are not overwritten while they are queried with Redshift Spectrum.
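
One way to avoid overwriting files in place is to write the new files to a fresh Amazon S3 prefix and then repoint the external table, as in the following sketch with hypothetical names.

alter table myspectrum_schema.sales
set location 's3://mybucket/sales-v2/';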

Invalid Parquet version number

Redshift Spectrum checks the metadata of each Apache Parquet file it accesses. If the check fails, it can result in an error similar to the following:

File 'https://s3.region.amazonaws.com/s3bucket/location/file has an invalid version number

There are two common reasons that can cause the check to fail:

  • The Parquet file has been overwritten during the query (see Invalid range).

  • The Parquet file is corrupt.