


# Loading data in Amazon Redshift
<a name="t_Loading_data"></a>

There are several ways to load data into an Amazon Redshift database. One popular source of data to load is Amazon S3 files. The following table summarizes some of the methods you can use when starting from an Amazon S3 source.


| Method to use | Description | When method needed | 
| --- | --- | --- | 
| COPY command | Runs a batch file ingestion to load data from your Amazon S3 files. This method leverages parallel processing capabilities of Amazon Redshift. For more information, see [Loading tables with the COPY command](t_Loading_tables_with_the_COPY_command.md). | Should be used when basic data loading requirements to initiate batch file ingestion manually is needed. This method is used mostly with custom and third-party file ingestion pipelines or one-time, or ad hoc, file ingestion workloads. | 
| COPY... CREATE JOB command (auto-copy) | Runs your COPY commands automatically when a new file is created on tracked Amazon S3 paths. For more information, see [Create an S3 event integration to automatically copy files from Amazon S3 buckets](loading-data-copy-job.md). | Should be used when a file ingestion pipeline needs to automatically ingest data when a new file is created on Amazon S3. Amazon Redshift keeps track of ingested files to prevent data duplication. This method requires configuration by Amazon S3 bucket owners. | 
| Load from data lake queries | Create external tables to run data lake queries on your Amazon S3 files and then run INSERT INTO command to load results from your data lake queries into local tables. For more information, see [External tables for Redshift Spectrum](c-spectrum-external-tables.md). | Should be used in any of the following scenarios:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/t_Loading_data.html) | 
| Other methods that you can consider | | | 
| Streaming ingestion  | Streaming ingestion provides low-latency, high-speed ingestion of stream data from Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka into an Amazon Redshift provisioned or Redshift Serverless materialized view. For more information, see [Getting started with streaming ingestion from Amazon Kinesis Data Streams](materialized-view-streaming-ingestion-getting-started.md) and [Getting started with streaming ingestion from Apache Kafka sources](materialized-view-streaming-ingestion-getting-started-MSK.md). | Should be considered for use cases when data is first streamed into files on Amazon S3 and then loaded from Amazon S3. If keeping data on Amazon S3 is not needed, you can often consider streaming your data directly into Amazon Redshift.  | 
| Running data lake queries | Running queries directly from a data lake table instead of ingesting contents of the table into a local table. For more information, see [Amazon Redshift Spectrum](c-using-spectrum.md). | Should be used when the use case doesn't require the performance of local table queries in Amazon Redshift. | 
| Batch loading using Amazon Redshift query editor v2 | You can prepare and run your batch file ingestion workloads visually on Amazon Redshift query editor v2. For more information, see [Loading data from S3](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-loading.html#query-editor-v2-loading-data) in the *Amazon Redshift Management Guide*. | Should be used when you want the query editor v2 to prepare COPY statements and you want a visual tool to simplify the COPY statement preparation process. | 
| Load data from a local file using Amazon Redshift query editor v2 | You can directly upload files from your desktop into Amazon Redshift tables without the need for manually uploading your files into Amazon S3. For more information, see [Loading data from a local file setup and workflow](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-loading.html#query-editor-v2-loading-data-local) in the *Amazon Redshift Management Guide*. | Should be used when you need to quickly load files from your local computer for one-time querying purposes. With this method, Amazon Redshift query editor v2 temporarily stores the file on a customer-owned Amazon S3 bucket and runs a copy command using this Amazon S3 path. | 

A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though it is much less efficient than using COPY. The COPY command is able to read from multiple data files or multiple data streams simultaneously. Amazon Redshift allocates the workload to the Amazon Redshift nodes and performs the load operations in parallel, including sorting the rows and distributing data across node slices.

**Note**  
Amazon Redshift Spectrum external tables are read-only. You can't COPY or INSERT to an external table.

To access data on other AWS resources, Amazon Redshift must have permission to access those resources and to perform the necessary actions to access the data. You can use AWS Identity and Access Management (IAM) to limit the access users have to Amazon Redshift resources and data.

After your initial data load, if you add, modify, or delete a significant amount of data, you should follow up by running a VACUUM command to reorganize your data and reclaim space after deletes. You should also run an ANALYZE command to update table statistics.
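As a sketch, after a large batch of changes you might run the following two commands; the table name `sales` is illustrative:

```
VACUUM sales;
ANALYZE sales;
```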

**Topics**
+ [Loading tables with the COPY command](t_Loading_tables_with_the_COPY_command.md)
+ [Create an S3 event integration to automatically copy files from Amazon S3 buckets](loading-data-copy-job.md)
+ [Loading tables with DML commands](t_Updating_tables_with_DML_commands.md)
+ [Performing a deep copy](performing-a-deep-copy.md)
+ [Analyzing tables](t_Analyzing_tables.md)
+ [Vacuuming tables](t_Reclaiming_storage_space202.md)
+ [Managing concurrent write operations](c_Concurrent_writes.md)
+ [Tutorial: Loading data from Amazon S3](tutorial-loading-data.md)

# Loading tables with the COPY command
<a name="t_Loading_tables_with_the_COPY_command"></a>

The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from files on Amazon S3, from a DynamoDB table, or from text output from one or more remote hosts.

Before learning all the options of the COPY command, we recommend learning the basic options to load Amazon S3 data. The *Amazon Redshift Getting Started Guide* demonstrates a simple use of the COPY command to load Amazon S3 data using a default IAM role. See [Step 4: Load data from Amazon S3 to Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-create-sample-db.html) for details.

**Note**  
We strongly recommend using the COPY command to load large amounts of data. Using individual INSERT statements to populate a table might be prohibitively slow. Alternatively, if your data already exists in other Amazon Redshift database tables, use INSERT INTO ... SELECT or CREATE TABLE AS to improve performance. For information, see [INSERT](r_INSERT_30.md) or [CREATE TABLE AS](r_CREATE_TABLE_AS.md).
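For example, either of the following statements (with illustrative table names) loads rows from an existing Amazon Redshift table far faster than individual INSERT statements:

```
-- Copy existing rows into an existing table
INSERT INTO sales_archive
SELECT * FROM sales WHERE saletime < '2008-01-01';

-- Or create and populate a new table in one step
CREATE TABLE sales_2007 AS
SELECT * FROM sales WHERE saletime < '2008-01-01';
```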

To load data from another AWS resource, Amazon Redshift must have permission to access the resource and perform the necessary actions. 

To grant or revoke the privilege to load data into a table using a COPY command, grant or revoke the INSERT privilege.
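For example, with an illustrative user and table:

```
-- Allow dbuser to load the venue table with COPY
GRANT INSERT ON TABLE venue TO dbuser;

-- Remove that ability
REVOKE INSERT ON TABLE venue FROM dbuser;
```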

Your data needs to be in the proper format for loading into your Amazon Redshift table. This section presents guidelines for preparing and verifying your data before the load and for validating a COPY statement before you run it.

To protect the information in your files, you can encrypt the data files before you upload them to your Amazon S3 bucket; COPY will decrypt the data as it performs the load. You can also limit access to your load data by providing temporary security credentials to users. Temporary security credentials provide enhanced security because they have short life spans and cannot be reused after they expire.

Amazon Redshift has features built in to COPY to load uncompressed, delimited data quickly. But you can compress your files using gzip, lzop, or bzip2 to save time uploading the files.
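For instance, loading gzip-compressed files only requires adding the corresponding file-format option; the bucket path and role ARN here are placeholders:

```
COPY venue FROM 's3://amzn-s3-demo-bucket/venue/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP;
```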

Automatic splitting of uncompressed data is not supported when the COPY command includes the ESCAPE, REMOVEQUOTES, or FIXEDWIDTH keywords. The CSV keyword is supported, however.

To help keep your data secure in transit within the AWS Cloud, Amazon Redshift uses hardware accelerated SSL to communicate with Amazon S3 or Amazon DynamoDB for COPY, UNLOAD, backup, and restore operations.

When you load your table directly from an Amazon DynamoDB table, you have the option to control the amount of Amazon DynamoDB provisioned throughput you consume.

You can optionally let COPY analyze your input data and automatically apply optimal compression encodings to your table as part of the load process.
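A minimal sketch of that option; COPY applies automatic compression when COMPUPDATE is on and the target table is empty (table and path are illustrative):

```
COPY listing FROM 's3://amzn-s3-demo-bucket/listing/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
COMPUPDATE ON;
```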

**Topics**
+ [Credentials and access permissions](loading-data-access-permissions.md)
+ [Preparing your input data](t_preparing-input-data.md)
+ [Loading data from Amazon S3](t_Loading-data-from-S3.md)
+ [Loading data from Amazon EMR](loading-data-from-emr.md)
+ [Loading data from remote hosts](loading-data-from-remote-hosts.md)
+ [Loading data from an Amazon DynamoDB table](t_Loading-data-from-dynamodb.md)
+ [Verifying that the data loaded correctly](verifying-that-data-loaded-correctly.md)
+ [Validating input data](t_Validating_input_files.md)
+ [Loading tables with automatic compression](c_Loading_tables_auto_compress.md)
+ [Optimizing storage for narrow tables](c_load_compression_hidden_cols.md)
+ [Loading default column values](c_loading_default_values.md)
+ [Troubleshooting data loads](t_Troubleshooting_load_errors.md)

# Credentials and access permissions
<a name="loading-data-access-permissions"></a>

 To load or unload data using another AWS resource, such as Amazon S3, Amazon DynamoDB, Amazon EMR, or Amazon EC2, Amazon Redshift must have permission to access the resource and perform the necessary actions to access the data. For example, to load data from Amazon S3, COPY must have LIST access to the bucket and GET access for the bucket objects. 

To obtain authorization to access a resource, Amazon Redshift must be authenticated. You can choose either role-based access control or key-based access control. This section presents an overview of the two methods. For complete details and examples, see [Permissions to access other AWS Resources](copy-usage_notes-access-permissions.md).

## Role-based access control
<a name="loading-data-access-role-based"></a>

With role-based access control, Amazon Redshift temporarily assumes an AWS Identity and Access Management (IAM) role on your behalf. Then, based on the authorizations granted to the role, Amazon Redshift can access the required AWS resources.

We recommend using role-based access control because it provides more secure, fine-grained control of access to AWS resources and sensitive user data, in addition to safeguarding your AWS credentials.

To use role-based access control, you must first create an IAM role using the Amazon Redshift service role type, and then attach the role to your data warehouse. The role must have, at a minimum, the permissions listed in [IAM permissions for COPY, UNLOAD, and CREATE LIBRARY](copy-usage_notes-access-permissions.md#copy-usage_notes-iam-permissions). For steps to create an IAM role and attach it to your cluster, see [Creating an IAM Role to Allow Your Amazon Redshift Cluster to Access AWS Services](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html#authorizing-redshift-service-creating-an-iam-role) in the *Amazon Redshift Management Guide*.

You can add a role to a cluster or view the roles associated with a cluster by using the Amazon Redshift Management Console, CLI, or API. For more information, see [Authorizing COPY and UNLOAD Operations Using IAM Roles](https://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role.html) in the *Amazon Redshift Management Guide*.

When you create an IAM role, IAM returns an Amazon Resource Name (ARN) for the role. To run a COPY command using an IAM role, provide the role ARN using the IAM_ROLE parameter or the CREDENTIALS parameter.

The following COPY command example uses the IAM_ROLE parameter with the role `MyRedshiftRole` for authentication.

```
COPY customer FROM 's3://amzn-s3-demo-bucket/mydata' 
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';
```

The AWS user must have, at a minimum, the permissions listed in [IAM permissions for COPY, UNLOAD, and CREATE LIBRARY](copy-usage_notes-access-permissions.md#copy-usage_notes-iam-permissions).

## Key-based access control
<a name="loading-data-access-key-based"></a>

With key-based access control, you provide the access key ID and secret access key for a user that is authorized to access the AWS resources that contain the data. 
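The corresponding COPY syntax looks like the following; the values are placeholders, and the SESSION_TOKEN parameter applies only when you use temporary credentials:

```
COPY customer FROM 's3://amzn-s3-demo-bucket/mydata'
ACCESS_KEY_ID '<access-key-id>'
SECRET_ACCESS_KEY '<secret-access-key>'
SESSION_TOKEN '<temporary-session-token>';
```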

**Note**  
We strongly recommend using an IAM role for authentication instead of supplying a plain-text access key ID and secret access key. If you choose key-based access control, never use your AWS account (root) credentials. Always create an IAM user and provide that user's access key ID and secret access key. For steps to create an IAM user, see [Creating an IAM User in Your AWS Account](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html).

# Preparing your input data
<a name="t_preparing-input-data"></a>

If your input data is not compatible with the table columns that will receive it, the COPY command will fail.

Use the following guidelines to help ensure that your input data is valid: 
+ Your data can only contain UTF-8 characters up to four bytes long.
+ Verify that CHAR and VARCHAR strings are no longer than the lengths of the corresponding columns. VARCHAR strings are measured in bytes, not characters, so, for example, a four-character string of Chinese characters that occupy four bytes each requires a VARCHAR(16) column.
+ Multibyte characters can only be used with VARCHAR columns. Verify that multibyte characters are no more than four bytes long.
+ Verify that data for CHAR columns only contains single-byte characters.
+ Do not include any special characters or syntax to indicate the last field in a record. This field can be a delimiter.
+ If your data includes null terminators, also referred to as NUL (UTF-8 0000) or binary zero (0x000), you can load these characters as NULLS into CHAR or VARCHAR columns by using the NULL AS option in the COPY command: `null as '\0'` or `null as '\000'` . If you do not use NULL AS, null terminators will cause your COPY to fail.
+ If your strings contain special characters, such as delimiters and embedded newlines, use the ESCAPE option with the [COPY](r_COPY.md) command.
+ Verify that all single and double quotation marks are appropriately matched.
+ Verify that floating-point strings are in either standard floating-point format, such as 12.123, or an exponential format, such as 1.0E4.
+ Verify that all timestamp and date strings follow the specifications for [DATEFORMAT and TIMEFORMAT strings](r_DATEFORMAT_and_TIMEFORMAT_strings.md). The default timestamp format is YYYY-MM-DD hh:mm:ss, and the default date format is YYYY-MM-DD.
+ For more information about boundaries and limitations on individual data types, see [Data types](c_Supported_data_types.md). For information about multibyte character errors, see [Multibyte character load errors](multi-byte-character-load-errors.md)
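For example, to load null terminators in the input as NULL values, as described in the guidelines above (table and path are illustrative):

```
COPY category FROM 's3://amzn-s3-demo-bucket/category/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
NULL AS '\0';
```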

# Loading data from Amazon S3
<a name="t_Loading-data-from-S3"></a>

The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from a file or multiple files in an Amazon S3 bucket. You can take maximum advantage of parallel processing by splitting your data into multiple files, in cases where the files are compressed. (There are exceptions to this rule. These are detailed in [Loading data files](https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-multiple-files.html).) You can also take maximum advantage of parallel processing by setting distribution keys on your tables. For more information about distribution keys, see [Data distribution for query optimization](t_Distributing_data.md). 

Data is loaded into the target table, one line per row. The fields in the data file are matched to table columns in order, left to right. Fields in the data files can be fixed-width or character delimited; the default delimiter is a pipe character ( | ). By default, all the table columns are loaded, but you can optionally define a comma-separated list of columns. If a table column is not included in the column list specified in the COPY command, it is loaded with a default value. For more information, see [Loading default column values](c_loading_default_values.md).

**Topics**
+ [Loading data from compressed and uncompressed files](t_splitting-data-files.md)
+ [Uploading files to Amazon S3 to use with COPY](t_uploading-data-to-S3.md)
+ [Using the COPY command to load from Amazon S3](t_loading-tables-from-s3.md)

# Loading data from compressed and uncompressed files
<a name="t_splitting-data-files"></a>

When you load compressed data, we recommend that you split the data for each table into multiple files. When you load uncompressed, delimited data, the COPY command uses massively parallel processing (MPP) and scan ranges to load data from large files in an Amazon S3 bucket.

## Loading data from multiple compressed files
<a name="t_splitting-data-files-compressed"></a>

In cases where you have compressed data, we recommend that you split the data for each table into multiple files. The COPY command can load data from multiple files in parallel. You can load multiple files by specifying a common prefix, or *prefix key*, for the set, or by explicitly listing the files in a manifest file.

Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way, Amazon Redshift can divide the data evenly among the slices. The number of slices per node depends on the node size of the cluster. For example, each dc2.large compute node has two slices, and each dc2.8xlarge compute node has 16 slices. For more information about the number of slices that each node size has, see [About clusters and nodes](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-about-clusters-and-nodes) in the *Amazon Redshift Management Guide*. 

The nodes all participate in running parallel queries, working on data that is distributed as evenly as possible across the slices. If you have a cluster with two dc2.large nodes, you might split your data into four files or some multiple of four. Amazon Redshift doesn't take file size into account when dividing the workload. Thus, you need to ensure that the files are roughly the same size, from 1 MB to 1 GB after compression. 
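You can check the number of slices in your own cluster by querying the STV_SLICES system view, for example:

```
SELECT COUNT(*) AS slice_count FROM stv_slices;
```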

To use object prefixes to identify the load files, name each file with a common prefix. For example, you might split the `venue.txt` file into four files, as follows.

```
venue.txt.1
venue.txt.2
venue.txt.3
venue.txt.4
```

If you put multiple files in a folder in your bucket and specify the folder name as the prefix, COPY loads all of the files in the folder. If you explicitly list the files to be loaded by using a manifest file, the files can reside in different buckets or folders.

For more information about manifest files, see [Using a manifest to specify data files](r_COPY_command_examples.md#copy-command-examples-manifest).

## Loading data from uncompressed, delimited files
<a name="t_splitting-data-files-uncompressed"></a>

When you load uncompressed, delimited data, the COPY command uses the massively parallel processing (MPP) architecture in Amazon Redshift. Amazon Redshift automatically uses slices working in parallel to load ranges of data from a large file in an Amazon S3 bucket. The file must be delimited for parallel loading to occur. For example, pipe delimited. Automatic, parallel data loading with the COPY command is also available for CSV files. You can also take advantage of parallel processing by setting distribution keys on your tables. For more information about distribution keys, see [Data distribution for query optimization](t_Distributing_data.md).

Automatic, parallel data loading isn't supported when the COPY query includes any of the following keywords: ESCAPE, REMOVEQUOTES, and FIXEDWIDTH.

Data from the file or files is loaded into the target table, one line per row. The fields in the data file are matched to table columns in order, left to right. Fields in the data files can be fixed-width or character delimited; the default delimiter is a pipe (\$1). By default, all the table columns are loaded, but you can optionally define a comma-separated list of columns. If a table column isn't included in the column list specified in the COPY command, it's loaded with a default value. For more information, see [Loading default column values](c_loading_default_values.md).

Follow this general process to load data from Amazon S3, when your data is uncompressed and delimited:

1. Upload your files to Amazon S3.

1. Run a COPY command to load the table. 

1. Verify that the data was loaded correctly.

For examples of COPY commands, see [COPY examples](r_COPY_command_examples.md). For information about data loaded into Amazon Redshift, check the [STL\_LOAD\_COMMITS](r_STL_LOAD_COMMITS.md) and [STL\_LOAD\_ERRORS](r_STL_LOAD_ERRORS.md) system tables. 
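A quick verification sketch against those system tables; the file-name pattern is illustrative:

```
-- Confirm which files were committed by recent loads
SELECT query, filename, lines_scanned
FROM stl_load_commits
WHERE filename LIKE '%venue%'
ORDER BY query;

-- Inspect any recent load errors
SELECT starttime, filename, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
```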

For more information about nodes and the slices contained in each, see [About clusters and nodes](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-about-clusters-and-nodes) in the *Amazon Redshift Management Guide*.

# Uploading files to Amazon S3 to use with COPY
<a name="t_uploading-data-to-S3"></a>

There are a couple of approaches to take when uploading text files to Amazon S3:
+ If you have compressed files, we recommend that you split large files to take advantage of parallel processing in Amazon Redshift.
+ On the other hand, COPY automatically splits large, uncompressed, text-delimited file data to facilitate parallelism and effectively distribute the data from large files.

Create an Amazon S3 bucket to hold your data files, and then upload the data files to the bucket. For information about creating buckets and uploading files, see [Working with Amazon S3 Buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html) in the *Amazon Simple Storage Service User Guide.* 

**Important**  
The Amazon S3 bucket that holds the data files must be created in the same AWS Region as your cluster unless you use the [REGION](copy-parameters-data-source-s3.md#copy-region) option to specify the Region in which the Amazon S3 bucket is located.
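For example, a cluster in one Region could load from a bucket in `us-west-2` as follows; the table, path, and role ARN are placeholders:

```
COPY venue FROM 's3://amzn-s3-demo-bucket/venue/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
REGION 'us-west-2';
```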

Ensure that the S3 IP ranges are added to your allowlist. To learn more about the required S3 IP ranges, see [Network isolation](https://docs.aws.amazon.com/redshift/latest/mgmt/security-network-isolation.html#network-isolation).

You can create an Amazon S3 bucket in a specific Region either by selecting the Region when you create the bucket by using the Amazon S3 console, or by specifying an endpoint when you create the bucket using the Amazon S3 API or CLI.

Following the data load, verify that the correct files are present on Amazon S3.

**Topics**
+ [Managing data consistency](managing-data-consistency.md)
+ [Uploading encrypted data to Amazon S3](t_uploading-encrypted-data.md)
+ [Verifying that the correct files are present in your bucket](verifying-that-correct-files-are-present.md)

# Managing data consistency
<a name="managing-data-consistency"></a>

Amazon S3 provides strong read-after-write consistency for COPY, UNLOAD, INSERT (external table), CREATE EXTERNAL TABLE AS, and Amazon Redshift Spectrum operations on Amazon S3 buckets in all AWS Regions. In addition, read operations on Amazon S3 Select, Amazon S3 Access Control Lists, Amazon S3 Object Tags, and object metadata (for example, HEAD object) are strongly consistent. For more information about data consistency, see [Amazon S3 Data Consistency Model](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Introduction.html#ConsistencyModel) in the *Amazon Simple Storage Service User Guide*.

# Uploading encrypted data to Amazon S3
<a name="t_uploading-encrypted-data"></a>

Amazon S3 supports both server-side encryption and client-side encryption. This topic discusses the differences between the server-side and client-side encryption and describes the steps to use client-side encryption with Amazon Redshift. Server-side encryption is transparent to Amazon Redshift. 

## Server-side encryption
<a name="server-side-encryption"></a>

Server-side encryption is data encryption at rest—that is, Amazon S3 encrypts your data as it uploads it and decrypts it for you when you access it. When you load tables using a COPY command, there is no difference in the way you load from server-side encrypted or unencrypted objects on Amazon S3. For more information about server-side encryption, see [Using Server-Side Encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html) in the *Amazon Simple Storage Service User Guide*.

## Client-side encryption
<a name="client-side-encryption"></a>

In client-side encryption, your client application manages encryption of your data, the encryption keys, and related tools. You can upload data to an Amazon S3 bucket using client-side encryption, and then load the data using the COPY command with the ENCRYPTED option and a private encryption key to provide greater security.
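A hedged sketch of that COPY; the MASTER_SYMMETRIC_KEY parameter supplies the base64-encoded root key that was used for client-side encryption (path and ARN are placeholders):

```
COPY customer FROM 's3://amzn-s3-demo-bucket/encrypted/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MASTER_SYMMETRIC_KEY '<base64-encoded-root-key>'
ENCRYPTED;
```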

You encrypt your data using envelope encryption. With *envelope encryption*, your application handles all encryption exclusively. Your private encryption keys and your unencrypted data are never sent to AWS, so it's very important that you safely manage your encryption keys. If you lose your encryption keys, you won't be able to decrypt your data, and you can't recover your encryption keys from AWS. Envelope encryption combines the performance of fast symmetric encryption with the greater security that separate key management provides. A one-time-use symmetric key (the envelope symmetric key) is generated by your Amazon S3 encryption client to encrypt your data; that key is then encrypted by your root key and stored alongside your data in Amazon S3. When Amazon Redshift accesses your data during a load, the encrypted symmetric key is retrieved and decrypted with your root key, and then the data is decrypted.

To work with Amazon S3 client-side encrypted data in Amazon Redshift, follow the steps outlined in [Protecting Data Using Client-Side Encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingClientSideEncryption.html) in the *Amazon Simple Storage Service User Guide*, with the additional requirements that you use:
+ **Symmetric encryption –** The AWS SDK for Java `AmazonS3EncryptionClient` class uses envelope encryption, described preceding, which is based on symmetric key encryption. Use this class to create an Amazon S3 client to upload client-side encrypted data.
+ **A 256-bit AES root symmetric key –** A root key encrypts the envelope key. You pass the root key to your instance of the `AmazonS3EncryptionClient` class. Save this key, because you will need it to copy data into Amazon Redshift.
+ **Object metadata to store encrypted envelope key –** By default, Amazon S3 stores the envelope key as object metadata for the `AmazonS3EncryptionClient` class. The encrypted envelope key that is stored as object metadata is used during the decryption process. 

**Note**  
If you get a cipher encryption error message when you use the encryption API for the first time, your version of the JDK may have a Java Cryptography Extension (JCE) jurisdiction policy file that limits the maximum key length for encryption and decryption transformations to 128 bits. For information about addressing this issue, go to [Specifying Client-Side Encryption Using the AWS SDK for Java](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingClientSideEncryptionUpload.html) in the *Amazon Simple Storage Service User Guide*. 

For information about loading client-side encrypted files into your Amazon Redshift tables using the COPY command, see [Loading encrypted data files from Amazon S3](c_loading-encrypted-files.md).

## Example: Uploading client-side encrypted data
<a name="client-side-encryption-example"></a>

For an example of how to use the AWS SDK for Java to upload client-side encrypted data, go to [Protecting data using client-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/encrypt-client-side-symmetric-master-key.html) in the *Amazon Simple Storage Service User Guide*. 

The second option shows the choices you must make during client-side encryption so that the data can be loaded into Amazon Redshift. Specifically, the example shows using object metadata to store the encrypted envelope key and the use of a 256-bit AES root symmetric key. 

This example provides code using the AWS SDK for Java to create a 256-bit AES symmetric root key and save it to a file. The example then uploads an object to Amazon S3 using an S3 encryption client that first encrypts sample data on the client side. The example also downloads the object and verifies that the data is the same.

# Verifying that the correct files are present in your bucket
<a name="verifying-that-correct-files-are-present"></a>

After you upload your files to your Amazon S3 bucket, we recommend listing the contents of the bucket to verify that all of the correct files are present and that no unwanted files are present. For example, if the bucket `amzn-s3-demo-bucket` holds a file named `venue.txt.back`, that file will be loaded, perhaps unintentionally, by the following command:

```
COPY venue FROM 's3://amzn-s3-demo-bucket/venue' … ;
```

If you want to control specifically which files are loaded, you can use a manifest file to explicitly list the data files. For more information about using a manifest file, see the [copy_from_s3_manifest_file](copy-parameters-data-source-s3.md#copy-manifest-file) option for the COPY command and [Using a manifest to specify data files](r_COPY_command_examples.md#copy-command-examples-manifest) in the COPY examples. 

For more information about listing the contents of the bucket, see [Listing Object Keys](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ListingKeysUsingAPIs.html) in the *Amazon Simple Storage Service User Guide*.

# Using the COPY command to load from Amazon S3
<a name="t_loading-tables-from-s3"></a>

Use the [COPY](r_COPY.md) command to load a table in parallel from data files on Amazon S3. You can specify the files to be loaded by using an Amazon S3 object prefix or by using a manifest file.

The syntax to specify the files to be loaded by using a prefix is as follows:

```
COPY <table_name> FROM 's3://<bucket_name>/<object_prefix>'
authorization;
```

 The manifest file is a JSON-formatted file that lists the data files to be loaded. The syntax to specify the files to be loaded by using a manifest file is as follows:

```
COPY <table_name> FROM 's3://<bucket_name>/<manifest_file>'
authorization
MANIFEST;
```

The table to be loaded must already exist in the database. For information about creating a table, see [CREATE TABLE](r_CREATE_TABLE_NEW.md) in the SQL Reference. 

The values for *authorization* provide the AWS authorization Amazon Redshift needs to access the Amazon S3 objects. For information about required permissions, see [IAM permissions for COPY, UNLOAD, and CREATE LIBRARY](copy-usage_notes-access-permissions.md#copy-usage_notes-iam-permissions). The preferred method for authentication is to specify the IAM\_ROLE parameter and provide the Amazon Resource Name (ARN) for an IAM role with the necessary permissions. For more information, see [Role-based access control](copy-usage_notes-access-permissions.md#copy-usage_notes-access-role-based). 

To authenticate using the IAM\_ROLE parameter, replace *<aws-account-id>* and *<role-name>* as shown in the following syntax. 

```
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
```

The following example shows authentication using an IAM role.

```
COPY customer 
FROM 's3://amzn-s3-demo-bucket/mydata' 
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
```

For more information about other authorization options, see [Authorization parameters](copy-parameters-authorization.md).

If you want to validate your data without actually loading the table, use the NOLOAD option with the [COPY](r_COPY.md) command.

The following example shows the first few rows of pipe-delimited data in a file named `venue.txt`.

```
1|Toyota Park|Bridgeview|IL|0
2|Columbus Crew Stadium|Columbus|OH|0
3|RFK Stadium|Washington|DC|0
```

Before uploading the file to Amazon S3, split the file into multiple files so that the COPY command can load it using parallel processing. The number of files should be a multiple of the number of slices in your cluster. Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For more information, see [Loading data from compressed and uncompressed files](t_splitting-data-files.md).

For example, the `venue.txt` file might be split into four files, as follows:

```
venue.txt.1
venue.txt.2
venue.txt.3
venue.txt.4
```
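The split itself can be as simple as distributing lines round-robin across the target files. A sketch, assuming the data fits in memory (for large files, a streaming tool such as the Linux `split` command is more practical):

```python
# Sketch: split a delimited load file into N roughly equal parts
# (round-robin by line) so that COPY can load them in parallel.
def split_lines(lines, num_parts):
    """Distribute lines across num_parts lists of nearly equal size."""
    parts = [[] for _ in range(num_parts)]
    for i, line in enumerate(lines):
        parts[i % num_parts].append(line)
    return parts

rows = [f"{i}|venue {i}" for i in range(10)]
for n, part in enumerate(split_lines(rows, 4), start=1):
    print(f"venue.txt.{n}: {len(part)} rows")
```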

The following COPY command loads the VENUE table using the pipe-delimited data in the data files with the prefix 'venue' in the Amazon S3 bucket `amzn-s3-demo-bucket`. 

**Note**  
The Amazon S3 bucket `amzn-s3-demo-bucket` in the following examples does not exist. For sample COPY commands that use real data in an existing Amazon S3 bucket, see [Load sample data](https://docs.aws.amazon.com/redshift/latest/gsg/cm-dev-t-load-sample-data.html).

```
COPY venue FROM 's3://amzn-s3-demo-bucket/venue'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
DELIMITER '|';
```

If no Amazon S3 objects with the key prefix 'venue' exist, the load fails.

**Topics**
+ [Using a manifest to specify data files](loading-data-files-using-manifest.md)
+ [Loading compressed data files from Amazon S3](t_loading-gzip-compressed-data-files-from-S3.md)
+ [Loading fixed-width data from Amazon S3](t_loading_fixed_width_data.md)
+ [Loading multibyte data from Amazon S3](t_loading_unicode_data.md)
+ [Loading encrypted data files from Amazon S3](c_loading-encrypted-files.md)

# Using a manifest to specify data files
<a name="loading-data-files-using-manifest"></a>

You can use a manifest to make sure that the COPY command loads all of the required files, and only the required files, for a data load. You can use a manifest to load files from different buckets or files that do not share the same prefix. Instead of supplying an object path for the COPY command, you supply the name of a JSON-formatted text file that explicitly lists the files to be loaded. The URL in the manifest must specify the bucket name and full object path for the file, not just a prefix.

For more information about manifest files, see the COPY example [Using a manifest to specify data files](r_COPY_command_examples.md#copy-command-examples-manifest).

The following example shows the JSON to load files from different buckets and with file names that begin with date stamps.

```
{
  "entries": [
    {"url":"s3://amzn-s3-demo-bucket1/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://amzn-s3-demo-bucket1/2013-10-05-custdata", "mandatory":true},
    {"url":"s3://amzn-s3-demo-bucket2/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://amzn-s3-demo-bucket2/2013-10-05-custdata", "mandatory":true}
  ]
}
```

The optional `mandatory` flag specifies whether COPY should return an error if the file is not found. The default value of `mandatory` is `false`. Regardless of any `mandatory` settings, COPY terminates if no files are found. 

The following example runs the COPY command with the manifest in the previous example, which is named `cust.manifest`. 

```
COPY customer
FROM 's3://amzn-s3-demo-bucket/cust.manifest' 
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
MANIFEST;
```
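Manifests like the one shown earlier are often generated programmatically from a list of buckets and date stamps. A sketch that rebuilds the same four-entry manifest (bucket and date values taken from the example):

```python
import json

# Sketch: build a COPY manifest for date-stamped files across two
# buckets. mandatory=True makes a missing file a COPY error rather
# than a silent skip.
buckets = ["amzn-s3-demo-bucket1", "amzn-s3-demo-bucket2"]
dates = ["2013-10-04", "2013-10-05"]

manifest = {
    "entries": [
        {"url": f"s3://{bucket}/{date}-custdata", "mandatory": True}
        for bucket in buckets
        for date in dates
    ]
}
print(json.dumps(manifest, indent=2))
```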

## Using a manifest created by UNLOAD
<a name="loading-data-files-using-unload-manifest"></a>

A manifest created by an [UNLOAD](r_UNLOAD.md) operation using the MANIFEST parameter might have keys that are not required for the COPY operation. For example, the following `UNLOAD` manifest includes a `meta` key that is required for an Amazon Redshift Spectrum external table and for loading data files in an `ORC` or `Parquet` file format. The `meta` key contains a `content_length` key with a value that is the actual size of the file in bytes. The COPY operation requires only the `url` key and an optional `mandatory` key.

```
{
  "entries": [
    {"url":"s3://amzn-s3-demo-bucket/unload/manifest_0000_part_00", "meta": { "content_length": 5956875 }},
    {"url":"s3://amzn-s3-demo-bucket/unload/unload/manifest_0001_part_00", "meta": { "content_length": 5997091 }}
 ]
}
```
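Because COPY uses only the `url` key and the optional `mandatory` key, one approach is to strip the extra keys before reusing an UNLOAD manifest. A sketch of that filtering, using the file sizes from the example above:

```python
import json

# Sketch: reduce an UNLOAD manifest to the keys COPY uses
# (url, plus mandatory if present), dropping extras such as "meta".
unload_manifest = {
    "entries": [
        {"url": "s3://amzn-s3-demo-bucket/unload/manifest_0000_part_00",
         "meta": {"content_length": 5956875}},
        {"url": "s3://amzn-s3-demo-bucket/unload/manifest_0001_part_00",
         "meta": {"content_length": 5997091}},
    ]
}

def copy_manifest(manifest):
    """Keep only the keys that the COPY command recognizes."""
    keep = ("url", "mandatory")
    return {"entries": [
        {k: entry[k] for k in keep if k in entry}
        for entry in manifest["entries"]
    ]}

print(json.dumps(copy_manifest(unload_manifest), indent=2))
```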

For more information about manifest files, see [Using a manifest to specify data files](r_COPY_command_examples.md#copy-command-examples-manifest).

# Loading compressed data files from Amazon S3
<a name="t_loading-gzip-compressed-data-files-from-S3"></a>

To load data files that are compressed using gzip, lzop, or bzip2, include the corresponding option: GZIP, LZOP, or BZIP2.

For example, the following command loads from files that were compressed using lzop.

```
COPY customer FROM 's3://amzn-s3-demo-bucket/customer.lzo' 
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
DELIMITER '|' LZOP;
```

**Note**  
The COPY command doesn't support files that were compressed using the lzop *--filter* option.

# Loading fixed-width data from Amazon S3
<a name="t_loading_fixed_width_data"></a>

Fixed-width data files have uniform lengths for each column of data. Each field in a fixed-width data file has exactly the same length and position. For character data (CHAR and VARCHAR) in a fixed-width data file, you must include leading or trailing spaces as placeholders in order to keep the width uniform. For integers, you must use leading zeros as placeholders. A fixed-width data file has no delimiter to separate columns.

To load a fixed-width data file into an existing table, use the FIXEDWIDTH parameter in the COPY command. Your table specifications must match the value of *fixedwidth\_spec* in order for the data to load correctly.

To load fixed-width data from a file to a table, issue the following command:

```
COPY table_name FROM 's3://amzn-s3-demo-bucket/prefix' 
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole' 
FIXEDWIDTH 'fixedwidth_spec';
```

The *fixedwidth\_spec* parameter is a string that contains an identifier for each column and the width of each column, separated by a colon. The **column:width** pairs are delimited by commas. The identifier can be anything that you choose: numbers, letters, or a combination of the two. However, the identifier has no relation to the table itself, so the specification must list the columns in the same order as the table.

The following two examples show the same specification, with the first using numeric identifiers and the second using string identifiers:

```
'0:3,1:25,2:12,3:2,4:6'
```

```
'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6'
```
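The specification maps directly onto string slicing: each field occupies a fixed character range determined by the cumulative widths. A sketch that parses the string-identifier spec above and slices one record (the record value here is constructed for illustration):

```python
# Sketch: parse a fixedwidth_spec string and slice one record
# into named fields using the declared widths.
def parse_spec(spec):
    """Return (name, width) pairs from a 'name:width,...' string."""
    return [(name, int(width))
            for name, width in (pair.split(":") for pair in spec.split(","))]

def slice_record(line, spec):
    """Cut a fixed-width line into a dict of trimmed field values."""
    fields, pos = {}, 0
    for name, width in parse_spec(spec):
        fields[name] = line[pos:pos + width].strip()
        pos += width
    return fields

spec = "venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6"
record = ("1".ljust(3) + "Toyota Park".ljust(25)
          + "Bridgeview".ljust(12) + "IL" + "0".ljust(6))
print(slice_record(record, spec))
```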

The following example shows fixed-width sample data that could be loaded into the VENUE table using the preceding specifications:

```
1  Toyota Park               Bridgeview  IL0
2  Columbus Crew Stadium     Columbus    OH0
3  RFK Stadium               Washington  DC0
4  CommunityAmerica Ballpark Kansas City KS0
5  Gillette Stadium          Foxborough  MA68756
```

The following COPY command loads this data set into the VENUE table:

```
COPY venue
FROM 's3://amzn-s3-demo-bucket/data/venue_fw.txt' 
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole' 
FIXEDWIDTH 'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6';
```

# Loading multibyte data from Amazon S3
<a name="t_loading_unicode_data"></a>

If your data includes non-ASCII multibyte characters (such as Chinese or Cyrillic characters), you must load the data to VARCHAR columns. The VARCHAR data type supports four-byte UTF-8 characters, but the CHAR data type only accepts single-byte ASCII characters. You cannot load five-byte or longer characters into Amazon Redshift tables. For more information about CHAR and VARCHAR, see [Data types](c_Supported_data_types.md).

To check which encoding an input file uses, use the Linux `file` command: 

```
$ file ordersdata.txt
ordersdata.txt: ASCII English text
$ file uni_ordersdata.dat
uni_ordersdata.dat: UTF-8 Unicode text
```
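You can also check a value directly: the per-character UTF-8 byte length tells you whether it fits the single-byte CHAR type or needs VARCHAR. A sketch (the sample strings are illustrative):

```python
# Sketch: check the UTF-8 byte length of each character to decide
# whether a value fits CHAR (single-byte ASCII only) or needs
# VARCHAR (up to four bytes per character in Amazon Redshift).
def max_char_bytes(text):
    """Return the widest UTF-8 encoding, in bytes, of any character."""
    return max(len(ch.encode("utf-8")) for ch in text)

print(max_char_bytes("orders"))     # ASCII: 1 byte per character
print(max_char_bytes("注文データ"))  # CJK characters: multibyte, needs VARCHAR
```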

# Loading encrypted data files from Amazon S3
<a name="c_loading-encrypted-files"></a>

You can use the COPY command to load data files that were uploaded to Amazon S3 using server-side encryption, client-side encryption, or both. 

The COPY command supports the following types of Amazon S3 encryption:
+ Server-side encryption with Amazon S3-managed keys (SSE-S3)
+ Server-side encryption with AWS KMS keys (SSE-KMS)
+ Client-side encryption using a client-side symmetric root key

The COPY command doesn't support the following types of Amazon S3 encryption:
+ Server-side encryption with customer-provided keys (SSE-C)
+ Client-side encryption using an AWS KMS key
+ Client-side encryption using a customer-provided asymmetric root key

For more information about Amazon S3 encryption, see [ Protecting Data Using Server-Side Encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html) and [Protecting Data Using Client-Side Encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingClientSideEncryption.html) in the Amazon Simple Storage Service User Guide.

The [UNLOAD](r_UNLOAD.md) command automatically encrypts files using SSE-S3. You can also unload using SSE-KMS or client-side encryption with a customer managed symmetric key. For more information, see [Unloading encrypted data files](t_unloading_encrypted_files.md).

The COPY command automatically recognizes and loads files encrypted using SSE-S3 and SSE-KMS. You can load files encrypted using a client-side symmetric root key by specifying the ENCRYPTED option and providing the key value. For more information, see [Uploading encrypted data to Amazon S3](t_uploading-encrypted-data.md).

To load client-side encrypted data files, provide the root key value using the MASTER\_SYMMETRIC\_KEY parameter and include the ENCRYPTED option.

```
COPY customer FROM 's3://amzn-s3-demo-bucket/encrypted/customer' 
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
MASTER_SYMMETRIC_KEY '<root_key>' 
ENCRYPTED
DELIMITER '|';
```

To load encrypted data files that are gzip, lzop, or bzip2 compressed, include the GZIP, LZOP, or BZIP2 option along with the root key value and the ENCRYPTED option.

```
COPY customer FROM 's3://amzn-s3-demo-bucket/encrypted/customer' 
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
MASTER_SYMMETRIC_KEY '<root_key>'
ENCRYPTED 
DELIMITER '|' 
GZIP;
```

# Loading data from Amazon EMR
<a name="loading-data-from-emr"></a>

You can use the COPY command to load data in parallel from an Amazon EMR cluster configured to write text files to the cluster's Hadoop Distributed File System (HDFS) as fixed-width files, character-delimited files, CSV files, or JSON-formatted files.

## Process for loading data from Amazon EMR
<a name="load-from-emr-process"></a>

This section walks you through the process of loading data from an Amazon EMR cluster. The following sections provide the details that you need to accomplish each step.
+ **[Step 1: Configure IAM permissions](#load-from-emr-steps-configure-iam)**

  The users that create the Amazon EMR cluster and run the Amazon Redshift COPY command must have the necessary permissions.
+ **[Step 2: Create an Amazon EMR cluster](#load-from-emr-steps-create-cluster)**

  Configure the cluster to output text files to the Hadoop Distributed File System (HDFS). You will need the Amazon EMR cluster ID and the cluster's main public DNS (the endpoint for the Amazon EC2 instance that hosts the cluster). 
+ **[Step 3: Retrieve the Amazon Redshift cluster public key and cluster node IP addresses](#load-from-emr-steps-retrieve-key-and-ips)**

  The public key enables the Amazon Redshift cluster nodes to establish SSH connections to the hosts. You will use the IP address for each cluster node to configure the host security groups to permit access from your Amazon Redshift cluster using these IP addresses. 
+ **[Step 4: Add the Amazon Redshift cluster public key to each Amazon EC2 host's authorized keys file](#load-from-emr-steps-add-key-to-host)** 

  You add the Amazon Redshift cluster public key to the host's authorized keys file so that the host will recognize the Amazon Redshift cluster and accept the SSH connection. 
+ **[Step 5: Configure the hosts to accept all of the Amazon Redshift cluster's IP addresses](#load-from-emr-steps-configure-security-groups)** 

  Modify the Amazon EMR instance's security groups to add input rules to accept the Amazon Redshift IP addresses.
+ **[Step 6: Run the COPY command to load the data](#load-from-emr-steps-run-copy)**

  From an Amazon Redshift database, run the COPY command to load the data into an Amazon Redshift table. 

## Step 1: Configure IAM permissions
<a name="load-from-emr-steps-configure-iam"></a>

The users that create the Amazon EMR cluster and run the Amazon Redshift COPY command must have the necessary permissions.

**To configure IAM permissions**

1. Add the following permissions for the user that will create the Amazon EMR cluster.

   ```
   ec2:DescribeSecurityGroups
   ec2:RevokeSecurityGroupIngress
   ec2:AuthorizeSecurityGroupIngress
   redshift:DescribeClusters
   ```

1. Add the following permission for the IAM role or user that will run the COPY command.

   ```
   elasticmapreduce:ListInstances
   ```

1. Add the following permission to the Amazon EMR cluster's IAM role.

   ```
   redshift:DescribeClusters
   ```

## Step 2: Create an Amazon EMR cluster
<a name="load-from-emr-steps-create-cluster"></a>

The COPY command loads data from files on the Amazon EMR Hadoop Distributed File System (HDFS). When you create the Amazon EMR cluster, configure the cluster to output data files to the cluster's HDFS.

**To create an Amazon EMR cluster**

1. Create an Amazon EMR cluster in the same AWS Region as the Amazon Redshift cluster. 

   If the Amazon Redshift cluster is in a VPC, the Amazon EMR cluster must be in the same VPC group. If the Amazon Redshift cluster uses EC2-Classic mode (that is, it is not in a VPC), the Amazon EMR cluster must also use EC2-Classic mode. For more information, see [Managing Clusters in Virtual Private Cloud (VPC)](https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-vpc.html) in the *Amazon Redshift Management Guide*.

1. Configure the cluster to output data files to the cluster's HDFS.
**Important**  
The HDFS file names must not include asterisks (\*) or question marks (?).

1. Specify **No** for the **Auto-terminate** option in the Amazon EMR cluster configuration so that the cluster remains available while the COPY command runs. 
**Important**  
If any of the data files are changed or deleted before the COPY completes, you might have unexpected results, or the COPY operation might fail. 

1. Note the cluster ID and the main public DNS (the endpoint for the Amazon EC2 instance that hosts the cluster). You will use that information in later steps. 

## Step 3: Retrieve the Amazon Redshift cluster public key and cluster node IP addresses
<a name="load-from-emr-steps-retrieve-key-and-ips"></a>

You will use the IP address for each cluster node to configure the host security groups to permit access from your Amazon Redshift cluster using these IP addresses.

**To retrieve the Amazon Redshift cluster public key and cluster node IP addresses for your cluster using the console**

1. Access the Amazon Redshift Management Console. 

1. Choose the **Clusters** link in the navigation pane. 

1. Select your cluster from the list. 

1. Locate the **SSH Ingestion Settings** group. 

   Note the **Cluster Public Key** and **Node IP addresses**. You will use them in later steps.   
![\[Screenshot from the SSH Ingestion Settings group showing the Cluster Public Key and Node IP addresses.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/copy-from-ssh-console-2.png)

   You will use the private IP addresses in Step 5 to configure the Amazon EC2 hosts to accept the connection from Amazon Redshift. 

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon Redshift CLI, run the describe-clusters command. For example:

```
aws redshift describe-clusters --cluster-identifier <cluster-identifier> 
```

The response will include a ClusterPublicKey value and the list of private and public IP addresses, similar to the following:

```
{
    "Clusters": [
        {
            "VpcSecurityGroups": [], 
            "ClusterStatus": "available", 
            "ClusterNodes": [
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn", 
                    "NodeRole": "LEADER", 
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                }, 
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn", 
                    "NodeRole": "COMPUTE-0", 
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                }, 
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn", 
                    "NodeRole": "COMPUTE-1", 
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                }
            ], 
            "AutomatedSnapshotRetentionPeriod": 1, 
            "PreferredMaintenanceWindow": "wed:05:30-wed:06:00", 
            "AvailabilityZone": "us-east-1a", 
            "NodeType": "dc2.large", 
            "ClusterPublicKey": "ssh-rsa AAAABexamplepublickey...Y3TAl Amazon-Redshift", 
             ...
             ...
}
```
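When scripting the security-group setup, you can pull the node IP addresses out of the `describe-clusters` response programmatically. A sketch over an abbreviated response with hypothetical addresses:

```python
import json

# Sketch: extract node IP addresses from an (abbreviated,
# hypothetical) describe-clusters response.
response = json.loads("""
{"Clusters": [{"ClusterNodes": [
  {"PrivateIPAddress": "10.0.0.1", "NodeRole": "LEADER",
   "PublicIPAddress": "54.0.0.1"},
  {"PrivateIPAddress": "10.0.0.2", "NodeRole": "COMPUTE-0",
   "PublicIPAddress": "54.0.0.2"}
]}]}
""")

private_ips = [node["PrivateIPAddress"]
               for node in response["Clusters"][0]["ClusterNodes"]]
print(private_ips)
```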

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon Redshift API, use the `DescribeClusters` action. For more information, see [describe-clusters](https://docs.aws.amazon.com/cli/latest/reference/redshift/describe-clusters.html) in the *Amazon Redshift CLI Guide* or [DescribeClusters](https://docs.aws.amazon.com/redshift/latest/APIReference/API_DescribeClusters.html) in the Amazon Redshift API Guide. 

## Step 4: Add the Amazon Redshift cluster public key to each Amazon EC2 host's authorized keys file
<a name="load-from-emr-steps-add-key-to-host"></a>

You add the cluster public key to each host's authorized keys file for all of the Amazon EMR cluster nodes so that the hosts will recognize Amazon Redshift and accept the SSH connection. 

**To add the Amazon Redshift cluster public key to the host's authorized keys file**

1. Access the host using an SSH connection. 

   For information about connecting to an instance using SSH, see [Connect to Your Instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-connect-to-instance-linux.html) in the *Amazon EC2 User Guide*. 

1. Copy the Amazon Redshift public key from the console or from the CLI response text. 

1. Copy and paste the contents of the public key into the `/home/<ssh_username>/.ssh/authorized_keys` file on the host. Include the complete string, including the prefix "`ssh-rsa` " and suffix "`Amazon-Redshift`". For example: 

   ```
   ssh-rsa AAAACTP3isxgGzVWoIWpbVvRCOzYdVifMrh… uA70BnMHCaMiRdmvsDOedZDOedZ Amazon-Redshift
   ```

## Step 5: Configure the hosts to accept all of the Amazon Redshift cluster's IP addresses
<a name="load-from-emr-steps-configure-security-groups"></a>

 To allow inbound traffic to the host instances, edit the security group and add one Inbound rule for each Amazon Redshift cluster node. For **Type**, select SSH with TCP protocol on Port 22. For **Source**, enter the Amazon Redshift cluster node private IP addresses you retrieved in [Step 3: Retrieve the Amazon Redshift cluster public key and cluster node IP addresses](#load-from-emr-steps-retrieve-key-and-ips). For information about adding rules to an Amazon EC2 security group, see [Authorizing Inbound Traffic for Your Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html) in the *Amazon EC2 User Guide*. 

## Step 6: Run the COPY command to load the data
<a name="load-from-emr-steps-run-copy"></a>

Run a [COPY](r_COPY.md) command to connect to the Amazon EMR cluster and load the data into an Amazon Redshift table. The Amazon EMR cluster must continue running until the COPY command completes. For example, do not configure the cluster to auto-terminate. 

**Important**  
If any of the data files are changed or deleted before the COPY completes, you might have unexpected results, or the COPY operation might fail.

In the COPY command, specify the Amazon EMR cluster ID and the HDFS file path and file name. 

```
COPY sales
FROM 'emr://myemrclusterid/myoutput/part*'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
```

You can use the wildcard characters asterisk ( `*` ) and question mark ( `?` ) as part of the file name argument. For example, `part*` loads the files `part-0000`, `part-0001`, and so on. If you specify only a folder name, COPY attempts to load all files in the folder.

**Important**  
If you use wildcard characters or use only the folder name, verify that no unwanted files will be loaded or the COPY command will fail. For example, some processes might write a log file to the output folder.
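The `*` and `?` wildcards behave like shell-style glob patterns, which Python's `fnmatch` module approximates. A sketch over a hypothetical HDFS listing:

```python
import fnmatch

# Sketch: shell-style glob matching approximates how the * and ?
# wildcards in a COPY file-name argument select files.
# The file listing below is hypothetical.
files = ["part-0000", "part-0001", "part-0002", "output.log"]
matched = fnmatch.filter(files, "part*")
print(matched)  # the log file is excluded
```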

# Loading data from remote hosts
<a name="loading-data-from-remote-hosts"></a>

You can use the COPY command to load data in parallel from one or more remote hosts, such as Amazon EC2 instances or other computers. COPY connects to the remote hosts using SSH and runs commands on the remote hosts to generate text output. 

The remote host can be an Amazon EC2 Linux instance or another Unix or Linux computer configured to accept SSH connections. This guide assumes your remote host is an Amazon EC2 instance. Where the procedure is different for another computer, the guide will point out the difference. 

Amazon Redshift can connect to multiple hosts, and can open multiple SSH connections to each host. Amazon Redshift sends a unique command through each connection to generate text output to the host's standard output, which Amazon Redshift then reads as it would a text file.

## Before you begin
<a name="load-from-host-before-you-begin"></a>

Before you begin, you should have the following in place: 
+ One or more host machines, such as Amazon EC2 instances, that you can connect to using SSH.
+ Data sources on the hosts. 

  You will provide commands that the Amazon Redshift cluster will run on the hosts to generate the text output. After the cluster connects to a host, the COPY command runs the commands, reads the text from the hosts' standard output, and loads the data in parallel into an Amazon Redshift table. The text output must be in a form that the COPY command can ingest. For more information, see [Preparing your input data](t_preparing-input-data.md).
+ Access to the hosts from your computer. 

  For an Amazon EC2 instance, you will use an SSH connection to access the host. You must access the host to add the Amazon Redshift cluster's public key to the host's authorized keys file.
+ A running Amazon Redshift cluster. 

  For information about how to launch a cluster, see [Amazon Redshift Getting Started Guide](https://docs.aws.amazon.com/redshift/latest/gsg/). 

## Loading data process
<a name="load-from-host-process"></a>

This section walks you through the process of loading data from remote hosts. The following sections provide the details that you need to accomplish each step.
+ **[Step 1: Retrieve the cluster public key and cluster node IP addresses](#load-from-host-steps-retrieve-key-and-ips)**

  The public key enables the Amazon Redshift cluster nodes to establish SSH connections to the remote hosts. You will use the IP address for each cluster node to configure the host security groups or firewall to permit access from your Amazon Redshift cluster using these IP addresses. 
+ **[Step 2: Add the Amazon Redshift cluster public key to the host's authorized keys file](#load-from-host-steps-add-key-to-host)**

  You add the Amazon Redshift cluster public key to the host's authorized keys file so that the host will recognize the Amazon Redshift cluster and accept the SSH connection. 
+ **[Step 3: Configure the host to accept all of the Amazon Redshift cluster's IP addresses](#load-from-host-steps-configure-security-groups)** 

  For Amazon EC2, modify the instance's security groups to add input rules to accept the Amazon Redshift IP addresses. For other hosts, modify the firewall so that your Amazon Redshift nodes are able to establish SSH connections to the remote host. 
+ **[Step 4: Get the public key for the host](#load-from-host-steps-get-the-host-key)**

  You can optionally specify that Amazon Redshift should use the public key to identify the host. You must locate the public key and copy the text into your manifest file. 
+ **[Step 5: Create a manifest file](#load-from-host-steps-create-manifest)** 

  The manifest is a JSON-formatted text file with the details Amazon Redshift needs to connect to the hosts and fetch the data. 
+ **[Step 6: Upload the manifest file to an Amazon S3 bucket](#load-from-host-steps-upload-manifest)** 

  Amazon Redshift reads the manifest and uses that information to connect to the remote host. If the Amazon S3 bucket does not reside in the same Region as your Amazon Redshift cluster, you must use the [REGION](copy-parameters-data-source-s3.md#copy-region) option to specify the Region in which the data is located.
+ **[Step 7: Run the COPY command to load the data](#load-from-host-steps-run-copy)**

  From an Amazon Redshift database, run the COPY command to load the data into an Amazon Redshift table. 

## Step 1: Retrieve the cluster public key and cluster node IP addresses
<a name="load-from-host-steps-retrieve-key-and-ips"></a>

You will use the IP address for each cluster node to configure the host security groups to permit access from your Amazon Redshift cluster using these IP addresses.

**To retrieve the cluster public key and cluster node IP addresses for your cluster using the console**

1. Access the Amazon Redshift Management Console.

1. Choose the **Clusters** link in the navigation pane.

1. Select your cluster from the list. 

1. Locate the **SSH Ingestion Settings** group.

   Note the **Cluster Public Key** and **Node IP addresses**. You will use them in later steps.  
![\[Screenshot from the SSH Ingestion Settings group showing the Cluster Public Key and Node IP addresses.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/copy-from-ssh-console-2.png)

   You will use the IP addresses in Step 3 to configure the host to accept the connection from Amazon Redshift. Depending on what type of host you connect to and whether it is in a VPC, you will use either the public IP addresses or the private IP addresses.

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon Redshift CLI, run the describe-clusters command. 

For example: 

```
aws redshift describe-clusters --cluster-identifier <cluster-identifier> 
```

 The response will include the ClusterPublicKey and the list of private and public IP addresses, similar to the following: 

```
{
    "Clusters": [
        {
            "VpcSecurityGroups": [], 
            "ClusterStatus": "available", 
            "ClusterNodes": [
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn", 
                    "NodeRole": "LEADER", 
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                }, 
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn", 
                    "NodeRole": "COMPUTE-0", 
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                }, 
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn", 
                    "NodeRole": "COMPUTE-1", 
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                }
            ], 
            "AutomatedSnapshotRetentionPeriod": 1, 
            "PreferredMaintenanceWindow": "wed:05:30-wed:06:00", 
            "AvailabilityZone": "us-east-1a", 
            "NodeType": "dc2.large", 
            "ClusterPublicKey": "ssh-rsa AAAABexamplepublickey...Y3TAl Amazon-Redshift", 
             ...
             ...
}
```

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon Redshift API, use the DescribeClusters action. For more information, see [describe-clusters](https://docs.aws.amazon.com/cli/latest/reference/redshift/describe-clusters.html) in the *Amazon Redshift CLI Guide* or [DescribeClusters](https://docs.aws.amazon.com/redshift/latest/APIReference/API_DescribeClusters.html) in the Amazon Redshift API Guide. 

## Step 2: Add the Amazon Redshift cluster public key to the host's authorized keys file
<a name="load-from-host-steps-add-key-to-host"></a>

You add the cluster public key to each host's authorized keys file so that the host will recognize Amazon Redshift and accept the SSH connection. 

**To add the Amazon Redshift cluster public key to the host's authorized keys file**

1. Access the host using an SSH connection. 

   For information about connecting to an instance using SSH, see [Connect to Your Instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-connect-to-instance-linux.html) in the *Amazon EC2 User Guide*. 

1. Copy the Amazon Redshift public key from the console or from the CLI response text. 

1. Copy and paste the contents of the public key into the `/home/<ssh_username>/.ssh/authorized_keys` file on the remote host. The `<ssh_username>` must match the value for the "username" field in the manifest file. Include the complete string, including the prefix "`ssh-rsa` " and suffix "`Amazon-Redshift`". For example: 

   ```
   ssh-rsa AAAACTP3isxgGzVWoIWpbVvRCOzYdVifMrh… uA70BnMHCaMiRdmvsDOedZDOedZ Amazon-Redshift
   ```

## Step 3: Configure the host to accept all of the Amazon Redshift cluster's IP addresses
<a name="load-from-host-steps-configure-security-groups"></a>

 If you are working with an Amazon EC2 instance or an Amazon EMR cluster, add Inbound rules to the host's security group to allow traffic from each Amazon Redshift cluster node. For **Type**, select SSH with TCP protocol on Port 22. For **Source**, enter the Amazon Redshift cluster node IP addresses you retrieved in [Step 1: Retrieve the cluster public key and cluster node IP addresses](#load-from-host-steps-retrieve-key-and-ips). For information about adding rules to an Amazon EC2 security group, see [Authorizing Inbound Traffic for Your Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html) in the *Amazon EC2 User Guide*. 

Use the private IP addresses when: 
+ You have an Amazon Redshift cluster that is not in a Virtual Private Cloud (VPC), and an Amazon EC2-Classic instance, both of which are in the same AWS Region. 
+ You have an Amazon Redshift cluster that is in a VPC, and an Amazon EC2-VPC instance, both of which are in the same AWS Region and in the same VPC.

 Otherwise, use the public IP addresses.
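The choice above can be summarized as a small decision rule. The following Python sketch is illustrative only; the function name and parameters are hypothetical, not part of any AWS API:

```python
# Hypothetical helper illustrating the IP-selection guidance above.
def ip_type_to_use(same_region: bool, cluster_in_vpc: bool,
                   instance_in_vpc: bool, same_vpc: bool = False) -> str:
    """Return 'private' or 'public' per the rules described above."""
    if same_region and not cluster_in_vpc and not instance_in_vpc:
        return "private"  # EC2-Classic cluster and instance in the same Region
    if same_region and cluster_in_vpc and instance_in_vpc and same_vpc:
        return "private"  # both in the same VPC and the same Region
    return "public"       # all other combinations

print(ip_type_to_use(same_region=True, cluster_in_vpc=True,
                     instance_in_vpc=True, same_vpc=True))  # private
```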

For more information about using Amazon Redshift in a VPC, see [Managing Clusters in Virtual Private Cloud (VPC)](https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-vpc.html) in the *Amazon Redshift Management Guide*. 

## Step 4: Get the public key for the host
<a name="load-from-host-steps-get-the-host-key"></a>

You can optionally provide the host's public key in the manifest file so that Amazon Redshift can identify the host. The COPY command does not require the host public key but, for security reasons, we strongly recommend using a public key to help prevent 'man-in-the-middle' attacks. 

You can find the host's public key in the following location, where `<ssh_host_rsa_key_name>` is the unique name for the host's public key: 

```
/etc/ssh/<ssh_host_rsa_key_name>.pub
```

**Note**  
Amazon Redshift only supports RSA keys. We do not support DSA keys.

When you create your manifest file in Step 5, you will paste the text of the public key into the "Public Key" field in the manifest file entry.

## Step 5: Create a manifest file
<a name="load-from-host-steps-create-manifest"></a>

The COPY command can connect to multiple hosts using SSH, and can create multiple SSH connections to each host. COPY runs a command through each host connection, and then loads the output from the commands in parallel into the table. The manifest file is a text file in JSON format that Amazon Redshift uses to connect to the host. The manifest file specifies the SSH host endpoints and the commands that are run on the hosts to return data to Amazon Redshift. Optionally, you can include the host public key, the login user name, and a mandatory flag for each entry.

Create the manifest file on your local computer. In a later step, you upload the file to Amazon S3. 

The manifest file is in the following format:

```
{ 
   "entries": [ 
     {"endpoint":"<ssh_endpoint_or_IP>", 
       "command": "<remote_command>",
       "mandatory":true, 
       "publickey": "<public_key>", 
       "username": "<host_user_name>"}, 
     {"endpoint":"<ssh_endpoint_or_IP>", 
       "command": "<remote_command>",
       "mandatory":true, 
       "publickey": "<public_key>", 
       "username": "<host_user_name>"} 
    ] 
}
```

The manifest file's "entries" array contains one entry for each SSH connection. You can have multiple connections to a single host or multiple connections to multiple hosts. The double quotation marks are required as shown, both for the field names and the values. The only value that does not need double quotation marks is the Boolean value **true** or **false** for the mandatory field. 

The following describes the fields in the manifest file. 

endpoint  
The URL address or IP address of the host. For example, "`ec2-111-222-333.compute-1.amazonaws.com`" or "`22.33.44.56`". 

command   
The command that will be run by the host to generate text or binary (gzip, lzop, or bzip2) output. The command can be any command that the user *"host\_user\_name"* has permission to run. The command can be as simple as printing a file, or it could query a database or launch a script. The output (text file, gzip binary file, lzop binary file, or bzip2 binary file) must be in a form the Amazon Redshift COPY command can ingest. For more information, see [Preparing your input data](t_preparing-input-data.md).

publickey  
(Optional) The public key of the host. If provided, Amazon Redshift will use the public key to identify the host. If the public key is not provided, Amazon Redshift will not attempt host identification. For example, if the remote host's public key is: `ssh-rsa AbcCbaxxx…xxxDHKJ root@amazon.com`, enter the following text in the public key field: `AbcCbaxxx…xxxDHKJ`. 

mandatory  
(Optional) A Boolean value that indicates whether the COPY command should fail if the connection to this host fails. The default is `false`. Regardless of this setting, if Amazon Redshift does not successfully make at least one connection, the COPY command fails.

username  
(Optional) The username that will be used to log on to the host system and run the remote command. The user login name must be the same as the login that was used to add the public key to the host's authorized keys file in Step 2. The default username is "redshift".

The following example shows a completed manifest to open four connections to the same host and run a different command through each connection:

```
{ 
  "entries": [ 
       {"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com", 
          "command": "cat loaddata1.txt", 
          "mandatory":true, 
          "publickey": "ec2publickeyportionoftheec2keypair", 
          "username": "ec2-user"}, 
       {"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com", 
          "command": "cat loaddata2.txt", 
          "mandatory":true, 
          "publickey": "ec2publickeyportionoftheec2keypair", 
          "username": "ec2-user"},
       {"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com", 
          "command": "cat loaddata3.txt", 
          "mandatory":true, 
          "publickey": "ec2publickeyportionoftheec2keypair", 
          "username": "ec2-user"},
       {"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com", 
          "command": "cat loaddata4.txt", 
          "mandatory":true, 
          "publickey": "ec2publickeyportionoftheec2keypair", 
          "username": "ec2-user"}
     ] 
}
```
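For pipelines that generate the manifest, the file can be assembled programmatically. The following Python sketch (the `build_manifest` helper and the host values are illustrative, not part of any AWS tooling) emits a manifest in the format shown above using the standard `json` module:

```python
import json

# Hypothetical script that assembles a COPY-over-SSH manifest. The field
# names ("entries", "endpoint", "command", "mandatory", "publickey",
# "username") come from the manifest format; host values are placeholders.
def build_manifest(hosts):
    entries = []
    for host in hosts:
        entries.append({
            "endpoint": host["endpoint"],           # SSH host name or IP
            "command": host["command"],             # remote command to run
            "mandatory": host.get("mandatory", False),
            "publickey": host.get("publickey", ""),
            "username": host.get("username", "redshift"),
        })
    return json.dumps({"entries": entries}, indent=2)

manifest = build_manifest([
    {"endpoint": "ec2-203-0-113-10.compute-1.amazonaws.com",
     "command": "cat loaddata1.txt", "mandatory": True,
     "username": "ec2-user"},
])
print(manifest)
```

You would then write this string to a file and upload it to Amazon S3 as described in Step 6.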

## Step 6: Upload the manifest file to an Amazon S3 bucket
<a name="load-from-host-steps-upload-manifest"></a>

Upload the manifest file to an Amazon S3 bucket. If the Amazon S3 bucket does not reside in the same AWS Region as your Amazon Redshift cluster, you must use the [REGION](copy-parameters-data-source-s3.md#copy-region) option to specify the AWS Region in which the manifest is located. For information about creating an Amazon S3 bucket and uploading a file, see [Amazon Simple Storage Service User Guide](https://docs.aws.amazon.com/AmazonS3/latest/userguide/). 

## Step 7: Run the COPY command to load the data
<a name="load-from-host-steps-run-copy"></a>

Run a [COPY](r_COPY.md) command to connect to the host and load the data into an Amazon Redshift table. In the COPY command, specify the explicit Amazon S3 object path for the manifest file and include the SSH option. For example, 

```
COPY sales
FROM 's3://amzn-s3-demo-bucket/ssh_manifest'  
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
DELIMITER '|'
SSH;
```

**Note**  
If you use automatic compression, the COPY command performs two data reads, which means it runs the remote command twice. The first read is to provide a sample for compression analysis, then the second read actually loads the data. If running the remote command twice might cause a problem because of potential side effects, you should turn off automatic compression. To turn off automatic compression, run the COPY command with the COMPUPDATE option set to OFF. For more information, see [Loading tables with automatic compression](c_Loading_tables_auto_compress.md). 

# Loading data from an Amazon DynamoDB table
<a name="t_Loading-data-from-dynamodb"></a>

You can use the COPY command to load a table with data from a single Amazon DynamoDB table.

**Important**  
The Amazon DynamoDB table that provides the data must be created in the same AWS Region as your cluster unless you use the [REGION](copy-parameters-data-source-s3.md#copy-region) option to specify the AWS Region in which the Amazon DynamoDB table is located.

The COPY command uses the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from an Amazon DynamoDB table. You can take maximum advantage of parallel processing by setting distribution styles on your Amazon Redshift tables. For more information, see [Data distribution for query optimization](t_Distributing_data.md).

**Important**  
When the COPY command reads data from the Amazon DynamoDB table, the resulting data transfer is part of that table's provisioned throughput.

To avoid consuming excessive amounts of provisioned read throughput, we recommend that you not load data from Amazon DynamoDB tables that are in production environments. If you do load data from production tables, we recommend that you set the READRATIO option much lower than the average percentage of unused provisioned throughput. A low READRATIO setting will help minimize throttling issues. To use the entire provisioned throughput of an Amazon DynamoDB table, set READRATIO to 100.
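To pick a READRATIO along these lines, you can work from the table's average unused throughput. A minimal sketch, with a hypothetical helper name and illustrative numbers:

```python
# Illustrative arithmetic for choosing a conservative READRATIO.
# READRATIO is the percentage of the table's provisioned read
# throughput that COPY is allowed to consume.
def conservative_readratio(avg_consumed_pct: float,
                           safety_margin_pct: float = 20.0) -> int:
    """Stay well under the average unused read capacity."""
    unused_pct = 100.0 - avg_consumed_pct
    return max(1, int(unused_pct - safety_margin_pct))

# A table averaging 60% consumption leaves 40% unused; with a
# 20-point margin, READRATIO 20 is a cautious choice.
print(conservative_readratio(60.0))  # 20
```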

The COPY command matches attribute names in the items retrieved from the DynamoDB table to column names in an existing Amazon Redshift table by using the following rules:
+ Amazon Redshift table columns are case-insensitively matched to Amazon DynamoDB item attributes. If an item in the DynamoDB table contains multiple attributes that differ only in case, such as Price and PRICE, the COPY command will fail.
+ Amazon Redshift table columns that do not match an attribute in the Amazon DynamoDB table are loaded as either NULL or empty, depending on the value specified with the EMPTYASNULL option in the [COPY](r_COPY.md) command.
+ Amazon DynamoDB attributes that do not match a column in the Amazon Redshift table are discarded. Attributes are read before they are matched, and so even discarded attributes consume part of that table's provisioned throughput.
+ Only Amazon DynamoDB attributes with scalar STRING and NUMBER data types are supported. The Amazon DynamoDB BINARY and SET data types are not supported. If a COPY command tries to load an attribute with an unsupported data type, the command will fail. If the attribute does not match an Amazon Redshift table column, COPY does not attempt to load it, and it does not raise an error.
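These matching rules can be illustrated with a small simulation. The following Python sketch is a rough approximation, not Redshift's implementation: it mimics case-insensitive matching, the failure on case-only duplicates, NULL for unmatched columns, and the discarding of unmatched attributes:

```python
# Hypothetical simulation of the COPY attribute-matching rules above.
def match_item_to_row(item: dict, columns: list) -> dict:
    lowered = {}
    for attr, value in item.items():
        key = attr.lower()
        if key in lowered:
            # Mirrors COPY failing on attributes differing only in case.
            raise ValueError(f"ambiguous attributes differing only in case: {attr}")
        lowered[key] = value
    # Unmatched columns become None (NULL); unmatched attributes are dropped.
    return {col: lowered.get(col.lower()) for col in columns}

row = match_item_to_row({"Price": 10, "Title": "foo", "Extra": 1},
                        ["price", "title", "stock"])
print(row)  # {'price': 10, 'title': 'foo', 'stock': None}
```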

The COPY command uses the following syntax to load data from an Amazon DynamoDB table:

```
COPY <redshift_tablename> FROM 'dynamodb://<dynamodb_table_name>'
authorization
readratio '<integer>';
```

The values for *authorization* provide the AWS authorization your cluster needs to access the Amazon DynamoDB table. The permission must include SCAN and DESCRIBE for the Amazon DynamoDB table that is being loaded; if the credentials correspond to a user, that user must have permission to SCAN and DESCRIBE the table. For more information about required permissions, see [IAM permissions for COPY, UNLOAD, and CREATE LIBRARY](copy-usage_notes-access-permissions.md#copy-usage_notes-iam-permissions). The preferred method for authentication is to specify the IAM\_ROLE parameter and provide the Amazon Resource Name (ARN) for an IAM role with the necessary permissions. For more information, see [Role-based access control](copy-usage_notes-access-permissions.md#copy-usage_notes-access-role-based). 

To authenticate using the IAM\_ROLE parameter, replace *<aws-account-id>* and *<role-name>* with your AWS account ID and the role name, as shown in the following syntax. 

```
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
```

The following example shows authentication using an IAM role.

```
COPY favoritemovies 
FROM 'dynamodb://ProductCatalog'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
```

For more information about other authorization options, see [Authorization parameters](copy-parameters-authorization.md)

If you want to validate your data without actually loading the table, use the NOLOAD option with the [COPY](r_COPY.md) command.

The following example loads the FAVORITEMOVIES table with data from the DynamoDB table my-favorite-movies-table. The read activity can consume up to 50% of the provisioned throughput.

```
COPY favoritemovies FROM 'dynamodb://my-favorite-movies-table' 
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole' 
READRATIO 50;
```

To maximize throughput, the COPY command loads data from an Amazon DynamoDB table in parallel across the compute nodes in the cluster.

## Provisioned throughput with automatic compression
<a name="t_Loading-data-from-dynamodb-provisioned-throughput-with-automatic-compression"></a>

By default, the COPY command applies automatic compression whenever you specify an empty target table with no compression encoding. The automatic compression analysis initially samples a large number of rows from the Amazon DynamoDB table. The sample size is based on the value of the COMPROWS parameter. The default is 100,000 rows per slice.

After sampling, the sample rows are discarded and the entire table is loaded. As a result, many rows are read twice. For more information about how automatic compression works, see [Loading tables with automatic compression](c_Loading_tables_auto_compress.md).
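As a rough sketch of the arithmetic, the number of rows read for sampling scales with the slice count (the slice count below is illustrative):

```python
# Back-of-the-envelope arithmetic for the sampling described above.
# COMPROWS defaults to 100,000 rows per slice; the slice count is a
# hypothetical example, not a property of any particular cluster.
comprows_per_slice = 100_000
slices = 8  # e.g., a 4-node cluster with 2 slices per node
sampled = comprows_per_slice * slices
print(sampled)  # rows read once for sampling, then read again during the load
```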

**Important**  
When the COPY command reads data from the Amazon DynamoDB table, including the rows used for sampling, the resulting data transfer is part of that table's provisioned throughput.

## Loading multibyte data from Amazon DynamoDB
<a name="t_Loading-data-from-dynamodb-loading-multibyte-data-from-amazon-dynamodb"></a>

If your data includes non-ASCII multibyte characters (such as Chinese or Cyrillic characters), you must load the data to VARCHAR columns. The VARCHAR data type supports four-byte UTF-8 characters, but the CHAR data type only accepts single-byte ASCII characters. You cannot load five-byte or longer characters into Amazon Redshift tables. For more information about CHAR and VARCHAR, see [Data types](c_Supported_data_types.md).
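A quick way to check whether a value needs VARCHAR is to look at the UTF-8 byte width of each character. A minimal Python sketch (the helper name is hypothetical):

```python
# Sketch: decide whether a string fits CHAR (single-byte ASCII only) or
# needs VARCHAR (UTF-8 characters up to four bytes).
def column_type_for(text: str) -> str:
    widths = [len(ch.encode("utf-8")) for ch in text]
    if any(w > 4 for w in widths):
        raise ValueError("five-byte or longer UTF-8 characters cannot be loaded")
    return "CHAR" if all(w == 1 for w in widths) else "VARCHAR"

print(column_type_for("hello"))  # CHAR
print(column_type_for("héllo"))  # VARCHAR ('é' is two bytes in UTF-8)
```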

# Verifying that the data loaded correctly
<a name="verifying-that-data-loaded-correctly"></a>

After the load operation is complete, query the [STL\_LOAD\_COMMITS](r_STL_LOAD_COMMITS.md) system table to verify that the expected files were loaded. Run the COPY command and load verification within the same transaction so that if there is a problem with the load you can roll back the entire transaction.

The following query returns entries for loading the tables in the TICKIT database:

```
SELECT query, trim(filename) AS filename, curtime, status
FROM stl_load_commits
WHERE filename like '%tickit%' order by query;


 query |         filename          |          curtime           | status
-------+---------------------------+----------------------------+--------
 22475 | tickit/allusers_pipe.txt  | 2013-02-08 20:58:23.274186 |      1
 22478 | tickit/venue_pipe.txt     | 2013-02-08 20:58:25.070604 |      1
 22480 | tickit/category_pipe.txt  | 2013-02-08 20:58:27.333472 |      1
 22482 | tickit/date2008_pipe.txt  | 2013-02-08 20:58:28.608305 |      1
 22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489  |      1
 22487 | tickit/listings_pipe.txt  | 2013-02-08 20:58:37.632939 |      1
 22489 | tickit/sales_tab.txt      | 2013-02-08 20:58:37.632939 |      1
(7 rows)
```

# Validating input data
<a name="t_Validating_input_files"></a>

To validate the data in the Amazon S3 input files or Amazon DynamoDB table before you actually load the data, use the NOLOAD option with the [COPY](r_COPY.md) command. Use NOLOAD with the same COPY commands and options you would use to load the data. NOLOAD checks the integrity of all of the data without loading it into the database. The NOLOAD option displays any errors that occur if you attempt to load the data.

For example, if you specified the incorrect Amazon S3 path for the input file, Amazon Redshift would display the following error.

```
ERROR:  No such file or directory
DETAIL:
-----------------------------------------------
Amazon Redshift error:  The specified key does not exist
code:      2
context:   S3 key being read :
location:  step_scan.cpp:1883
process:   xenmaster [pid=22199]
-----------------------------------------------
```

To troubleshoot error messages, see the [Load error reference](r_Load_Error_Reference.md). 

For an example using the NOLOAD option, see [COPY command with the NOLOAD option](r_COPY_command_examples.md#r_COPY_command_examples-load-noload-option).

# Loading tables with automatic compression
<a name="c_Loading_tables_auto_compress"></a>

You can apply compression encodings to columns in tables manually, based on your own evaluation of the data. Or you can use the COPY command with COMPUPDATE set to ON to analyze and apply compression automatically based on sample data. 

You can use automatic compression when you create and load a brand new table. The COPY command performs a compression analysis. You can also perform a compression analysis without loading data or changing the compression on a table by running the [ANALYZE COMPRESSION](r_ANALYZE_COMPRESSION.md) command on an already populated table. For example, you can run ANALYZE COMPRESSION when you want to analyze compression on a table for future use, while preserving the existing data definition language (DDL) statements.

Automatic compression balances overall performance when choosing compression encodings. Range-restricted scans might perform poorly if sort key columns are compressed much more highly than other columns in the same query. As a result, automatic compression skips the data analysis phase for the sort key columns and keeps the user-defined encoding types. 

Automatic compression chooses RAW encoding if you haven't explicitly defined a type of encoding. ANALYZE COMPRESSION behaves the same. For optimal query performance, consider using RAW for sort keys.

## How automatic compression works
<a name="c_Loading_tables_auto_compress-how-automatic-compression-works"></a>

When the COMPUPDATE parameter is ON, the COPY command applies automatic compression whenever you run the COPY command with an empty target table and all of the table columns either have RAW encoding or no encoding.

To apply automatic compression to an empty table, regardless of its current compression encodings, run the COPY command with the COMPUPDATE option set to ON. To turn off automatic compression, run the COPY command with the COMPUPDATE option set to OFF.

You cannot apply automatic compression to a table that already contains data.

**Note**  
Automatic compression analysis requires enough rows in the load data (at least 100,000 rows per slice) to generate a meaningful sample.

Automatic compression performs these operations in the background as part of the load transaction:

1. An initial sample of rows is loaded from the input file. Sample size is based on the value of the COMPROWS parameter. The default is 100,000.

1. Compression options are chosen for each column.

1. The sample rows are removed from the table.

1. The table is recreated with the chosen compression encodings.

1. The entire input file is loaded and compressed using the new encodings.

After you run the COPY command, the table is fully loaded, compressed, and ready for use. If you load more data later, appended rows are compressed according to the existing encoding.

If you only want to perform a compression analysis, run ANALYZE COMPRESSION, which is more efficient than running a full COPY. Then you can evaluate the results to decide whether to use automatic compression or recreate the table manually.

Automatic compression is supported only for the COPY command. Alternatively, you can manually apply compression encoding when you create the table. For information about manual compression encoding, see [Column compression to reduce the size of stored data](t_Compressing_data_on_disk.md).

## Automatic compression example
<a name="r_COPY_COMPRESS_examples"></a>

In this example, assume that the TICKIT database contains a copy of the LISTING table called BIGLIST, and you want to apply automatic compression to this table when it is loaded with approximately 3 million rows.

**To load and automatically compress the table**

1. Make sure that the table is empty. You can apply automatic compression only to an empty table:

   ```
   TRUNCATE biglist;
   ```

1. Load the table with a single COPY command. Although the table is empty, some earlier encoding might have been specified. To ensure that Amazon Redshift performs a compression analysis, set the COMPUPDATE parameter to ON.

   ```
   COPY biglist FROM 's3://amzn-s3-demo-bucket/biglist.txt' 
   IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
   DELIMITER '|' COMPUPDATE ON;
   ```

   Because no COMPROWS option is specified, the default and recommended sample size of 100,000 rows per slice is used.

1. Look at the new schema for the BIGLIST table in order to review the automatically chosen encoding schemes.

   ```
   SELECT "column", type, encoding 
   from pg_table_def where tablename = 'biglist';
   
   
        Column     |            Type             | Encoding 
   ----------------+-----------------------------+----------
    listid         | integer                     | az64
    sellerid       | integer                     | az64
    eventid        | integer                     | az64
    dateid         | smallint                    | none
    numtickets     | smallint                    | az64
    priceperticket | numeric(8,2)                | az64
    totalprice     | numeric(8,2)                | az64
    listtime       | timestamp without time zone | az64
   ```

1. Verify that the expected number of rows were loaded: 

   ```
   select count(*) from biglist;
   
   count
   ---------
   3079952
   (1 row)
   ```

When rows are later appended to this table using COPY or INSERT statements, the same compression encodings are applied.

# Optimizing storage for narrow tables
<a name="c_load_compression_hidden_cols"></a>

If you have a table with very few columns but a very large number of rows, the three hidden metadata identity columns (INSERT\_XID, DELETE\_XID, ROW\_ID) will consume a disproportionate amount of the disk space for the table.

To optimize compression of the hidden columns, load the table in a single COPY transaction where possible. If you load the table with multiple separate COPY commands, the INSERT\_XID column does not compress well. If you use multiple COPY commands, you must perform a vacuum operation, but vacuuming does not improve compression of INSERT\_XID.

# Loading default column values
<a name="c_loading_default_values"></a>

You can optionally define a column list in your COPY command. If a column in the table is omitted from the column list, COPY will load the column with either the value supplied by the DEFAULT option that was specified in the CREATE TABLE command, or with NULL if the DEFAULT option was not specified.

If COPY attempts to assign NULL to a column that is defined as NOT NULL, the COPY command fails. For information about assigning the DEFAULT option, see [CREATE TABLE](r_CREATE_TABLE_NEW.md).

When loading from data files on Amazon S3, the columns in the column list must be in the same order as the fields in the data file. If a field in the data file does not have a corresponding column in the column list, the COPY command fails.

When loading from an Amazon DynamoDB table, order does not matter. Any fields in the Amazon DynamoDB attributes that do not match a column in the Amazon Redshift table are discarded.

The following restrictions apply when using the COPY command to load DEFAULT values into a table: 
+ If an [IDENTITY](r_CREATE_TABLE_NEW.md#identity-clause) column is included in the column list, the EXPLICIT\_IDS option must also be specified in the [COPY](r_COPY.md) command, or the COPY command will fail. Similarly, if an IDENTITY column is omitted from the column list, and the EXPLICIT\_IDS option is specified, the COPY operation will fail.
+ Because the evaluated DEFAULT expression for a given column is the same for all loaded rows, a DEFAULT expression that uses a RANDOM() function assigns the same value to all the rows.
+ DEFAULT expressions that contain CURRENT\_DATE or SYSDATE are set to the timestamp of the current transaction.

For an example, see "Load data from a file with default values" in [COPY examples](r_COPY_command_examples.md).

# Troubleshooting data loads
<a name="t_Troubleshooting_load_errors"></a>

When you load data into Amazon Redshift tables, you might encounter errors caused by Amazon S3 access issues, invalid input data, or the COPY command itself. The following sections provide information about identifying and resolving data load errors.

**Topics**
+ [Troubleshooting S3 event integration and COPY JOB errors](s3-integration-troubleshooting.md)
+ [S3ServiceException errors](s3serviceexception-error.md)
+ [System tables for troubleshooting data loads](system-tables-for-troubleshooting-data-loads.md)
+ [Multibyte character load errors](multi-byte-character-load-errors.md)
+ [Load error reference](r_Load_Error_Reference.md)

# Troubleshooting S3 event integration and COPY JOB errors
<a name="s3-integration-troubleshooting"></a>

Use the following information to troubleshoot common issues with Amazon S3 event integrations and COPY JOB with Amazon Redshift.

## Creation of the S3 event integration failed
<a name="s3-integration-troubleshooting-creation"></a>

If the creation of the S3 event integration failed, the status of the integration is `Inactive`. Make sure that the following are correct for your Amazon Redshift data warehouse.
+ You added the correct authorized principal and integration source for your target namespace in Amazon Redshift. See [Prerequisites to creating an S3 event integration](loading-data-copy-job.md#loading-data-copy-job-prerequisites).
+ You added the correct resource-based policy to the source Amazon S3 bucket. See [Prerequisites to creating an S3 event integration](loading-data-copy-job.md#loading-data-copy-job-prerequisites).

## Your Amazon S3 data is not appearing in the target database
<a name="s3-integration-troubleshooting-missing-data"></a>

If data from a COPY JOB doesn't appear, check the following.
+ Query SYS\_COPY\_JOB\_DETAIL to view whether the Amazon S3 file has been loaded, is pending ingestion, or encountered an error. For more information, see [SYS\_COPY\_JOB\_DETAIL](SYS_COPY_JOB_DETAIL.md).
+ Consult STL\_ERROR or SYS\_COPY\_JOB\_INFO if the Amazon S3 file is not there or there is an unexpected wait time. Look for credential errors or anything that suggests the integration is inactive. For more information, see [STL\_ERROR](r_STL_ERROR.md) and [SYS\_COPY\_JOB\_INFO](SYS_COPY_JOB_INFO.md).

# S3ServiceException errors
<a name="s3serviceexception-error"></a>

The most common S3ServiceException errors are caused by an improperly formatted or incorrect credentials string, having your cluster and your bucket in different AWS Regions, or insufficient Amazon S3 permissions.

This section provides troubleshooting information for each type of error.

## Invalid credentials string
<a name="invalid-credentials-string-error"></a>

If your credentials string was improperly formatted, you will receive the following error message: 

```
ERROR: Invalid credentials. Must be of the format: credentials 
'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>
[;token=<temporary-session-token>]'
```

Verify that the credentials string does not contain any spaces or line breaks, and is enclosed in single quotation marks. 
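As a sanity check before running COPY, you can verify the string shape locally. The following Python sketch approximates the documented format with a regular expression; it is an illustration, not an official validator:

```python
import re

# Hypothetical validator for the credentials string format shown above:
# no spaces or line breaks, with an optional session token.
CRED_RE = re.compile(
    r"^aws_access_key_id=[^;\s]+;aws_secret_access_key=[^;\s]+"
    r"(;token=[^;\s]+)?$"
)

def credentials_string_ok(creds: str) -> bool:
    """True if the string matches the documented format."""
    return bool(CRED_RE.match(creds))

print(credentials_string_ok(
    "aws_access_key_id=AKIAEXAMPLE;aws_secret_access_key=abc123"))   # True
print(credentials_string_ok(
    "aws_access_key_id=AKIAEXAMPLE; aws_secret_access_key=abc123"))  # False
```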

## Invalid access key ID
<a name="invalid-access-key-id-error"></a>

If your access key ID does not exist, you will receive the following error message: 

```
[Amazon](500310) Invalid operation: S3ServiceException:The AWS Access Key Id you provided does not exist in our records.
```

This is often a copy and paste error. Verify that the access key ID was entered correctly. Also, if you are using temporary session keys, check that the value for `token` is set.

## Invalid secret access key
<a name="invalid-secret-access-key-error"></a>

If your secret access key is incorrect, you will receive the following error message: 

```
[Amazon](500310) Invalid operation: S3ServiceException:The request signature we calculated does not match the signature you provided. 
Check your key and signing method.,Status 403,Error SignatureDoesNotMatch
```

This is often a copy and paste error. Verify that the secret access key was entered correctly and that it is the correct key for the access key ID.

## Bucket is in a different Region
<a name="bucket-in-different-region"></a>

The Amazon S3 bucket specified in the COPY command must be in the same AWS Region as the cluster. If your Amazon S3 bucket and your cluster are in different Regions, you will receive an error similar to the following: 

```
ERROR: S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint.
```

You can create an Amazon S3 bucket in a specific Region either by selecting the Region when you create the bucket by using the Amazon S3 Management Console, or by specifying an endpoint when you create the bucket using the Amazon S3 API or CLI. For more information, see [Uploading files to Amazon S3 to use with COPY](t_uploading-data-to-S3.md).

For more information about Amazon S3 Regions, see [Accessing a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html#access-bucket-intro) in the *Amazon Simple Storage Service User Guide*.

Alternatively, you can specify the Region using the [REGION](copy-parameters-data-source-s3.md#copy-region) option with the COPY command.

## Access denied
<a name="s3-access-denied-error"></a>

If the user does not have sufficient permissions, you will receive the following error message:

```
ERROR: S3ServiceException:Access Denied,Status 403,Error AccessDenied
```

One possible cause is that the user identified by the credentials does not have LIST and GET access to the Amazon S3 bucket. For other causes, see [Troubleshoot Access Denied (403 Forbidden) errors in Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/troubleshoot-403-errors.html) in the *Amazon Simple Storage Service User Guide*.

For information about managing user access to buckets, see [Identity and access management in Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-access-control.html) in the *Amazon Simple Storage Service User Guide*.

# System tables for troubleshooting data loads
<a name="system-tables-for-troubleshooting-data-loads"></a>

The following Amazon Redshift system tables can be helpful in troubleshooting data load issues:
+ Query [STL\_LOAD\_ERRORS](r_STL_LOAD_ERRORS.md) to discover the errors that occurred during specific loads.
+ Query [STL\_FILE\_SCAN](r_STL_FILE_SCAN.md) to view load times for specific files or to see if a specific file was even read.
+ Query [STL\_S3CLIENT\_ERROR](r_STL_S3CLIENT_ERROR.md) to find details for errors encountered while transferring data from Amazon S3.

**To find and diagnose load errors**

1. Create a view or define a query that returns details about load errors. The following example joins the STL\_LOAD\_ERRORS table to the STV\_TBL\_PERM table to match table IDs with actual table names. 

   ```
   create view loadview as
   (select distinct tbl, trim(name) as table_name, query, starttime,
   trim(filename) as input, line_number, colname, err_code,
   trim(err_reason) as reason
   from stl_load_errors sl, stv_tbl_perm sp
   where sl.tbl = sp.id);
   ```

1. Set the MAXERRORS option in your COPY command to a large enough value to enable COPY to return useful information about your data. If COPY encounters errors, an error message directs you to consult the STL\_LOAD\_ERRORS table for details.

1. Query the LOADVIEW view to see error details. For example: 

   ```
   select * from loadview where table_name='venue';
   ```

   ```
     tbl   | table_name | query |         starttime          
   --------+------------+-------+----------------------------
    100551 | venue      | 20974 | 2013-01-29 19:05:58.365391 
   
   |     input      | line_number | colname | err_code |       reason
   +----------------+-------------+---------+----------+--------------------
   | venue_pipe.txt |           1 |       0 |     1214 | Delimiter not found
   ```

1. Fix the problem in the input file or the load script, based on the information that the view returns. Some typical load errors to watch for include: 
   + Mismatch between data types in table and values in input data fields.
   + Mismatch between number of columns in table and number of fields in input data.
   + Mismatched quotation marks. Amazon Redshift supports both single and double quotation marks; however, these quotation marks must be balanced appropriately.
   + Incorrect format for date/time data in input files.
   + Out-of-range values in input files (for numeric columns).
   + Number of distinct values for a column exceeds the limitation for its compression encoding.
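   Several of these checks can be run client-side before you invoke COPY. The following Python sketch (hypothetical helper names; ISO date format assumed) flags two of the most common problems for a delimited input line, a column-count mismatch and a badly formatted date value:

   ```python
   import datetime

   def precheck_row(fields, expected_cols, date_cols=()):
       # Mirrors two common STL_LOAD_ERRORS causes: a mismatch between the
       # number of input fields and table columns, and bad date data.
       problems = []
       if len(fields) != expected_cols:
           problems.append(f"expected {expected_cols} fields, got {len(fields)}")
       for i in date_cols:
           try:
               datetime.date.fromisoformat(fields[i])
           except (ValueError, IndexError):
               problems.append(f"field {i} is not a valid ISO date")
       return problems

   # A pipe-delimited line that is missing its third column:
   print(precheck_row("100|Main Arena".split("|"), expected_cols=3))
   ```

   Checks like these don't replace the system tables, but they can catch malformed files before a load starts.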

# Multibyte character load errors
<a name="multi-byte-character-load-errors"></a>

Columns with a CHAR data type only accept single-byte UTF-8 characters, up to byte value 127, or 7F hex, which is also the ASCII character set. VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes. For more information, see [Character types](r_Character_types.md). 
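A quick way to reason about these limits is to check them client-side. The following Python sketch (illustrative only; the helper names are not part of any Redshift API) tests whether a value is safe for a CHAR column and measures the byte length that counts against a VARCHAR limit:

```python
def char_column_safe(value: str) -> bool:
    # CHAR columns accept only single-byte UTF-8 characters,
    # up to byte value 127 (0x7F), which is the ASCII set.
    return all(ord(ch) <= 0x7F for ch in value)

def varchar_byte_length(value: str) -> int:
    # VARCHAR limits count bytes, not characters; a single UTF-8
    # character can occupy up to four bytes.
    return len(value.encode("utf-8"))

print(char_column_safe("Main Arena"))  # ASCII only, so True
print(varchar_byte_length("café"))     # "é" encodes to two bytes, so 5
```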

If a line in your load data contains a character that is not valid for the column data type, COPY returns an error and logs a row in the STL_LOAD_ERRORS system log table with error number 1220. The ERR_REASON field includes the byte sequence, in hex, for the invalid character. 

An alternative to fixing invalid characters in your load data is to replace them during the load process. To replace invalid UTF-8 characters, specify the ACCEPTINVCHARS option with the COPY command; the character you specify replaces each invalid code point. If the ACCEPTINVCHARS option isn't set, COPY returns an error when it encounters an invalid UTF-8 character. For more information, see [ACCEPTINVCHARS](copy-parameters-data-conversion.md#acceptinvchars).
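
If you'd rather clean files before uploading than rely on ACCEPTINVCHARS, a client-side scrub can apply a similar substitution. This Python sketch is an approximation, not Redshift's implementation; `'?'` mirrors the ACCEPTINVCHARS default replacement character:

```python
def scrub_invalid_utf8(raw: bytes, replacement: str = "?") -> str:
    # Decode with replacement, then swap the Unicode replacement character
    # (U+FFFD) for the substitute. Note that any legitimate U+FFFD already
    # present in the input is also replaced by this approach.
    return raw.decode("utf-8", errors="replace").replace("\ufffd", replacement)

print(scrub_invalid_utf8(b"venue\xa4name"))  # -> venue?name
```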

The following code points are valid UTF-8, so COPY operations don't return an error for them even when the ACCEPTINVCHARS option is not set. However, these code points are not valid characters. You can use the [ACCEPTINVCHARS](copy-parameters-data-conversion.md#acceptinvchars) option to replace each of these code points with a character that you specify. These code points include the range of values from `0xFDD0` to `0xFDEF` and values up to `0x10FFFF` ending with `FFFE` or `FFFF`:
+ `0xFFFE`, `0x1FFFE`, `0x2FFFE`, …, `0xFFFFE`, `0x10FFFE`
+ `0xFFFF`, `0x1FFFF`, `0x2FFFF`, …, `0xFFFFF`, `0x10FFFF`
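
The ranges above can be captured in a small predicate. This Python sketch (illustrative; the function name is hypothetical) reports whether a code point is one of these Unicode noncharacters:

```python
def is_noncharacter(cp: int) -> bool:
    # 0xFDD0-0xFDEF, plus the last two code points of every plane
    # (values ending in FFFE or FFFF, up to 0x10FFFF).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return cp <= 0x10FFFF and (cp & 0xFFFF) in (0xFFFE, 0xFFFF)

print(is_noncharacter(0xFFFE), is_noncharacter(0x41))  # True False
```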

The following example shows the error reason when COPY attempts to load UTF-8 character `e0 a1 c7a4` into a CHAR column.

```
Multibyte character not supported for CHAR 
(Hint: Try using  VARCHAR). Invalid char: e0 a1 c7a4
```

If the error is related to a VARCHAR data type, the error reason includes an error code as well as the invalid UTF-8 hex sequence. The following example shows the error reason when COPY attempts to load UTF-8 `a4` into a VARCHAR field.

```
String contains invalid or unsupported UTF-8 codepoints. 
Bad UTF-8 hex sequence: a4 (error 3)
```

The following table lists the descriptions and suggested workarounds for VARCHAR load errors. If one of these errors occurs, replace the character with a valid UTF-8 code sequence or remove the character.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/multi-byte-character-load-errors.html)

# Load error reference
<a name="r_Load_Error_Reference"></a>

If any errors occur while loading data from a file, query the [STL_LOAD_ERRORS](r_STL_LOAD_ERRORS.md) table to identify the error and determine the possible explanation. The following table lists all error codes that might occur during data loads:

## Load error codes
<a name="r_Load_Error_Reference-load-error-codes"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/r_Load_Error_Reference.html)

# Create an S3 event integration to automatically copy files from Amazon S3 buckets
<a name="loading-data-copy-job"></a>

**Note**  
The preview release for auto-copy has ended. Consequently, preview clusters will be automatically removed 30 days after the end of the preview period. If you plan to continue using auto-copy, we recommend re-creating your existing auto-copy jobs on another Amazon Redshift cluster. Upgrading a preview cluster to the latest Amazon Redshift version is not supported.

You can use an auto-copy job to load data into your Amazon Redshift tables from files that are stored in Amazon S3. Amazon Redshift detects when new Amazon S3 files are added to the path specified in your COPY command. A COPY command is then automatically run without you having to create an external data ingestion pipeline. Amazon Redshift keeps track of which files have been loaded. Amazon Redshift determines the number of files batched together per COPY command. You can see the resulting COPY commands in system views.

The first step to create an automatic COPY JOB is to create an S3 event integration. When a new file appears in the Amazon S3 source bucket, Amazon Redshift then manages loading the files into your database using the COPY command.

## Prerequisites to creating an S3 event integration
<a name="loading-data-copy-job-prerequisites"></a>

To set up your S3 event integration, confirm that the following prerequisites are met.
+ Your Amazon S3 bucket must have a bucket policy that allows several Amazon S3 permissions. For example, the following policy grants the required permissions on the bucket `amzn-s3-demo-bucket`, which is hosted in *us-east-1*. Both the Amazon S3 bucket and the integration must be in the same AWS Region.

------
#### [ JSON ]


  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Sid": "Auto-Copy-Policy-01",
              "Effect": "Allow",
              "Principal": {
                  "Service": "redshift.amazonaws.com"
                  },
              "Action": [
                  "s3:GetBucketNotification",
                  "s3:PutBucketNotification",
                  "s3:GetBucketLocation"
              ],
              "Resource": "arn:aws:s3:::amzn-s3-demo-bucket:*",
              "Condition": {
                  "ArnLike": {
                      "aws:SourceArn": "arn:aws:redshift:us-east-1:111122223333:integration:*"
                  },
                  "StringEquals": {
                      "aws:SourceAccount": "111122223333"
                  }
              }
          }
      ]
  }
  ```

------
+ Your target Amazon Redshift provisioned cluster or Redshift Serverless namespace must have permission to access the bucket. Confirm that an IAM role associated with your cluster or serverless namespace has an IAM policy that grants the proper permissions. The policy must allow `s3:ListBucket` on the bucket itself, such as `amzn-s3-demo-bucket`, and `s3:GetObject` on the bucket's contents, such as `amzn-s3-demo-bucket/*`.

------
#### [ JSON ]


  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Sid": "AutoCopyReadId",
              "Effect": "Allow",
              "Action": [
                  "s3:GetObject",
                  "s3:ListBucket"
              ],
              "Resource": [
                  "arn:aws:s3:::amzn-s3-demo-bucket",  
                  "arn:aws:s3:::amzn-s3-demo-bucket/*" 
              ]
          }
      ]
  }
  ```

------

  Attach the policy to an IAM role whose trust relationship is as follows.

------
#### [ JSON ]


  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Service": [
                      "redshift.amazonaws.com"
                  ]
              },
              "Action": "sts:AssumeRole"
          }
      ]
  }
  ```

------

  If your target data warehouse is a provisioned cluster, you can associate an IAM role to your provisioned cluster using the Amazon Redshift console, **Cluster permissions** tab in your cluster details. For information about how to associate a role to your provisioned cluster, see [Associating IAM roles with clusters](https://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role-associating-with-clusters.html) in the *Amazon Redshift Management Guide*.

  If your target data warehouse is Redshift Serverless, you can associate an IAM role to your serverless namespace using the Redshift Serverless console, **Security and encryption** tab in your namespace details. For information about how to associate a role to your serverless namespace, see [Granting permissions to Amazon Redshift Serverless](https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-security-other-services.html) in the *Amazon Redshift Management Guide*.
+ Your Amazon Redshift data warehouse must also have a resource policy that authorizes the Amazon S3 bucket. If you use the Amazon Redshift console, when you create the S3 event integration, Amazon Redshift provides the option **Fix it for me** to add this policy to your Amazon Redshift data warehouse. To update a resource policy yourself, you can use the [put-resource-policy](https://docs.aws.amazon.com/cli/latest/reference/redshift/put-resource-policy.html) AWS CLI command. For example, to attach a resource policy to your Amazon Redshift provisioned cluster for an S3 event integration with an Amazon S3 bucket, run an AWS CLI command similar to the following. The example shows a policy for a provisioned cluster namespace in the *us-east-1* AWS Region for user account *123456789012*. The bucket is named *amzn-s3-demo-bucket*.

  ```
  aws redshift put-resource-policy \
  --policy file://rs-rp.json \
  --resource-arn "arn:aws:redshift:us-east-1:123456789012:namespace/cc4ffe56-ad2c-4fd1-a5a2-f29124a56433"
  ```

  Where `rs-rp.json` contains:

------
#### [ JSON ]


  ```
  {
  	"Version":"2012-10-17",		 	 	 
  	"Statement": [
  		{
  			"Effect": "Allow",
  			"Principal": {
  				"Service": "redshift.amazonaws.com"
  			},
  			"Action": "redshift:AuthorizeInboundIntegration",
  			"Resource": "arn:aws:redshift:us-east-1:123456789012:namespace:cc4ffe56-ad2c-4fd1-a5a2-f29124a56433",
  			"Condition": {
  				"StringEquals": {
  					"aws:SourceArn": "arn:aws:s3:::amzn-s3-demo-bucket",
  					"aws:SourceAccount": 111122223333
  				}
  			}
  		},
  		{
  			"Effect": "Allow",
  			"Principal": {
  				"AWS": "arn:aws:iam::111122223333:role/myRedshiftRole"
  			},
  			"Action": "redshift:CreateInboundIntegration",
  			"Resource": "arn:aws:redshift:us-east-1:123456789012:namespace:cc4ffe56-ad2c-4fd1-a5a2-f29124a56433",
  			"Condition": {
  				"StringEquals": {
  					"aws:SourceArn": "arn:aws:s3:::amzn-s3-demo-bucket",
  					"aws:SourceAccount": 111122223333
  				}
  			}
  		}
  	]
  }
  ```

------

  To attach a resource policy to your Redshift Serverless namespace for an S3 event integration with an Amazon S3 bucket, run an AWS CLI command similar to the following. The following example shows a policy for a serverless namespace in the *us-east-1* AWS Region for user account *123456789012*. The bucket is named *amzn-s3-demo-bucket*.

  ```
  aws redshift put-resource-policy \
  --policy file://rs-rp.json \
  --resource-arn "arn:aws:redshift-serverless:us-east-1:123456789012:namespace/namespace-1"
  ```

  Where `rs-rp.json` contains:

------
#### [ JSON ]


  ```
  {
  	"Version":"2012-10-17",		 	 	 
  	"Statement": [
  		{
  			"Effect": "Allow",
  			"Principal": {
  				"Service": "redshift.amazonaws.com"
  			},
  			"Action": "redshift:AuthorizeInboundIntegration",
  			"Resource": "arn:aws:redshift-serverless:us-east-1:123456789012:namespace/namespace-1",
  			"Condition": {
  				"StringEquals": {
  					"aws:SourceArn": "arn:aws:s3:::amzn-s3-demo-bucket",
  					"aws:SourceAccount": 111122223333
  	
  				}
  			}
  		},
  		{
  			"Effect": "Allow",
  			"Principal": {
  				"AWS": "arn:aws:iam::123456789012:user/myUser"
  			},
  			"Action": "redshift:CreateInboundIntegration",
  			"Resource": "arn:aws:redshift-serverless:us-east-1:123456789012:namespace/namespace-1",
  			"Condition": {
  				"StringEquals": {
  					"aws:SourceArn": "arn:aws:s3:::amzn-s3-demo-bucket",
  					"aws:SourceAccount": 111122223333
  				}
  			}
  		}
  	]
  }
  ```

------
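
For scripting, the resource-policy JSON above can be generated rather than hand-edited. The following Python sketch is a hypothetical helper: the statement shape follows the samples in this section, and the ARNs and account ID you pass in are placeholders:

```python
import json

def build_inbound_auth_policy(namespace_arn: str, bucket_arn: str,
                              account_id: str) -> str:
    # Assembles the AuthorizeInboundIntegration statement used with
    # `aws redshift put-resource-policy`, following the sample policies
    # shown earlier in this section.
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "redshift:AuthorizeInboundIntegration",
            "Resource": namespace_arn,
            "Condition": {"StringEquals": {
                "aws:SourceArn": bucket_arn,
                "aws:SourceAccount": account_id,
            }},
        }],
    }, indent=4)
```

Write the result to a file (for example, `rs-rp.json`) and pass it to the CLI with `--policy file://rs-rp.json`.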

## Create an S3 event integration
<a name="loading-data-copy-job-create-s3-event-integration"></a>

To set up your copy job, you first define an S3 event integration.

------
#### [ Amazon Redshift console ]

**To create an Amazon S3 event integration on the Amazon Redshift console**

1. Sign in to the AWS Management Console and open the Amazon Redshift console at [https://console.aws.amazon.com/redshiftv2/](https://console.aws.amazon.com/redshiftv2/).

1. In the left navigation pane, choose **S3 event integrations**.

1. Choose **Create Amazon S3 event integration** to open the wizard to create an S3 event integration to use with auto-copy. Your source Amazon S3 bucket and target Amazon Redshift data warehouse must be in the same AWS Region. Specify the following information when going through the steps to create an integration:
   + **Integration name** – A unique identifier across all integrations owned by your AWS account in the current AWS Region.
   + **Description** – Text that describes the Amazon S3 event integration for later reference.
   + **Source S3 bucket** – The Amazon S3 bucket in the current AWS account and AWS Region that is the source of the data ingested into Amazon Redshift.
   + **Amazon Redshift data warehouse** – The target Amazon Redshift provisioned cluster or Redshift Serverless workgroup that receives the data from the integration.

     If your target Amazon Redshift data warehouse is in the same account, you can select the target. If the target is in a different account, specify the **Amazon Redshift data warehouse ARN**. The target must have a resource policy that allows the authorized principals and the integration source. If your target is in the same account but lacks the correct resource policies, you can select the **Fix it for me** option to apply the resource policies automatically during the create integration process. If your target is in a different AWS account, you must apply the resource policy on the Amazon Redshift data warehouse manually.

1. Enter up to 50 tag **Keys**, each with an optional **Value**, to provide additional metadata about the integration.

1. On the review page, choose **Create S3 event integration**.

------
#### [ AWS CLI ]

To create an Amazon S3 event integration using the AWS CLI, use the `create-integration` command with the following options:
+ `integration-name` – Specify a name for the integration.
+ `source-arn` – Specify the ARN of the Amazon S3 source bucket.
+ `target-arn` – Specify the namespace ARN of the Amazon Redshift provisioned cluster or Redshift Serverless workgroup target.

The following example creates an integration by providing the integration name, source ARN, and target ARN. The integration is not encrypted.

```
aws redshift create-integration \
--integration-name s3-integration \
--source-arn arn:aws:s3:::s3-example-bucket \
--target-arn arn:aws:redshift:us-east-1:123456789012:namespace:a1b2c3d4-5678-90ab-cdef-EXAMPLE22222
```

The command returns output similar to the following:

```
{
    "IntegrationArn": "arn:aws:redshift:us-east-1:123456789012:integration:a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    "IntegrationName": "s3-integration",
    "SourceArn": "arn:aws:s3:::s3-example-bucket",
    "SourceType": "s3-event-notifications",
    "TargetArn": "arn:aws:redshift:us-east-1:123456789012:namespace:a1b2c3d4-5678-90ab-cdef-EXAMPLE22222",
    "Status": "creating",
    "Errors": [],
    "CreateTime": "2024-10-09T19:08:52.758000+00:00",
    "Tags": []
}
```

You can also use the following AWS CLI commands to manage your S3 event integration.
+ `delete-integration` – Specify an integration ARN to delete an S3 event integration.
+ `modify-integration` – Specify an integration ARN to change the name or description (or both) of an S3 event integration.
+ `describe-integrations` – Specify an integration ARN to view properties of an S3 event integration.

See the [AWS CLI Command Reference for Amazon Redshift](https://docs.aws.amazon.com/cli/latest/reference/redshift/) for more information about these commands.
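
When automating checks, you can parse the CLI's JSON output directly. The following Python sketch assumes a `describe-integrations` response with a top-level `Integrations` list shaped like the `create-integration` output above, and lists the integrations that report errors:

```python
import json

def integrations_with_errors(describe_output: str) -> list:
    # Scans a describe-integrations JSON response for integrations whose
    # "Errors" list is non-empty. The top-level "Integrations" key is an
    # assumption about the response shape.
    doc = json.loads(describe_output)
    return [i["IntegrationName"] for i in doc.get("Integrations", [])
            if i.get("Errors")]
```

You would feed this the captured stdout of `aws redshift describe-integrations`.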

------

Amazon Redshift then creates an S3 event integration with its associated source and target, status, and information about the status of an associated auto-copy job. You can view information about an S3 event integration on the Amazon Redshift console by choosing **S3 event integrations**, and choosing the integration to show its details. Integrations are separated by those created **In my account** and **From other accounts**. The **In my account** list shows integrations where the source and target are in the same account. The **From other accounts** list shows integrations where the source is owned by another account.

If you delete an S3 event integration, the corresponding COPY JOB status changes from `1` (active) to `0` (inactive/pending). However, the corresponding COPY JOB isn't automatically dropped, so if you later try to create a COPY JOB with the same name, there might be a conflict.

## Create and monitor a COPY JOB
<a name="loading-data-copy-job-create-s3-autocopy"></a>

After the integration is created, on the **S3 event integration details** page for the integration you created, choose **Create autocopy job** to go to Amazon Redshift query editor v2 where you can create the auto-copy job for the integration. Amazon Redshift matches the bucket in the FROM clause in the COPY JOB CREATE statement to the bucket used in S3 event integration. For information about how to use Amazon Redshift query editor v2, see [Querying a database using the Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2.html) in the *Amazon Redshift Management Guide*. For example, run the following COPY command in query editor v2 to create an automatic COPY JOB that matches the Amazon S3 bucket `s3://amzn-s3-demo-bucket/staging-folder` to an Amazon S3 event integration.

```
COPY public.target_table
FROM 's3://amzn-s3-demo-bucket/staging-folder'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyLoadRoleName'
JOB CREATE my_copy_job_name
AUTO ON;
```

You define a COPY JOB one time. The same parameters are used for future runs.

To define and manage a COPY JOB, you must have permission. For information about granting and revoking permission on a COPY JOB, see [GRANT](r_GRANT.md) and [REVOKE](r_REVOKE.md). For more information about granting and revoking scoped permissions for a COPY JOB, see [Granting scoped permissions](r_GRANT.md#grant-scoped-syntax) and [Revoking scoped permissions](r_REVOKE.md#revoke-scoped-permissions).

You manage the load operations using options to CREATE, LIST, SHOW, DROP, ALTER, and RUN jobs. For more information, see [COPY JOB](r_COPY-JOB.md).

You can query system views to see the COPY JOB status and progress. Views are provided as follows:
+ [SYS_COPY_JOB](SYS_COPY_JOB.md) – contains a row for each currently defined COPY JOB.
+ [SYS_COPY_JOB_DETAIL](SYS_COPY_JOB_DETAIL.md) – contains details on pending, error, and ingested files for each COPY JOB.
+ [SYS_COPY_JOB_INFO](SYS_COPY_JOB_INFO.md) – contains messages logged about a COPY JOB.
+ [SYS_LOAD_HISTORY](SYS_LOAD_HISTORY.md) – contains details of COPY commands.
+ [SYS_LOAD_ERROR_DETAIL](SYS_LOAD_ERROR_DETAIL.md) – contains details of COPY command errors.
+ [SVV_COPY_JOB_INTEGRATIONS](SVV_COPY_JOB_INTEGRATIONS.md) – contains details of S3 event integrations.
+ [STL_LOAD_ERRORS](r_STL_LOAD_ERRORS.md) – contains errors from COPY commands.
+ [STL_LOAD_COMMITS](r_STL_LOAD_COMMITS.md) – contains information used to troubleshoot a COPY command data load.

For information about troubleshooting S3 event integration errors, see [Troubleshooting S3 event integration and COPY JOB errors](s3-integration-troubleshooting.md).

To get the list of files loaded by a COPY JOB, run the following SQL, but first replace *<job_id>*:

```
SELECT job_id, job_name, data_source, copy_query, filename, status, curtime
FROM sys_copy_job copyjob
JOIN stl_load_commits loadcommit
ON copyjob.job_id = loadcommit.copy_job_id
WHERE job_id = <job_id>;
```

## Considerations when creating S3 event integration for auto-copy
<a name="loading-data-copy-job-considerations"></a>

Consider the following when using auto-copy.
+ You can create a maximum of 200 COPY JOBS for each cluster or workgroup in an AWS account.
+ You can create a maximum of 50 S3 event integrations for each Amazon Redshift target.
+ You can't create an S3 event integration with a source Amazon S3 bucket which has a period (.) in the bucket name.
+ You can only create one S3 event integration between the same source and target. That is, there can be only one S3 event integration between an Amazon S3 bucket and an Amazon Redshift data warehouse at a time.
+ You can't have any existing event notifications for event type `S3_OBJECT_CREATED` defined on the source Amazon S3 bucket. However, after an S3 event integration is created, you can update the Amazon S3 bucket event notification with a prefix or suffix that has a narrower scope. In this way, you can also configure `S3_OBJECT_CREATED` for another prefix or suffix to other targets and avoid a conflict with the S3 event integration. If auto-copy doesn't run as expected, have the AWS CloudTrail log of the `s3:PutBucketNotificationConfiguration` action on your S3 bucket for the time frame in question ready when you contact AWS Support.

## Supported Regions
<a name="loading-data-copy-job-regions"></a>

The following Regions are available for auto-copy.


| Region | Auto-copy | 
| --- | --- | 
| Africa (Cape Town) | Available | 
| Asia Pacific (Hong Kong) | Available | 
| Asia Pacific (Taipei) | Available | 
| Asia Pacific (Tokyo) | Available | 
| Asia Pacific (Seoul) | Available | 
| Asia Pacific (Osaka) | Available | 
| Asia Pacific (Mumbai) | Available | 
| Asia Pacific (Hyderabad) | Available | 
| Asia Pacific (Singapore) | Available | 
| Asia Pacific (Sydney) | Available | 
| Asia Pacific (Jakarta) | Available | 
| Asia Pacific (Melbourne) | Available | 
| Asia Pacific (Malaysia) | Available | 
| Asia Pacific (New Zealand) | Not available | 
| Asia Pacific (Thailand) | Available | 
| Canada (Central) | Available | 
| Canada West (Calgary) | Available | 
| China (Beijing) | Available | 
| China (Ningxia) | Available | 
| Europe (Frankfurt) | Available | 
| Europe (Zurich) | Available | 
| Europe (Stockholm) | Available | 
| Europe (Milan) | Available | 
| Europe (Spain) | Available | 
| Europe (Ireland) | Available | 
| Europe (London) | Available | 
| Europe (Paris) | Available | 
| Israel (Tel Aviv) | Available | 
| Middle East (UAE) | Available | 
| Middle East (Bahrain) | Available | 
| Mexico (Central) | Available | 
| South America (São Paulo) | Available | 
| US East (N. Virginia) | Available | 
| US East (Ohio) | Available | 
| US West (N. California) | Available | 
| US West (Oregon) | Available | 
| AWS GovCloud (US-East) | Available | 
| AWS GovCloud (US-West) | Available | 

# Loading tables with DML commands
<a name="t_Updating_tables_with_DML_commands"></a>

Amazon Redshift supports standard data manipulation language (DML) commands (INSERT, UPDATE, and DELETE) that you can use to modify rows in tables. You can also use the TRUNCATE command to do fast bulk deletes.

**Note**  
We strongly encourage you to use the [COPY](r_COPY.md) command to load large amounts of data. Using individual INSERT statements to populate a table might be prohibitively slow. Alternatively, if your data already exists in other Amazon Redshift database tables, use INSERT INTO ... SELECT FROM or CREATE TABLE AS to improve performance. For information, see [INSERT](r_INSERT_30.md) or [CREATE TABLE AS](r_CREATE_TABLE_AS.md).

If you insert, update, or delete a significant number of rows in a table, relative to the number of rows before the changes, run the ANALYZE and VACUUM commands against the table when you are done. If a number of small changes accumulate over time in your application, you might want to schedule the ANALYZE and VACUUM commands to run at regular intervals. For more information, see [Analyzing tables](t_Analyzing_tables.md) and [Vacuuming tables](t_Reclaiming_storage_space202.md).

**Topics**
+ [Updating and inserting new data](t_updating-inserting-using-staging-tables-.md)

# Updating and inserting new data
<a name="t_updating-inserting-using-staging-tables-"></a>

You can efficiently add new data to an existing table by using the MERGE command. Perform a merge operation by creating a staging table and then using one of the methods described in this section to update the target table from the staging table. For more information on the MERGE command, see [MERGE](r_MERGE.md).

The [Merge examples](merge-examples.md) use a sample dataset for Amazon Redshift, called the TICKIT data set. As a prerequisite, you can set up the TICKIT tables and data by following the instructions available in [Getting started with common database tasks](https://docs.aws.amazon.com/redshift/latest/gsg/database-tasks.html). More detailed information about the sample data set is found at [Sample database](https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html). 

## Merge method 1: Replacing existing rows
<a name="merge-method-replace-existing-rows"></a>

If you are overwriting all of the columns in the target table, the fastest method to perform a merge is to replace the existing rows. This scans the target table only once, by using an inner join to delete rows that will be updated. After the rows are deleted, they are replaced with new rows by a single insert operation from the staging table. 

Use this method if all of the following are true: 
+ Your target table and your staging table contain the same columns. 
+ You intend to replace all of the data in the target table columns with all of the staging table columns.
+ You will use all of the rows in the staging table in the merge.

If any of these criteria do not apply, use Merge method 2: Specifying a column list without using MERGE, described in the following section.

If you will not use all of the rows in the staging table, filter the DELETE and INSERT statements by using a WHERE clause to leave out rows that are not changing. However, if most of the rows in the staging table will not participate in the merge, we recommend performing an UPDATE and an INSERT in separate steps, as described later in this section.
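
To build intuition for what the replace-rows method does, here is a plain-Python emulation (not Redshift code): rows whose keys appear in the staging data are deleted, and every staging row is then inserted.

```python
def merge_replace(target_rows, stage_rows, key="id"):
    # Delete target rows that will be replaced, then insert all staging
    # rows -- the same delete-then-insert flow as merge method 1.
    stage_keys = {row[key] for row in stage_rows}
    kept = [row for row in target_rows if row[key] not in stage_keys]
    return kept + list(stage_rows)

target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
stage = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
print(merge_replace(target, stage))
```

In Redshift the delete uses an inner join against the staging table and the insert is a single `INSERT INTO ... SELECT`, so the target is scanned only once.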

## Merge method 2: Specifying a column list without using MERGE
<a name="merge-method-specify-column-list"></a>

Use this method to update specific columns in the target table instead of overwriting entire rows. This method takes longer than the previous method because it requires an extra update step and doesn't use the MERGE command. Use this method if any of the following are true: 
+ Not all of the columns in the target table are to be updated. 
+ Most rows in the staging table will not be used in the updates. 

**Topics**
+ [Merge method 1: Replacing existing rows](#merge-method-replace-existing-rows)
+ [Merge method 2: Specifying a column list without using MERGE](#merge-method-specify-column-list)
+ [Creating a temporary staging table](merge-create-staging-table.md)
+ [Performing a merge operation by replacing existing rows](merge-replacing-existing-rows.md)
+ [Performing a merge operation by specifying a column list without using the MERGE command](merge-specify-a-column-list.md)
+ [Merge examples](merge-examples.md)

# Creating a temporary staging table
<a name="merge-create-staging-table"></a>

The *staging table* is a temporary table that holds all of the data that will be used to make changes to the *target table*, including both updates and inserts. 

A merge operation requires a join between the staging table and the target table. To collocate the joining rows, set the staging table's distribution key to the same column as the target table's distribution key. For example, if the target table uses a foreign key column as its distribution key, use the same column for the staging table's distribution key. If you create the staging table by using a [CREATE TABLE LIKE](r_CREATE_TABLE_NEW.md#create-table-like) statement, the staging table inherits the distribution key from the parent table. If you use a CREATE TABLE AS statement, the new table does not inherit the distribution key. For more information, see [Data distribution for query optimization](t_Distributing_data.md).

If the distribution key is not the same as the primary key and the distribution key is not updated as part of the merge operation, add a redundant join predicate on the distribution key columns to enable a collocated join. For example: 

```
where target.primarykey = stage.primarykey 
and target.distkey = stage.distkey
```

To verify that the query will use a collocated join, run the query with [EXPLAIN](r_EXPLAIN.md) and check for DS_DIST_NONE on all of the joins. For more information, see [Evaluating the query plan](c_data_redistribution.md).

# Performing a merge operation by replacing existing rows
<a name="merge-replacing-existing-rows"></a>

When you run the merge operation detailed in the procedure, put all of the steps except for creating and dropping the temporary staging table in a single transaction. The transaction rolls back if any step fails. Using a single transaction also reduces the number of commits, which saves time and resources.

**To perform a merge operation by replacing existing rows**

1. Create a staging table, and then populate it with data to be merged, as shown in the following pseudocode.

   ```
   CREATE temp table stage (like target); 
   
   INSERT INTO stage 
   SELECT * FROM source 
   WHERE source.filter = 'filter_expression';
   ```

1.  Use MERGE to perform an inner join with the staging table to update the rows from the target table that match the staging table, then insert all the remaining rows into the target table that don't match the staging table.

    We recommend you run the update and insert operations in a single MERGE command.

   ```
   MERGE INTO target 
   USING stage [optional alias] on (target.primary_key = stage.primary_key)
   WHEN MATCHED THEN 
   UPDATE SET col_name1 = stage.col_name1 , col_name2= stage.col_name2, col_name3 = {expr}
   WHEN NOT MATCHED THEN
   INSERT (col_name1 , col_name2, col_name3) VALUES (stage.col_name1, stage.col_name2, {expr});
   ```

1. Drop the staging table. 

   ```
   DROP TABLE stage;
   ```

# Performing a merge operation by specifying a column list without using the MERGE command
<a name="merge-specify-a-column-list"></a>

When you run the merge operation detailed in the procedure, put all of the steps in a single transaction. The transaction rolls back if any step fails. Using a single transaction also reduces the number of commits, which saves time and resources.

**To perform a merge operation by specifying a column list**

1. Put the entire operation in a single transaction block. 

   ```
   BEGIN transaction;
   … 
   END transaction;
   ```

1. Create a staging table, and then populate it with data to be merged, as shown in the following pseudocode. 

   ```
   create temp table stage (like target); 
   insert into stage 
   select * from source 
   where source.filter = 'filter_expression';
   ```

1. Update the target table by using an inner join with the staging table. 
   + In the UPDATE clause, explicitly list the columns to be updated. 
   + Perform an inner join with the staging table. 
   + If the distribution key is different from the primary key and the distribution key is not being updated, add a redundant join on the distribution key. To verify that the query will use a collocated join, run the query with [EXPLAIN](r_EXPLAIN.md) and check for DS_DIST_NONE on all of the joins. For more information, see [Evaluating the query plan](c_data_redistribution.md).
   + If your target table is sorted by timestamp, add a predicate to take advantage of range-restricted scans on the target table. For more information, see [Amazon Redshift best practices for designing queries](c_designing-queries-best-practices.md).
   + If you will not use all of the rows in the merge, add a clause to filter the rows that you want to change. For example, add an inequality filter on one or more columns to exclude rows that have not changed.
   + Put the update, delete, and insert operations in a single transaction block so that if there is a problem, everything will be rolled back.

    For example: 

   ```
   begin transaction;
   
   update target 
   set col1 = stage.col1, 
   col2 = stage.col2, 
   col3 = 'expression' 
   from stage 
   where target.primarykey = stage.primarykey 
   and target.distkey = stage.distkey 
   and target.col3 > 'last_update_time' 
   and (target.col1 != stage.col1 
   or target.col2 != stage.col2 
   or target.col3 = 'filter_expression');
   ```

1. Delete unneeded rows from the staging table by using an inner join with the target table. Some rows in the target table already match the corresponding rows in the staging table, and others were updated in the previous step. In either case, they are not needed for the insert. 

   ```
   delete from stage 
   using target 
   where stage.primarykey = target.primarykey;
   ```

1. Insert the remaining rows from the staging table. Use the same column list in the VALUES clause that you used in the UPDATE statement in step 3. 

   ```
   insert into target
   (select col1, col2, 'expression'
   from stage);
   
   end transaction;
   ```

1. Drop the staging table. 

   ```
   drop table stage;
   ```

# Merge examples
<a name="merge-examples"></a>

The following examples perform a merge to update the SALES table. The first example uses the simpler method of deleting from the target table and then inserting all of the rows from the staging table. The second example updates only selected columns in the target table, so it includes an extra update step. 

The [Merge examples](#merge-examples) use a sample dataset for Amazon Redshift, called the TICKIT data set. As a prerequisite, you can set up the TICKIT tables and data by following the instructions available in the guide [Getting started with common database tasks](https://docs.aws.amazon.com/redshift/latest/gsg/database-tasks.html). More detailed information about the sample data set is found at [Sample database](https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html). 

**Sample merge data source**

The examples in this section need a sample data source that includes both updates and inserts. For the examples, we will create a sample table named SALES_UPDATE that uses data from the SALES table. We'll populate the new table with random data that represents new sales activity for December. We will use the SALES_UPDATE sample table to create the staging table in the examples that follow. 

```
-- Create a sample table as a copy of the SALES table.

create table tickit.sales_update as
select * from tickit.sales;

-- Change every fifth row to have updates.

update tickit.sales_update
set qtysold = qtysold*2,
pricepaid = pricepaid*0.8,
commission = commission*1.1
where saletime > '2008-11-30'
and mod(sellerid, 5) = 0;

-- Add some new rows to have inserts.
-- This example creates a duplicate of every fourth row.

insert into tickit.sales_update
select (salesid + 172456) as salesid, listid, sellerid, buyerid, eventid, dateid, qtysold, pricepaid, commission, getdate() as saletime
from tickit.sales_update
where saletime > '2008-11-30'
and mod(sellerid, 4) = 0;
```

**Example of a merge that replaces existing rows based on matching keys**

The following script uses the SALES_UPDATE table to perform a merge operation on the SALES table with new data for December sales activity. This example replaces rows in the SALES table that have updates. For this example, we will update the qtysold and pricepaid columns, but leave commission and saletime unchanged.

```
MERGE into tickit.sales 
USING tickit.sales_update sales_update  
on ( sales.salesid = sales_update.salesid
and sales.listid = sales_update.listid
and sales_update.saletime > '2008-11-30'
and (sales.qtysold != sales_update.qtysold 
or sales.pricepaid != sales_update.pricepaid))
WHEN MATCHED THEN
update SET qtysold = sales_update.qtysold,
pricepaid = sales_update.pricepaid
WHEN NOT MATCHED THEN 
INSERT (salesid, listid, sellerid, buyerid, eventid, dateid, qtysold , pricepaid, commission, saletime)
values (sales_update.salesid, sales_update.listid, sales_update.sellerid, sales_update.buyerid, sales_update.eventid, 
sales_update.dateid, sales_update.qtysold , sales_update.pricepaid, sales_update.commission, sales_update.saletime);

-- Drop the staging table.
drop table tickit.sales_update;

-- Test to see that commission and saletime were not impacted.
SELECT sales.salesid, sales.commission, sales.saletime, sales_update.commission, sales_update.saletime 
FROM tickit.sales 
INNER JOIN tickit.sales_update sales_update  
ON 
sales.salesid = sales_update.salesid
AND sales.listid = sales_update.listid
AND sales_update.saletime > '2008-11-30'
AND (sales.commission != sales_update.commission 
OR sales.saletime != sales_update.saletime);
```

**Example of a merge that specifies a column list without using MERGE**

The following example performs a merge operation to update SALES with new data for December sales activity. We need sample data that includes both updates and inserts, along with rows that have not changed. For this example, we want to update the QTYSOLD and PRICEPAID columns but leave COMMISSION and SALETIME unchanged. The following script uses the SALES_UPDATE table to perform a merge operation on the SALES table. 

```
-- Create a staging table and populate it with rows from SALES_UPDATE for Dec
create temp table stagesales as select * from sales_update
where saletime > '2008-11-30';

-- Start a new transaction
begin transaction;

-- Update the target table using an inner join with the staging table
-- The join includes a redundant predicate to collocate on the distribution key
-- A filter on saletime enables a range-restricted scan on SALES

update sales
set qtysold = stagesales.qtysold,
pricepaid = stagesales.pricepaid
from stagesales
where sales.salesid = stagesales.salesid
and sales.listid = stagesales.listid
and stagesales.saletime > '2008-11-30'
and (sales.qtysold != stagesales.qtysold 
or sales.pricepaid != stagesales.pricepaid);
 
-- Delete matching rows from the staging table 
-- using an inner join with the target table

delete from stagesales
using sales
where sales.salesid = stagesales.salesid
and sales.listid = stagesales.listid;

-- Insert the remaining rows from the staging table into the target table
insert into sales
select * from stagesales;

-- End transaction and commit
end transaction;

-- Drop the staging table
drop table stagesales;
```

# Performing a deep copy
<a name="performing-a-deep-copy"></a>

A deep copy recreates and repopulates a table by using a bulk insert, which automatically sorts the table. If a table has a large unsorted region, a deep copy is much faster than a vacuum. We recommend that you make concurrent updates during a deep copy operation only if you can track them. After the process has completed, move the delta updates into the new table. A VACUUM operation supports concurrent updates automatically. 

You can choose one of the following methods to create a copy of the original table: 
+ Use the original table DDL. 

  If the CREATE TABLE DDL is available, this is the fastest and preferred method. If you create a new table, you can specify all table and column attributes, including primary key and foreign keys. You can find the original DDL by using the SHOW TABLE function.
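For example, assuming a table named `sample` in the `public` schema, SHOW TABLE returns the CREATE TABLE statement that you can reuse for the copy:

```
-- Retrieve the original DDL, including column attributes and keys.
show table public.sample;
```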
+ Use CREATE TABLE LIKE. 

  If the original DDL is not available, you can use CREATE TABLE LIKE to recreate the original table. The new table inherits the encoding, distribution key, sort key, and not-null attributes of the parent table. The new table doesn't inherit the primary key and foreign key attributes of the parent table, but you can add them using [ALTER TABLE](r_ALTER_TABLE.md).
+ Create a temporary table and truncate the original table. 

  If you must retain the primary key and foreign key attributes of the parent table, or if the parent table has dependencies, you can use CREATE TABLE ... AS (CTAS) to create a temporary table. Then truncate the original table and populate it from the temporary table. 

  Using a temporary table improves performance significantly compared to using a permanent table, but there is a risk of losing data. A temporary table is automatically dropped at the end of the session in which it is created. TRUNCATE commits immediately, even if it is inside a transaction block. If the TRUNCATE succeeds but the session shuts down before the following INSERT completes, the data is lost. If data loss is unacceptable, use a permanent table. 

After you create a copy of a table, you might have to grant access to the new table. You can use [GRANT](r_GRANT.md) to define access privileges. To view and grant all of a table's access privileges, you must be one of the following: 
+  A superuser. 
+  The owner of the table you want to copy. 
+  A user with the ACCESS SYSTEM TABLE privilege to see the table's privileges, and with the grant privilege for all relevant permissions. 

Additionally, you might have to grant usage permission for the schema your deep copy is in. Granting usage permission is necessary if your deep copy's schema is different from the original table's schema, and also isn't the `public` schema. To view and grant usage privileges you must be one of the following:
+  A superuser. 
+  A user who can grant the USAGE permission for the deep copy's schema. 

**To perform a deep copy using the original table DDL**

1. (Optional) Recreate the table DDL by running a script called `v_generate_tbl_ddl`. 

1. Create a copy of the table using the original CREATE TABLE DDL.

1. Use an INSERT INTO … SELECT statement to populate the copy with data from the original table. 

1. Check for permissions granted on the old table. You can see these permissions in the SVV_RELATION_PRIVILEGES system view.

1. If necessary, grant the permissions of the old table to the new table.

1. Grant usage permission to every group and user that has privileges in the original table. This step isn't necessary if your deep copy table is in the `public` schema, or is in the same schema as the original table.

1. Drop the original table.

1. Use an ALTER TABLE statement to rename the copy to the original table name.

The following example performs a deep copy on the SAMPLE table using a duplicate of SAMPLE named sample_copy.

```
--Create a copy of the original table in the sample_namespace namespace using the original CREATE TABLE DDL.
create table sample_namespace.sample_copy ( … );

--Populate the copy with data from the original table in the public namespace.
insert into sample_namespace.sample_copy (select * from public.sample);

--Check SVV_RELATION_PRIVILEGES for the original table's privileges.
select * from svv_relation_privileges where namespace_name = 'public' and relation_name = 'sample' order by identity_type, identity_id, privilege_type;

--Grant the original table's privileges to the copy table.
grant DELETE on table sample_namespace.sample_copy to group group1;
grant INSERT, UPDATE on table sample_namespace.sample_copy to group group2;
grant SELECT on table sample_namespace.sample_copy to user1;
grant INSERT, SELECT, UPDATE on table sample_namespace.sample_copy to user2;
         
--Grant usage permission to every group and user that has privileges in the original table.
grant USAGE on schema sample_namespace to group group1, group group2, user1, user2;

--Drop the original table.
drop table public.sample;

--Rename the copy table to match the original table's name.
alter table sample_namespace.sample_copy rename to sample;
```

**To perform a deep copy using CREATE TABLE LIKE**

1. Create a new table using CREATE TABLE LIKE. 

1. Use an INSERT INTO … SELECT statement to copy the rows from the current table to the new table. 

1. Check for permissions granted on the old table. You can see these permissions in the SVV_RELATION_PRIVILEGES system view.

1. If necessary, grant the permissions of the old table to the new table.

1. Grant usage permission to every group and user that has privileges in the original table. This step isn't necessary if your deep copy table is in the `public` schema, or is in the same schema as the original table.

1. Drop the current table. 

1. Use an ALTER TABLE statement to rename the new table to the original table name. 

The following example performs a deep copy on the SAMPLE table using CREATE TABLE LIKE.

```
--Create a copy of the original table in the sample_namespace namespace using CREATE TABLE LIKE.
create table sample_namespace.sample_copy (like public.sample);

--Populate the copy with data from the original table.
insert into sample_namespace.sample_copy (select * from public.sample);

--Check SVV_RELATION_PRIVILEGES for the original table's privileges.
select * from svv_relation_privileges where namespace_name = 'public' and relation_name = 'sample' order by identity_type, identity_id, privilege_type;

--Grant the original table's privileges to the copy table.
grant DELETE on table sample_namespace.sample_copy to group group1;
grant INSERT, UPDATE on table sample_namespace.sample_copy to group group2;
grant SELECT on table sample_namespace.sample_copy to user1;
grant INSERT, SELECT, UPDATE on table sample_namespace.sample_copy to user2;
         
--Grant usage permission to every group and user that has privileges in the original table.
grant USAGE on schema sample_namespace to group group1, group group2, user1, user2;

--Drop the original table.
drop table public.sample;

--Rename the copy table to match the original table's name.
alter table sample_namespace.sample_copy rename to sample;
```

**To perform a deep copy by creating a temporary table and truncating the original table**

1. Use CREATE TABLE AS to create a temporary table with the rows from the original table. 

1. Truncate the current table. 

1. Use an INSERT INTO … SELECT statement to copy the rows from the temporary table to the original table. 

1. Drop the temporary table. 

The following example performs a deep copy on the SALES table by creating a temporary table and truncating the original table. Since the original table remains, you don't need to grant permissions to the copy table.

```
--Create a temp table copy using CREATE TABLE AS.
create temp table salestemp as select * from sales;

--Truncate the original table.
truncate sales;

--Copy the rows from the temporary table to the original table.
insert into sales (select * from salestemp);

--Drop the temporary table.
drop table salestemp;
```

# Analyzing tables
<a name="t_Analyzing_tables"></a>

The ANALYZE operation updates the statistical metadata that the query planner uses to choose optimal plans.

In most cases, you don't need to explicitly run the ANALYZE command. Amazon Redshift monitors changes to your workload and automatically updates statistics in the background. In addition, the COPY command performs an analysis automatically when it loads data into an empty table. 

To explicitly analyze a table or the entire database, run the [ANALYZE](r_ANALYZE.md) command. 
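For example, the following statements analyze a single table from the TICKIT sample data set and then the entire current database:

```
-- Analyze one table.
analyze listing;

-- Analyze every table in the current database.
analyze;
```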

## Automatic analyze
<a name="t_Analyzing_tables-auto-analyze"></a>

Amazon Redshift continuously monitors your database and automatically performs analyze operations in the background. To minimize impact to your system performance, automatic analyze runs during periods when workloads are light. 

Automatic analyze is enabled by default. To turn off automatic analyze, set the `auto_analyze` parameter to **false** by modifying your cluster's parameter group. 

To reduce processing time and improve overall system performance, Amazon Redshift skips automatic analyze for any table where the extent of modifications is small. 

An analyze operation skips tables that have up-to-date statistics. If you run ANALYZE as part of your extract, transform, and load (ETL) workflow, automatic analyze skips tables that have current statistics. Similarly, an explicit ANALYZE skips tables when automatic analyze has updated the table's statistics. 

## Analysis of new table data
<a name="t_Analyzing_tables-new-tables"></a>

 By default, the COPY command performs an ANALYZE after it loads data into an empty table. You can force an ANALYZE regardless of whether a table is empty by setting STATUPDATE ON. If you specify STATUPDATE OFF, an ANALYZE is not performed. Only the table owner or a superuser can run the ANALYZE command or run the COPY command with STATUPDATE set to ON.

Amazon Redshift also analyzes new tables that you create with the following commands:
+ CREATE TABLE AS (CTAS) 
+ CREATE TEMP TABLE AS 
+ SELECT INTO 

Amazon Redshift returns a warning message when you run a query against a new table that was not analyzed after its data was initially loaded. No warning occurs when you query a table after a subsequent update or load. The same warning message is returned when you run the EXPLAIN command on a query that references tables that have not been analyzed.

Whenever adding data to a nonempty table significantly changes the size of the table, you can explicitly update statistics. You do so either by running an ANALYZE command or by using the STATUPDATE ON option with the COPY command. To view details about the number of rows that have been inserted or deleted since the last ANALYZE, query the [PG_STATISTIC_INDICATOR](r_PG_STATISTIC_INDICATOR.md) system catalog table. 
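The following query sketches how to check that change activity; the column names are those documented for PG_STATISTIC_INDICATOR:

```
-- stairows: total rows at the last ANALYZE; staiins and staidels:
-- rows inserted and deleted since then.
select stairelid, stairows, staiins, staidels
from pg_statistic_indicator;
```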

You can specify the scope of the [ANALYZE](r_ANALYZE.md) command to one of the following: 
+ The entire current database
+ A single table
+ One or more specific columns in a single table
+ Columns that are likely to be used as predicates in queries

 The ANALYZE command gets a sample of rows from the table, does some calculations, and saves resulting column statistics. By default, Amazon Redshift runs a sample pass for the DISTKEY column and another sample pass for all of the other columns in the table. If you want to generate statistics for a subset of columns, you can specify a comma-separated column list. You can run ANALYZE with the PREDICATE COLUMNS clause to skip columns that aren’t used as predicates.

 ANALYZE operations are resource intensive, so run them only on tables and columns that actually require statistics updates. You don't need to analyze all columns in all tables regularly or on the same schedule. If the data changes substantially, analyze the columns that are frequently used in the following:
+ Sorting and grouping operations
+ Joins
+ Query predicates

To reduce processing time and improve overall system performance, Amazon Redshift skips ANALYZE for any table that has a low percentage of changed rows, as determined by the [analyze_threshold_percent](r_analyze_threshold_percent.md) parameter. By default, the analyze threshold is set to 10 percent. You can change the analyze threshold for the current session by running a [SET](r_SET.md) command.
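For example, to lower the threshold to 5 percent for the current session:

```
set analyze_threshold_percent to 5;
```

Setting the threshold to 0 forces the analyze operation to run even when few or no rows have changed.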

Columns that are less likely to require frequent analysis are those that represent facts and measures and any related attributes that are never actually queried, such as large VARCHAR columns. For example, consider the LISTING table in the TICKIT database.

```
select "column", type, encoding, distkey, sortkey
from pg_table_def where tablename = 'listing';


column         |        type        | encoding | distkey | sortkey 
---------------+--------------------+----------+---------+---------
listid         | integer            | none     | t       | 1       
sellerid       | integer            | none     | f       | 0       
eventid        | integer            | mostly16 | f       | 0       
dateid         | smallint           | none     | f       | 0       
numtickets     | smallint           | mostly8  | f       | 0       
priceperticket | numeric(8,2)       | bytedict | f       | 0       
totalprice     | numeric(8,2)       | mostly32 | f       | 0       
listtime       | timestamp with...  | none     | f       | 0
```

If this table is loaded every day with a large number of new records, the LISTID column, which is frequently used in queries as a join key, must be analyzed regularly. If TOTALPRICE and LISTTIME are the frequently used constraints in queries, you can analyze those columns and the distribution key on every weekday.

```
analyze listing(listid, totalprice, listtime);
```

Suppose that the sellers and events in the application are much more static, and the date IDs refer to a fixed set of days covering only two or three years. In this case, the unique values for these columns don't change significantly. However, the number of instances of each unique value will increase steadily. 

In addition, consider the case where the NUMTICKETS and PRICEPERTICKET measures are queried infrequently compared to the TOTALPRICE column. In this case, you can run the ANALYZE command on the whole table once every weekend to update statistics for the five columns that are not analyzed daily. 
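The weekend run for this case can be as simple as analyzing the whole table:

```
-- Analyze all columns of the LISTING table.
analyze listing;
```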
<a name="t_Analyzing_tables-predicate-columns"></a>
**Predicate columns**  
As a convenient alternative to specifying a column list, you can choose to analyze only the columns that are likely to be used as predicates. When you run a query, any columns that are used in a join, filter condition, or group by clause are marked as predicate columns in the system catalog. When you run ANALYZE with the PREDICATE COLUMNS clause, the analyze operation includes only columns that meet the following criteria:
+ The column is marked as a predicate column.
+ The column is a distribution key.
+ The column is part of a sort key.

If none of a table's columns are marked as predicates, ANALYZE includes all of the columns, even when PREDICATE COLUMNS is specified. If no columns are marked as predicate columns, it might be because the table has not yet been queried. 

You might choose to use PREDICATE COLUMNS when your workload's query pattern is relatively stable. When the query pattern is variable, with different columns frequently being used as predicates, using PREDICATE COLUMNS might temporarily result in stale statistics. Stale statistics can lead to suboptimal query runtime plans and long runtimes. However, the next time you run ANALYZE using PREDICATE COLUMNS, the new predicate columns are included. 

To view details for predicate columns, use the following SQL to create a view named PREDICATE_COLUMNS. 

```
CREATE VIEW predicate_columns AS
WITH predicate_column_info as (
SELECT ns.nspname AS schema_name, c.relname AS table_name, a.attnum as col_num,  a.attname as col_name,
        CASE
            WHEN 10002 = s.stakind1 THEN array_to_string(stavalues1, '||') 
            WHEN 10002 = s.stakind2 THEN array_to_string(stavalues2, '||')
            WHEN 10002 = s.stakind3 THEN array_to_string(stavalues3, '||')
            WHEN 10002 = s.stakind4 THEN array_to_string(stavalues4, '||')
            ELSE NULL::varchar
        END AS pred_ts
   FROM pg_statistic s
   JOIN pg_class c ON c.oid = s.starelid
   JOIN pg_namespace ns ON c.relnamespace = ns.oid
   JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum)
SELECT schema_name, table_name, col_num, col_name,
       pred_ts NOT LIKE '2000-01-01%' AS is_predicate,
       CASE WHEN pred_ts NOT LIKE '2000-01-01%' THEN (split_part(pred_ts, '||',1))::timestamp ELSE NULL::timestamp END as first_predicate_use,
       CASE WHEN pred_ts NOT LIKE '%||2000-01-01%' THEN (split_part(pred_ts, '||',2))::timestamp ELSE NULL::timestamp END as last_analyze
FROM predicate_column_info;
```

Suppose that you run the following query against the LISTING table. Note that LISTID, LISTTIME, and EVENTID are used in the join, filter, and group by clauses.

```
select s.buyerid,l.eventid, sum(l.totalprice)
from listing l
join sales s on l.listid = s.listid
where l.listtime > '2008-12-01'
group by l.eventid, s.buyerid;
```

When you query the PREDICATE_COLUMNS view, as shown in the following example, you see that LISTID, EVENTID, and LISTTIME are marked as predicate columns.

```
select * from predicate_columns 
where table_name = 'listing';
```

```
schema_name | table_name | col_num | col_name       | is_predicate | first_predicate_use | last_analyze       
------------+------------+---------+----------------+--------------+---------------------+--------------------
public      | listing    |       1 | listid         | true         | 2017-05-05 19:27:59 | 2017-05-03 18:27:41
public      | listing    |       2 | sellerid       | false        |                     | 2017-05-03 18:27:41
public      | listing    |       3 | eventid        | true         | 2017-05-16 20:54:32 | 2017-05-03 18:27:41
public      | listing    |       4 | dateid         | false        |                     | 2017-05-03 18:27:41
public      | listing    |       5 | numtickets     | false        |                     | 2017-05-03 18:27:41
public      | listing    |       6 | priceperticket | false        |                     | 2017-05-03 18:27:41
public      | listing    |       7 | totalprice     | false        |                     | 2017-05-03 18:27:41
public      | listing    |       8 | listtime       | true         | 2017-05-16 20:54:32 | 2017-05-03 18:27:41
```

Keeping statistics current improves query performance by enabling the query planner to choose optimal plans. Amazon Redshift refreshes statistics automatically in the background, and you can also explicitly run the ANALYZE command. If you choose to explicitly run ANALYZE, do the following:
+ Run the ANALYZE command before running queries.
+ Run the ANALYZE command on the database routinely at the end of every regular load or update cycle.
+ Run the ANALYZE command on any new tables that you create and any existing tables or columns that undergo significant change.
+ Consider running ANALYZE operations on different schedules for different types of tables and columns, depending on their use in queries and their propensity to change.
+ To save time and cluster resources, use the PREDICATE COLUMNS clause when you run ANALYZE.

You don't have to explicitly run the ANALYZE command after restoring a snapshot to a provisioned cluster or serverless namespace, nor after resuming a paused provisioned cluster. Amazon Redshift preserves system table information in these cases, making manual ANALYZE commands unnecessary. Amazon Redshift will continue to run automatic analyze operations as needed.


## ANALYZE command history
<a name="c_check_last_analyze"></a>

It's useful to know when the last ANALYZE command was run on a table or database. When an ANALYZE command is run, Amazon Redshift runs multiple queries that look like this: 

```
padb_fetch_sample: select * from table_name
```

Query STL_ANALYZE to view the history of analyze operations. If Amazon Redshift analyzes a table using automatic analyze, the `is_background` column is set to `t` (true). Otherwise, it is set to `f` (false). The following example joins STV_TBL_PERM to show the table name and runtime details.

```
select distinct a.xid, trim(t.name) as name, a.status, a.rows, a.modified_rows, a.starttime, a.endtime
from stl_analyze a 
join stv_tbl_perm t  on t.id=a.table_id
where name = 'users'
order by starttime;


xid    | name  | status          | rows  | modified_rows | starttime           | endtime            
-------+-------+-----------------+-------+---------------+---------------------+--------------------
  1582 | users | Full            | 49990 |         49990 | 2016-09-22 22:02:23 | 2016-09-22 22:02:28
244287 | users | Full            | 24992 |         74988 | 2016-10-04 22:50:58 | 2016-10-04 22:51:01
244712 | users | Full            | 49984 |         24992 | 2016-10-04 22:56:07 | 2016-10-04 22:56:07
245071 | users | Skipped         | 49984 |             0 | 2016-10-04 22:58:17 | 2016-10-04 22:58:17
245439 | users | Skipped         | 49984 |          1982 | 2016-10-04 23:00:13 | 2016-10-04 23:00:13
(5 rows)
```

Alternatively, you can run a more complex query that returns all the statements that ran in every completed transaction that included an ANALYZE command: 

```
select xid, to_char(starttime, 'HH24:MI:SS.MS') as starttime,
datediff(sec,starttime,endtime ) as secs, substring(text, 1, 40)
from svl_statementtext
where sequence = 0
and xid in (select xid from svl_statementtext s where s.text like 'padb_fetch_sample%' )
order by xid desc, starttime;

xid  |  starttime   | secs |                  substring
-----+--------------+------+------------------------------------------
1338 | 12:04:28.511 |    4 | Analyze date
1338 | 12:04:28.511 |    1 | padb_fetch_sample: select count(*) from
1338 | 12:04:29.443 |    2 | padb_fetch_sample: select * from date
1338 | 12:04:31.456 |    1 | padb_fetch_sample: select * from date
1337 | 12:04:24.388 |    1 | padb_fetch_sample: select count(*) from
1337 | 12:04:24.388 |    4 | Analyze sales
1337 | 12:04:25.322 |    2 | padb_fetch_sample: select * from sales
1337 | 12:04:27.363 |    1 | padb_fetch_sample: select * from sales
...
```

# Vacuuming tables
<a name="t_Reclaiming_storage_space202"></a>

Amazon Redshift can automatically sort and perform a VACUUM DELETE operation on tables in the background. To clean up tables after a load or a series of incremental updates, you can also run the [VACUUM](r_VACUUM_command.md) command, either against the entire database or against individual tables.

**Note**  
Only users with the necessary table permissions can effectively vacuum a table. If VACUUM is run without the necessary table permissions, the operation completes successfully but has no effect. For a list of valid table permissions to effectively run VACUUM, see [VACUUM](r_VACUUM_command.md).  
For this reason, we recommend vacuuming individual tables as needed. We also recommend this approach because vacuuming the entire database is potentially an expensive operation.
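For example, the following statements sketch vacuuming a single table (the SALES table from the TICKIT sample data set is assumed):

```
-- Reclaim space and sort the rows of one table.
vacuum sales;

-- Alternatively, reclaim disk space without sorting.
vacuum delete only sales;
```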

## Automatic table sort
<a name="automatic-table-sort"></a>

Amazon Redshift automatically sorts data in the background to maintain table data in the order of its sort key. Amazon Redshift keeps track of your scan queries to determine which sections of the table will benefit from sorting. Amazon Redshift also keeps track of scan queries from concurrency scaling clusters. For multi-cluster architectures that use Amazon Redshift data sharing, Amazon Redshift also tracks scan queries originating from consumer clusters and workgroups in your data mesh, including clusters and workgroups in different AWS Regions. The scan statistics from the main cluster, concurrency scaling clusters, and consumer clusters are aggregated to determine which sections of the table will benefit from sorting.

Depending on the load on the system, Amazon Redshift automatically initiates the sort. This automatic sort lessens the need to run the VACUUM command to keep data in sort key order. If you need data fully sorted in sort key order, for example after a large data load, then you can still manually run the VACUUM command. To determine whether your table will benefit by running VACUUM SORT, monitor the `vacuum_sort_benefit` column in [SVV\_TABLE\_INFO](r_SVV_TABLE_INFO.md). 

Amazon Redshift tracks scan queries that use the sort key on each table. Amazon Redshift estimates the maximum percentage of improvement in scanning and filtering of data for each table (if the table was fully sorted). This estimate is visible in the `vacuum_sort_benefit` column in [SVV\_TABLE\_INFO](r_SVV_TABLE_INFO.md). You can use this column, along with the `unsorted` column, to determine when queries can benefit from manually running VACUUM SORT on a table. The `unsorted` column reflects the physical sort order of a table. The `vacuum_sort_benefit` column specifies the impact of sorting a table by manually running VACUUM SORT.

For example, consider the following query:

```
select "table", unsorted, vacuum_sort_benefit from svv_table_info order by 1;
```

```
 table | unsorted | vacuum_sort_benefit 
-------+----------+---------------------
 sales |    85.71 |                5.00
 event |    45.24 |               67.00
```

For the table “sales”, even though the table is about 86% physically unsorted, the query performance impact from the table being 86% unsorted is only 5%. This might be either because only a small portion of the table is accessed by queries, or very few queries accessed the table. For the table “event”, the table is about 45% physically unsorted. But the query performance impact of 67% indicates that either a larger portion of the table was accessed by queries, or the number of queries accessing the table was large. The table “event” can potentially benefit from running VACUUM SORT.

## Automatic vacuum delete
<a name="automatic-table-delete"></a>

When you perform a delete, the rows are marked for deletion, but not removed. Amazon Redshift automatically runs a VACUUM DELETE operation in the background based on the number of deleted rows in database tables. Amazon Redshift schedules the VACUUM DELETE to run during periods of reduced load and pauses the operation during periods of high load. 

**Topics**
+ [Automatic table sort](#automatic-table-sort)
+ [Automatic vacuum delete](#automatic-table-delete)
+ [VACUUM frequency](#vacuum-frequency)
+ [Sort stage and merge stage](#vacuum-stages)
+ [Vacuum threshold](#vacuum-sort-threshold)
+ [Vacuum types](#vacuum-types)
+ [Minimizing vacuum times](vacuum-managing-vacuum-times.md)

## VACUUM frequency
<a name="vacuum-frequency"></a>

You should vacuum as often as necessary to maintain consistent query performance. Consider these factors when determining how often to run your VACUUM command:
+ Run VACUUM during time periods when you expect minimal activity on the cluster, such as evenings or during designated database administration windows. 
+ Run VACUUM commands outside of maintenance windows. For more information, see [Schedule around maintenance windows](https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-avoid-maintenance.html).
+ A large unsorted region results in longer vacuum times. If you delay vacuuming, the vacuum will take longer because more data has to be reorganized. 
+ VACUUM is an I/O intensive operation, so the longer it takes for your vacuum to complete, the more impact it will have on concurrent queries and other database operations running on your cluster. 
+ VACUUM takes longer for tables that use interleaved sorting. To evaluate whether interleaved tables must be re-sorted, query the [SVV\_INTERLEAVED\_COLUMNS](r_SVV_INTERLEAVED_COLUMNS.md) view.

## Sort stage and merge stage
<a name="vacuum-stages"></a>

Amazon Redshift performs a vacuum operation in two stages: first, it sorts the rows in the unsorted region, then, if necessary, it merges the newly sorted rows at the end of the table with the existing rows. When vacuuming a large table, the vacuum operation proceeds in a series of steps consisting of incremental sorts followed by merges. If the operation fails or if Amazon Redshift goes offline during the vacuum, the partially vacuumed table or database will be in a consistent state, but you must manually restart the vacuum operation. Incremental sorts are lost, but merged rows that were committed before the failure do not need to be vacuumed again. If the unsorted region is large, the lost time might be significant. For more information about the sort and merge stages, see [Reduce the volume of merged rows](vacuum-managing-vacuum-times.md#vacuum-managing-volume-of-unmerged-rows).

Users can access tables while they are being vacuumed. You can perform queries and write operations while a table is being vacuumed, but when DML and a vacuum run concurrently, both might take longer. If you run UPDATE and DELETE statements during a vacuum, system performance might be reduced. Incremental merges temporarily block concurrent UPDATE and DELETE operations, and UPDATE and DELETE operations in turn temporarily block incremental merge steps on the affected tables. DDL operations, such as ALTER TABLE, are blocked until the vacuum operation finishes with the table.

**Note**  
Various modifiers to VACUUM control the way that it works. You can use them to tailor the vacuum operation for the current need. For example, using VACUUM RECLUSTER shortens the vacuum operation by not performing a full merge operation. For more information, see [VACUUM](r_VACUUM_command.md).

## Vacuum threshold
<a name="vacuum-sort-threshold"></a>

By default, VACUUM skips the sort phase for any table where more than 95 percent of the table's rows are already sorted. Skipping the sort phase can significantly improve VACUUM performance. To change the default sort threshold for a single table, include the table name and the TO *threshold* PERCENT parameter when you run the VACUUM command. 
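For example, the following sketch lowers the sort threshold for a single table, so the sort phase runs unless the table is already at least 75 percent sorted (the `sales` table name is illustrative):

```
vacuum full sales to 75 percent;
```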

## Vacuum types
<a name="vacuum-types"></a>

For information about different vacuum types, see [VACUUM](r_VACUUM_command.md).

# Minimizing vacuum times
<a name="vacuum-managing-vacuum-times"></a>

 Amazon Redshift automatically sorts data and runs VACUUM DELETE in the background. This lessens the need to run the VACUUM command. Vacuuming is potentially a time consuming process. Depending on the nature of your data, we recommend the following practices to minimize vacuum times.

**Topics**
+ [Decide whether to reindex](#r_vacuum-decide-whether-to-reindex)
+ [Reduce the size of the unsorted region](#r_vacuum_diskspacereqs)
+ [Reduce the volume of merged rows](#vacuum-managing-volume-of-unmerged-rows)
+ [Load your data in sort key order](#vacuum-load-in-sort-key-order)
+ [Use time series tables to reduce stored data](#vacuum-time-series-tables)

## Decide whether to reindex
<a name="r_vacuum-decide-whether-to-reindex"></a>

You can often significantly improve query performance by using an interleaved sort style, but over time performance might degrade if the distribution of the values in the sort key columns changes. 

When you initially load an empty interleaved table using COPY or CREATE TABLE AS, Amazon Redshift automatically builds the interleaved index. If you initially load an interleaved table using INSERT, you need to run VACUUM REINDEX afterwards to initialize the interleaved index. 

Over time, as you add rows with new sort key values, performance might degrade if the distribution of the values in the sort key columns changes. If your new rows fall primarily within the range of existing sort key values, you don’t need to reindex. Run VACUUM SORT ONLY or VACUUM FULL to restore the sort order. 

The query engine is able to use sort order to efficiently select which data blocks need to be scanned to process a query. For an interleaved sort, Amazon Redshift analyzes the sort key column values to determine the optimal sort order. If the distribution of key values changes, or skews, as rows are added, the sort strategy will no longer be optimal, and the performance benefit of sorting will degrade. To reanalyze the sort key distribution you can run a VACUUM REINDEX. The reindex operation is time consuming, so to decide whether a table will benefit from a reindex, query the [SVV\_INTERLEAVED\_COLUMNS](r_SVV_INTERLEAVED_COLUMNS.md) view. 

For example, the following query shows details for tables that use interleaved sort keys.

```
select tbl as tbl_id, stv_tbl_perm.name as table_name, 
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;


 tbl_id | table_name | col | interleaved_skew | last_reindex
--------+------------+-----+------------------+--------------------
 100048 | customer   |   0 |             3.65 | 2015-04-22 22:05:45
 100068 | lineorder  |   1 |             2.65 | 2015-04-22 22:05:45
 100072 | part       |   0 |             1.65 | 2015-04-22 22:05:45
 100077 | supplier   |   1 |             1.00 | 2015-04-22 22:05:45
(4 rows)
```

The value for `interleaved_skew` is a ratio that indicates the amount of skew. A value of 1 means that there is no skew. If the skew is greater than 1.4, a VACUUM REINDEX will usually improve performance unless the skew is inherent in the underlying set. 

You can use the date value in `last_reindex` to determine how long it has been since the last reindex. 
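For instance, in the sample output above the `lineorder` table shows a skew greater than 1.4, which makes it a candidate for a reindex:

```
vacuum reindex lineorder;
```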

## Reduce the size of the unsorted region
<a name="r_vacuum_diskspacereqs"></a>

The unsorted region grows when you load large amounts of new data into tables that already contain data or when you do not vacuum tables as part of your routine maintenance operations. To avoid long-running vacuum operations, use the following practices:
+ Run vacuum operations on a regular schedule. 

  If you load your tables in small increments (such as daily updates that represent a small percentage of the total number of rows in the table), running VACUUM regularly will help ensure that individual vacuum operations go quickly.
+ Run the largest load first.

  If you need to load a new table with multiple COPY operations, run the largest load first. When you run an initial load into a new or truncated table, all of the data is loaded directly into the sorted region, so no vacuum is required.
+ Truncate a table instead of deleting all of the rows. 

  Deleting rows from a table does not reclaim the space that the rows occupied until you perform a vacuum operation; however, truncating a table empties the table and reclaims the disk space, so no vacuum is required. Alternatively, drop the table and re-create it. 
+ Truncate or drop test tables. 

  If you are loading a small number of rows into a table for test purposes, don't delete the rows when you are done. Instead, truncate the table and reload those rows as part of the subsequent production load operation. 
+ Perform a deep copy. 

   If a table that uses a compound sort key has a large unsorted region, a deep copy is much faster than a vacuum. A deep copy recreates and repopulates a table by using a bulk insert, which automatically re-sorts the table. The trade-off is that you cannot make concurrent updates during a deep copy operation, which you can do during a vacuum. For more information, see [Amazon Redshift best practices for designing queries](c_designing-queries-best-practices.md). 
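The following is a minimal deep-copy sketch, assuming a hypothetical table named `sales`:

```
-- Recreate the table with the same structure, repopulate it with a
-- bulk insert (which re-sorts the rows), then swap it into place
create table sales_copy (like sales);
insert into sales_copy (select * from sales);
drop table sales;
alter table sales_copy rename to sales;
```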

## Reduce the volume of merged rows
<a name="vacuum-managing-volume-of-unmerged-rows"></a>

If a vacuum operation needs to merge new rows into a table's sorted region, the time required for a vacuum will increase as the table grows larger. You can improve vacuum performance by reducing the number of rows that must be merged. 

Before a vacuum, a table consists of a sorted region at the head of the table, followed by an unsorted region, which grows whenever rows are added or updated. When a set of rows is added by a COPY operation, the new set of rows is sorted on the sort key as it is added to the unsorted region at the end of the table. The new rows are ordered within their own set, but not within the unsorted region. 

The following diagram illustrates the unsorted region after two successive COPY operations, where the sort key is CUSTID. For simplicity, this example shows a compound sort key, but the same principles apply to interleaved sort keys, except that the impact of the unsorted region is greater for interleaved tables. 

![\[An unsorted table holding records from two COPY operations.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/vacuum-unsorted-region.png)


A vacuum restores the table's sort order in two stages:

1. Sort the unsorted region into a newly-sorted region. 

   The first stage is relatively cheap, because only the unsorted region is rewritten. If the range of sort key values of the newly sorted region is higher than the existing range, only the new rows need to be rewritten, and the vacuum is complete. For example, if the sorted region contains ID values 1 to 500 and subsequent copy operations add key values greater than 500, then only the unsorted region needs to be rewritten. 

1. Merge the newly-sorted region with the previously-sorted region. 

   If the keys in the newly sorted region overlap the keys in the sorted region, then VACUUM needs to merge the rows. Starting at the beginning of the newly-sorted region (at the lowest sort key), the vacuum writes the merged rows from the previously sorted region and the newly sorted region into a new set of blocks. 

The extent to which the new sort key range overlaps the existing sort keys determines the extent to which the previously-sorted region will need to be rewritten. If the unsorted keys are scattered throughout the existing sort range, a vacuum might need to rewrite existing portions of the table. 

The following diagram shows how a vacuum would sort and merge rows that are added to a table where CUSTID is the sort key. Because each copy operation adds a new set of rows with key values that overlap the existing keys, almost the entire table needs to be rewritten. The diagram shows single sort and merge, but in practice, a large vacuum consists of a series of incremental sort and merge steps. 

![\[A VACUUM operation on the example table in two steps. First the new rows are sorted, then they are merged with the existing rows.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/vacuum-unsorted-region-sort-merge.png)


If the range of sort keys in a set of new rows overlaps the range of existing keys, the cost of the merge stage continues to grow in proportion to the table size as the table grows while the cost of the sort stage remains proportional to the size of the unsorted region. In such a case, the cost of the merge stage overshadows the cost of the sort stage, as the following diagram shows.

![\[Diagram showing how the merge stage becomes more costly when new rows have sort keys overlapping with existing rows.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/vacuum-example-merge-region-grows.png)


To determine what proportion of a table was remerged, query SVV\_VACUUM\_SUMMARY after the vacuum operation completes. The following query shows the effect of six successive vacuums as CUSTSALES grew larger over time.

```
select * from svv_vacuum_summary
where table_name = 'custsales';


 table_name | xid  | sort_      | merge_     | elapsed_   | row_  | sortedrow_ | block_  | max_merge_
            |      | partitions | increments | time       | delta | delta      | delta   | partitions
 -----------+------+------------+------------+------------+-------+------------+---------+---------------
  custsales | 7072 |          3 |          2 |  143918314 |     0 |   88297472 |   1524  |      47
  custsales | 7122 |          3 |          3 |  164157882 |     0 |   88297472 |    772  |      47
  custsales | 7212 |          3 |          4 |  187433171 |     0 |   88297472 |    767  |      47
  custsales | 7289 |          3 |          4 |  255482945 |     0 |   88297472 |    770  |      47
  custsales | 7420 |          3 |          5 |  316583833 |     0 |   88297472 |    769  |      47
  custsales | 9007 |          3 |          6 |  306685472 |     0 |   88297472 |    772  |      47
 (6 rows)
```

The `merge_increments` column gives an indication of the amount of data that was merged for each vacuum operation. If the number of merge increments over consecutive vacuums increases in proportion to the growth in table size, it indicates that each vacuum operation is remerging an increasing number of rows in the table because the existing and newly sorted regions overlap. 

## Load your data in sort key order
<a name="vacuum-load-in-sort-key-order"></a>

If you load your data in sort key order using a COPY command, you might reduce or even remove the need to vacuum. 

COPY automatically adds new rows to the table's sorted region when all of the following are true:
+ The table uses a compound sort key with only one sort column. 
+ The sort column is NOT NULL. 
+ The table is 100 percent sorted or empty. 
+ All the new rows are higher in sort order than the existing rows, including rows marked for deletion. In this instance, Amazon Redshift uses the first eight bytes of the sort key to determine sort order.
+  The COPY command does not trigger certain load optimizations. When loading large volumes of data, Amazon Redshift might optimize performance by creating new sorted partitions rather than adding rows to the table's sorted region. 

For example, suppose you have a table that records customer events using a customer ID and time. If you sort on customer ID, it’s likely that the sort key range of new rows added by incremental loads will overlap the existing range, as shown in the previous example, leading to an expensive vacuum operation. 

If you set your sort key to a timestamp column, your new rows will be appended in sort order at the end of the table, as the following diagram shows, reducing or even removing the need to vacuum.

![\[A table that uses a timestamp column as the sort key, getting new records that don't need to be sorted.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/vacuum-unsorted-region-date-sort.png)
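A minimal sketch of this pattern, assuming hypothetical table, column, bucket, and role names:

```
-- Sorting on the event timestamp means each incremental COPY appends
-- rows after the existing sort key range, so little or no vacuum is needed
create table customer_events (
    custid    integer   not null,
    eventtype varchar(20),
    event_ts  timestamp not null
)
compound sortkey (event_ts);

copy customer_events
from 's3://amzn-s3-demo-bucket/events/'
iam_role default
format as csv;
```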


## Use time series tables to reduce stored data
<a name="vacuum-time-series-tables"></a>

If you maintain data for a rolling time period, use a series of tables, as the following diagram illustrates.

![\[Five tables with data from five quarters. The oldest table is deleted to maintain a year of rolling time.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/vacuum-example-unsorted-region-copy-time-series.png)


Create a new table each time you add a set of data, then delete the oldest table in the series. You gain a double benefit: 
+ You avoid the added cost of deleting rows, because a DROP TABLE operation is much more efficient than a mass DELETE.
+ If the tables are sorted by timestamp, no vacuum is needed. If each table contains data for one month, a vacuum will at most have to rewrite one month’s worth of data, even if the tables are not sorted by timestamp.

You can create a UNION ALL view for use by reporting queries that hides the fact that the data is stored in multiple tables. If a query filters on the sort key, the query planner can efficiently skip all the tables that aren't used. A UNION ALL can be less efficient for other types of queries, so you should evaluate query performance in the context of all queries that use the tables.
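A sketch of such a view, assuming hypothetical monthly tables; a late-binding view (WITH NO SCHEMA BINDING) doesn't take a dependency on the underlying tables, which is convenient when tables in the series are regularly dropped and created:

```
create view sales_all as
select * from sales_2025_01
union all
select * from sales_2025_02
union all
select * from sales_2025_03
with no schema binding;
```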

# Managing concurrent write operations
<a name="c_Concurrent_writes"></a>

Some applications require not only concurrent querying and loading, but also the ability to write to multiple tables or the same table concurrently. In this context, *concurrently* means overlapping, not scheduled to run at precisely the same time. Two transactions are considered to be concurrent if the second one starts before the first commits. Concurrent operations can originate from different sessions that are controlled either by the same user or by different users. 

Amazon Redshift supports these types of applications by allowing tables to be read while they are being incrementally loaded or modified. Queries simply see the latest committed version, or *snapshot*, of the data, rather than waiting for the next version to be committed. If you want a particular query to wait for a commit from another write operation, you have to schedule it accordingly.

**Note**  
Amazon Redshift supports a default *automatic commit* behavior in which each separately run SQL command commits individually. If you enclose a set of commands in a transaction block (defined by [BEGIN](r_BEGIN.md) and [END](r_END.md) statements), the block commits as one transaction, so you can roll it back if necessary. Exceptions to this behavior are the TRUNCATE and VACUUM commands, which automatically commit all outstanding changes made in the current transaction.   
Some SQL clients issue BEGIN and COMMIT commands automatically, so the client controls whether a group of statements are run as a transaction or each individual statement is run as its own transaction. Check the documentation for the interface you are using. For example, when using the Amazon Redshift JDBC driver, a JDBC `PreparedStatement` with a query string that contains multiple (semicolon separated) SQL commands runs all the statements as a single transaction. In contrast, if you use SQL Workbench/J and set AUTO COMMIT ON, then if you run multiple statements, each statement runs as its own transaction. 
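For example, wrapping related statements in an explicit transaction block makes them commit, or roll back, together (the table and column names here come from the TICKIT sample schema and are illustrative):

```
begin;
delete from sales where qtysold = 0;
update sales set commission = pricepaid * 0.10;
end;
```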

The following topics describe some of the key concepts and use cases that involve transactions, database snapshots, updates, and concurrent behavior.

**Topics**
+ [Isolation levels in Amazon Redshift](c_serial_isolation.md)
+ [Write and read/write operations](c_write_readwrite.md)
+ [Concurrent write examples](r_Serializable_isolation_example.md)
+ [Troubleshooting serializable isolation errors](c_serial_isolation-serializable-isolation-troubleshooting.md)

# Isolation levels in Amazon Redshift
<a name="c_serial_isolation"></a>

Concurrent write operations are supported in Amazon Redshift in a protective way, using write locks on tables and the principle of *serializable isolation*. Serializable isolation preserves the illusion that a transaction running against a table is the only transaction that is running against that table.

Amazon Redshift databases support concurrent write operations by having each operation use the latest committed version, or snapshot, of their data at the start of the transaction. A database snapshot is created within a transaction on the first occurrence of most SELECT statements, DML commands such as COPY, DELETE, INSERT, UPDATE, and TRUNCATE, and the following DDL commands:
+  ALTER TABLE (to add or drop columns) 
+  CREATE TABLE 
+  DROP TABLE 
+  TRUNCATE TABLE 

No other transaction is able to change this snapshot, meaning transactions are isolated from one another. That is, concurrent transactions are invisible to each other and can’t detect each other's changes.

Any concurrent execution of transactions must produce the same results as the serial execution of those transactions. If no serial execution of those transactions can produce the same results, the transaction that runs a statement that might break the ability to serialize is stopped and rolled back.

For example, suppose that a user attempts to run two concurrent transactions, T1 and T2. Running T1 and T2 must produce the same results as at least one of the following scenarios:
+ T1 and T2 run serially in that order.
+ T2 and T1 run serially in that order.

 Isolation levels in Amazon Redshift prevent the following problems: 
+  Dirty reads – A dirty read occurs when a transaction reads data that has not yet been committed. For example, suppose transaction 1 updates a row. Transaction 2 reads the updated row before T1 commits the update. If T1 rolls back the change, T2 will have read data in uncommitted rows that Amazon Redshift now considers to never have existed. 
+  Non-repeatable reads – A non-repeatable read occurs when a single transaction reads the same row twice but gets different data each time. For example, suppose transaction 1 reads a row. Transaction 2 updates or deletes that row and commits the update or delete. If T1 rereads the row, it retrieves different row values or discovers that the row has been deleted. 
+  Phantoms – A phantom is a row that matches the search criteria but is not initially seen. For example, suppose transaction 1 reads a set of rows that satisfies its search criteria. Transaction 2 generates a new row in an UPDATE or INSERT statement that matches the search criteria for T1. If T1 reruns its search statement, it gets a different set of rows. 

## SNAPSHOT and SERIALIZABLE isolation
<a name="c_serial_isolation-snapshot_and_serializable"></a>

SNAPSHOT and SERIALIZABLE isolation are the two serializable isolation levels available in Amazon Redshift. 

SNAPSHOT isolation is the default isolation level when you create provisioned clusters and serverless workgroups. It lets you process larger volumes of data in less time than SERIALIZABLE isolation.

SERIALIZABLE isolation takes more time, but implements stricter constraints on concurrent transactions. This isolation level prevents problems such as write-skew anomalies by only allowing one transaction to commit, while canceling all other concurrent transactions with a serializable isolation violation error.

Following is a timeline example of how two concurrent write operations would be handled when using SNAPSHOT isolation. Each user’s UPDATE statement is allowed to commit because they don’t conflict by attempting to update the same rows.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html)

If the same scenario is run using serializable isolation, then Amazon Redshift terminates user 2 due to a serializable violation and returns error `1023`. For more information, see [Troubleshooting serializable isolation errors](c_serial_isolation-serializable-isolation-troubleshooting.md). In this case, only user 1 can commit successfully. 

## Considerations
<a name="c_serial_isolation-considerations"></a>

When using isolation levels in Amazon Redshift, consider the following:
+  Query the STV\_DB\_ISOLATION\_LEVEL catalog view to view which isolation level your database is using. For more information, see [STV\_DB\_ISOLATION\_LEVEL](r_STV_DB_ISOLATION_LEVEL.md). 
+  Query the PG\_DATABASE\_INFO view to see how many concurrent transactions are supported for your database. For more information, see [PG\_DATABASE\_INFO](r_PG_DATABASE_INFO.md). 
+  System catalog tables (PG) and other Amazon Redshift system tables aren't locked in a transaction. Therefore, changes to database objects that arise from DDL and TRUNCATE operations are visible on commit to any concurrent transactions. 

   For example, suppose that table A exists in the database when two concurrent transactions, T1 and T2, start. Suppose that T2 returns a list of tables by selecting from the PG\_TABLES catalog table. Then T1 drops table A and commits, and then T2 lists the tables again. Table A is now no longer listed. If T2 tries to query the dropped table, Amazon Redshift returns a "relation does not exist" error. The catalog query that returns the list of tables to T2 or checks that table A exists isn't subject to the same isolation rules as operations performed on user tables. 

   Transactions for updates to these tables run in a read committed isolation mode. 
+  PG-prefix catalog tables don't support SNAPSHOT isolation. 
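For example, you might check the current isolation level and, if needed, change it (the database name is illustrative; changing the isolation level requires that no other users are connected to the database):

```
-- View the isolation level in use
select * from stv_db_isolation_level;

-- Switch the database to SERIALIZABLE isolation
alter database mydb isolation level serializable;
```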

# Write and read/write operations
<a name="c_write_readwrite"></a>

You can manage the specific behavior of concurrent write operations by deciding when and how to run different types of commands. The following commands are relevant to this discussion: 
+ COPY commands, which perform loads (initial or incremental)
+ INSERT commands that append one or more rows at a time
+ UPDATE commands, which modify existing rows
+ DELETE commands, which remove rows 

COPY and INSERT operations are pure write operations. DELETE and UPDATE operations are read/write operations (for rows to be deleted or updated, they have to be read first). The results of concurrent write operations depend on the specific commands that are being run concurrently. 

UPDATE and DELETE operations behave differently because they rely on an initial table read before they do any writes. Given that concurrent transactions are invisible to each other, both UPDATEs and DELETEs have to read a snapshot of the data from the last commit. When the first UPDATE or DELETE releases its lock, the second UPDATE or DELETE needs to determine whether the data that it is going to work with is potentially stale. It will not be stale, because the second transaction does not obtain its snapshot of data until after the first transaction has released its lock.

## Potential deadlock situation for concurrent write transactions involving multiple tables
<a name="c_write_readwrite-potential-deadlock"></a>

When transactions involve updates of more than one table, there is always the possibility of concurrently running transactions becoming deadlocked when they both try to write to the same set of tables. A transaction releases all of its table locks at once when it either commits or rolls back; it doesn't relinquish locks one at a time.

For example, suppose that transactions T1 and T2 start at roughly the same time. If T1 starts writing to table A and T2 starts writing to table B, both transactions can proceed without conflict. However, if T1 finishes writing to table A and needs to start writing to table B, it won’t be able to proceed because T2 still holds the lock on B. Similarly, if T2 finishes writing to table B and needs to start writing to table A, it will not be able to proceed either because T1 still holds the lock on A. Because neither transaction can release its locks until all its write operations are committed, neither transaction can proceed. To avoid this kind of deadlock, you need to schedule concurrent write operations carefully. For example, you should always update tables in the same order in transactions and, if specifying locks, lock tables in the same order before you perform any DML operations. 
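As a sketch of the lock-ordering approach, both transactions could acquire locks on both tables in the same order before doing any DML, so neither can hold one lock while waiting on the other (table names are illustrative):

```
begin;
lock table_a;
lock table_b;
-- ... DML against table_a and table_b ...
end;
```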

## Potential deadlock situation for concurrent write transactions involving a single table
<a name="c_write_readwrite-potential-deadlock-single"></a>

In a snapshot isolation environment, deadlocks can occur when running concurrent write transactions on the same table. The snapshot isolation deadlock happens when concurrent INSERT or COPY statements are sharing a lock and making progress, and another statement needs to perform an operation (UPDATE, DELETE, MERGE, or DDL operation) that requires an exclusive lock on the same table. 

Consider the following scenario:

Transaction 1 (T1):

```
INSERT/COPY INTO table_A;
```

Transaction 2 (T2):

```
INSERT/COPY INTO table_A; 
            <UPDATE/DELETE/MERGE/DDL statement> table_A
```

A deadlock can occur when multiple transactions with INSERT or COPY operations are running concurrently on the same table with a shared lock, and one of those transactions follows its pure write operation with an operation that requires an exclusive lock, such as an UPDATE, MERGE, DELETE, or DDL statement.

To avoid the deadlock in these situations, you can separate statements requiring an exclusive lock (UPDATE/MERGE/DELETE/DDL statements) into a different transaction so that any INSERT/COPY statements can progress simultaneously, and the statements requiring exclusive locks can run after them. Alternatively, for transactions with INSERT/COPY statements and UPDATE/MERGE/DELETE statements on the same table, you can include retry logic in your applications to work around potential deadlocks. 
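A sketch of the first approach, with the shared-lock and exclusive-lock work in separate transactions (table, column, bucket, and role names are illustrative):

```
-- Transaction 1: pure write, shares its lock with concurrent loads
begin;
copy table_a from 's3://amzn-s3-demo-bucket/data/' iam_role default;
end;

-- Transaction 2: runs afterward, takes an exclusive lock
begin;
update table_a set col1 = 0 where col1 is null;
end;
```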

# Concurrent write examples
<a name="r_Serializable_isolation_example"></a>

The following pseudo-code examples demonstrate how transactions either proceed or wait when they are run concurrently.

## Concurrent write examples with serializable isolation
<a name="r_Serializable_isolation_example-serializable"></a>

### Concurrent COPY operations into the same table with serializable isolation
<a name="r_Serializable_isolation_example-concurrent-copy-operations-into-the-same-table"></a>

Transaction 1 copies rows into the LISTING table: 

```
begin;
copy listing from ...;
end;
```

Transaction 2 starts concurrently in a separate session and attempts to copy more rows into the LISTING table. Transaction 2 must wait until transaction 1 releases the write lock on the LISTING table, then it can proceed. 

```
begin;
[waits]
copy listing from ...;
end;
```

The same behavior would occur if one or both transactions contained an INSERT command instead of a COPY command.

### Concurrent DELETE operations from the same table with serializable isolation
<a name="r_Serializable_isolation_example-concurrent-delete-operations-from-the-same-table"></a>

Transaction 1 deletes rows from a table: 

```
begin;
delete from listing where ...;
end;
```

Transaction 2 starts concurrently and attempts to delete rows from the same table. It will succeed because it waits for transaction 1 to complete before attempting to delete rows.

```
begin;
[waits]
delete from listing where ...;
end;
```

The same behavior would occur if one or both transactions contained an UPDATE command to the same table instead of a DELETE command.

### Concurrent transactions with a mixture of read and write operations with serializable isolation
<a name="r_Serializable_isolation_example-concurrent-transactions"></a>

In this example, transaction 1 deletes rows from the USERS table, reloads the table, runs a COUNT(*) query, and then runs ANALYZE before committing: 

```
begin;
delete one row from USERS table;
copy ...;
select count(*) from users;
analyze;
end;
```

Meanwhile, transaction 2 starts. This transaction attempts to copy additional rows into the USERS table, analyze the table, and then run the same COUNT(*) query as the first transaction:

```
begin;
[waits]
copy users from ...;
select count(*) from users;
analyze;
end;
```

The second transaction succeeds because it waits for the first to complete before proceeding. Its COUNT query then returns a count that reflects both completed loads.

## Concurrent write examples with snapshot isolation
<a name="r_Serializable_isolation_example-snapshot"></a>

### Concurrent COPY operations into the same table with snapshot isolation
<a name="r_Serializable_isolation_example-concurrent-copy-operations-into-the-same-table-snapshot"></a>

Transaction 1 copies rows into the LISTING table:

```
begin;
copy listing from ...;
end;
```

Transaction 2 starts concurrently in a separate session and attempts to copy more rows into the LISTING table. Transaction 2 can progress simultaneously until either transaction needs to write data to the target table `listing`, at which point they will run sequentially. 

```
begin; 
//When the COPY statement from T1 needs to write data to the table, the COPY statement from T2 waits.
copy listing from ...; 
end;
```

The same behavior would occur if one or both transactions contained an INSERT command instead of a COPY command.

### Concurrent DELETE operations from the same table with snapshot isolation
<a name="r_Serializable_isolation_example-concurrent-delete-operations-from-the-same-table-snapshot"></a>

Concurrent DELETE or UPDATE operations from the same table with snapshot isolation run the same as operations run with serializable isolation.

### Concurrent transactions with a mixture of read and write operations with snapshot isolation
<a name="r_Serializable_isolation_example-concurrent-transactions-snapshot"></a>

Concurrent transactions that mix read and write operations behave the same with snapshot isolation as they do with serializable isolation.

# Troubleshooting serializable isolation errors
<a name="c_serial_isolation-serializable-isolation-troubleshooting"></a>

## ERROR:1023 DETAIL: Serializable isolation violation on a table in Redshift
<a name="c_serial_isolation-serialization-isolation-1023"></a>

When Amazon Redshift detects a serializable isolation error, you see an error message such as the following.

```
ERROR:1023 DETAIL: Serializable isolation violation on table in Redshift
```

To address a serializable isolation error, you can try the following methods:
+ Retry the canceled transaction.

   Amazon Redshift detected that a concurrent workload is not serializable. This suggests gaps in the application logic, which can usually be worked around by retrying the transaction that encountered the error. If the issue persists, try one of the other methods. 
+ Move any operations that don't have to be in the same atomic transaction outside of the transaction.

  This method applies when individual operations inside two transactions cross-reference each other in a way that can affect the outcome of the other transaction. For example, the following two sessions each start a transaction. 

  ```
  Session1_Redshift=# begin;
  ```

  ```
  Session2_Redshift=# begin;
  ```

  The result of a SELECT statement in each transaction might be affected by an INSERT statement in the other. Suppose that you run the following statements serially, in any order. In every case, one of the SELECT statements returns one more row than it would if the transactions ran concurrently. There is no order in which the operations can run serially that produces the same result as when they run concurrently. Thus, the last operation that runs results in a serializable isolation error.

  ```
  Session1_Redshift=# select * from tab1;
  Session1_Redshift=# insert into tab2 values (1);
  ```

  ```
  Session2_Redshift=# insert into tab1 values (1);
  Session2_Redshift=# select * from tab2;
  ```

  In many cases, the result of the SELECT statements isn't important. In other words, the atomicity of the operations in the transactions isn't important. In these cases, move the SELECT statements outside of their transactions, as shown in the following examples.

  ```
  Session1_Redshift=# begin;
  Session1_Redshift=# insert into tab1 values (1);
  Session1_Redshift=# end;
  Session1_Redshift=# select * from tab2;
  ```

  ```
  Session2_Redshift=# select * from tab1;
  Session2_Redshift=# begin;
  Session2_Redshift=# insert into tab2 values (1);
  Session2_Redshift=# end;
  ```

  In these examples, there are no cross-references in the transactions. The two INSERT statements don't affect each other. In these examples, there is at least one order in which the transactions can run serially and produce the same result as if run concurrently. This means that the transactions are serializable.
+ Force serialization by locking all tables in each session.

  The [LOCK](r_LOCK.md) command blocks operations that can result in serializable isolation errors. When you use the LOCK command, be sure to do the following:
  + Lock all tables affected by the transaction, including those affected by read-only SELECT statements inside the transaction.
  + Lock tables in the same order, regardless of the order that operations are performed in.
  + Lock all tables at the beginning of the transaction, before performing any operations.
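
  For example, a minimal sketch of this pattern using the `tab1` and `tab2` tables from the sessions above — each session locks both tables, in the same order, before any other operation:

  ```
  begin;
  lock tab1, tab2;
  insert into tab2 values (1);
  select * from tab1;
  end;
  ```
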
+ Use snapshot isolation for concurrent transactions.

  Use an ALTER DATABASE command with snapshot isolation. For more information about the SNAPSHOT parameter for ALTER DATABASE, see [Parameters](r_ALTER_DATABASE.md#r_ALTER_DATABASE-parameters).
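
  For example, assuming a database named `sampledb` (a placeholder name), the command looks like this:

  ```
  ALTER DATABASE sampledb ISOLATION LEVEL SNAPSHOT;
  ```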

## ERROR:1018 DETAIL: Relation does not exist
<a name="c_serial_isolation-serialization-isolation-1018"></a>

When you run concurrent Amazon Redshift operations in different sessions, you might see an error message such as the following.

```
ERROR: 1018 DETAIL: Relation does not exist.
```

Transactions in Amazon Redshift follow snapshot isolation. After a transaction begins, Amazon Redshift takes a snapshot of the database. For the entire lifecycle of the transaction, the transaction operates on the state of the database as reflected in the snapshot. If the transaction reads from a table that doesn't exist in the snapshot, it throws the 1018 error message shown previously. Even when another concurrent transaction creates a table after the transaction has taken the snapshot, the transaction can't read from the newly created table.

To address this serialization isolation error, you can try to move the start of the transaction to a point where you know the table exists.

If the table is created by another transaction, this point is at least after that transaction has committed. Also, ensure that no concurrent transaction that might have dropped the table has committed. Consider the following sequence of sessions.

```
session1 = # BEGIN;
session1 = # DROP TABLE A;
session1 = # COMMIT;
```

```
session2 = # BEGIN;
```

```
session3 = # BEGIN;
session3 = # CREATE TABLE A (id INT);
session3 = # COMMIT;
```

```
session2 = # SELECT * FROM A;
```

The last operation, the read by session2, results in a serializable isolation error. The error happens because session2 takes its snapshot after session1 has already committed the DROP, so the table isn't in the snapshot. Even though the concurrent session3 creates the table afterward, session2 doesn't see it because it isn't in the snapshot.

To resolve this error, you can reorder the sessions as follows.

```
session1 = # BEGIN;
session1 = # DROP TABLE A;
session1 = # COMMIT;
```

```
session3 = # BEGIN;
session3 = # CREATE TABLE A (id INT);
session3 = # COMMIT;
```

```
session2 = # BEGIN;
session2 = # SELECT * FROM A;
```

Now when session2 takes its snapshot, session3 has already been committed, and the table is in the database. Session2 can read from the table without any error.

# Tutorial: Loading data from Amazon S3
<a name="tutorial-loading-data"></a>

In this tutorial, you walk through the process of loading data into your Amazon Redshift database tables from data files in an Amazon S3 bucket from beginning to end. 

In this tutorial, you do the following: 
+ Download data files that use comma-separated value (CSV), character-delimited, and fixed width formats. 
+ Create an Amazon S3 bucket and then upload the data files to the bucket. 
+ Launch an Amazon Redshift cluster and create database tables. 
+ Use COPY commands to load the tables from the data files on Amazon S3. 
+ Troubleshoot load errors and modify your COPY commands to correct the errors.

## Prerequisites
<a name="tutorial-loading-data-prerequisites"></a>

You need the following prerequisites:
+ An AWS account to launch an Amazon Redshift cluster and to create a bucket in Amazon S3.
+ Your AWS credentials (IAM role) to load test data from Amazon S3. If you need a new IAM role, go to [Creating IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html).
+ An SQL client such as the Amazon Redshift console query editor. 

This tutorial is designed so that it can be taken by itself. In addition to this tutorial, we recommend completing the following tutorials to gain a more complete understanding of how to design and use Amazon Redshift databases: 
+ [Amazon Redshift Getting Started Guide](https://docs.aws.amazon.com/redshift/latest/gsg/) walks you through the process of creating an Amazon Redshift cluster and loading sample data. 

## Overview
<a name="tutorial-loading-data-overview"></a>

You can add data to your Amazon Redshift tables either by using an INSERT command or by using a COPY command. At the scale and speed of an Amazon Redshift data warehouse, the COPY command is many times faster and more efficient than INSERT commands. 

The COPY command uses the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from multiple data sources. You can load from data files on Amazon S3, Amazon EMR, or any remote host accessible through a Secure Shell (SSH) connection. Or you can load directly from an Amazon DynamoDB table. 

In this tutorial, you use the COPY command to load data from Amazon S3. Many of the principles presented here apply to loading from other data sources as well. 

To learn more about using the COPY command, see these resources: 
+ [Amazon Redshift best practices for loading data](c_loading-data-best-practices.md)
+ [Loading data from Amazon EMR](loading-data-from-emr.md)
+ [Loading data from remote hosts](loading-data-from-remote-hosts.md)
+ [Loading data from an Amazon DynamoDB table](t_Loading-data-from-dynamodb.md)

## Step 1: Create a cluster
<a name="tutorial-loading-data-launch-cluster"></a>

If you already have a cluster that you want to use, you can skip this step. 

For the exercises in this tutorial, use a four-node cluster. 

**To create a cluster**

1. Sign in to the AWS Management Console and open the Amazon Redshift console at [https://console.aws.amazon.com/redshiftv2/](https://console.aws.amazon.com/redshiftv2/).

   Using the navigation menu, choose the **Provisioned clusters dashboard**.
**Important**  
Make sure that you have the necessary permissions to perform the cluster operations. For information on granting the necessary permissions, see [ Authorizing Amazon Redshift to access AWS services](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html).

1. At top right, choose the AWS Region in which you want to create the cluster. For the purposes of this tutorial, choose **US West (Oregon)**.

1. On the navigation menu, choose **Clusters**, then choose **Create cluster**. The **Create cluster** page appears. 

1. On the **Create cluster** page, enter parameters for your cluster. Choose your own values for the parameters, except change the following values:
   + Choose **dc2.large** for the node type.
   + Choose **4** for the **Number of nodes**.
   + In the **Cluster permissions** section, choose an IAM role from **Available IAM roles**. This role should be one that you previously created and that has access to Amazon S3. Then choose **Associate IAM role** to add it to the list of **Associated IAM roles** for the cluster.

1. Choose **Create cluster**. 

Follow the [Amazon Redshift Getting Started Guide](https://docs.aws.amazon.com/redshift/latest/gsg/) steps to connect to your cluster from a SQL client and test a connection. You don't need to complete the remaining Getting Started steps to create tables, upload data, and try example queries. 

## Step 2: Download the data files
<a name="tutorial-loading-data-download-files"></a>

In this step, you download a set of sample data files to your computer. In the next step, you upload the files to an Amazon S3 bucket.

**To download the data files**

1. Download the zipped file: [LoadingDataSampleFiles.zip](samples/LoadingDataSampleFiles.zip). 

1. Extract the files to a folder on your computer.

1. Verify that your folder contains the following files. 

   ```
   customer-fw-manifest
   customer-fw.tbl-000
   customer-fw.tbl-000.bak
   customer-fw.tbl-001
   customer-fw.tbl-002
   customer-fw.tbl-003
   customer-fw.tbl-004
   customer-fw.tbl-005
   customer-fw.tbl-006
   customer-fw.tbl-007
   customer-fw.tbl.log
   dwdate-tab.tbl-000
   dwdate-tab.tbl-001
   dwdate-tab.tbl-002
   dwdate-tab.tbl-003
   dwdate-tab.tbl-004
   dwdate-tab.tbl-005
   dwdate-tab.tbl-006
   dwdate-tab.tbl-007
   part-csv.tbl-000
   part-csv.tbl-001
   part-csv.tbl-002
   part-csv.tbl-003
   part-csv.tbl-004
   part-csv.tbl-005
   part-csv.tbl-006
   part-csv.tbl-007
   ```

## Step 3: Upload the files to an Amazon S3 bucket
<a name="tutorial-loading-data-upload-files"></a>

In this step, you create an Amazon S3 bucket and upload the data files to the bucket.

<a name="tutorial-loading-data-to-upload-files"></a>

**To upload the files to an Amazon S3 bucket**

1. Create a bucket in Amazon S3.

   For more information about creating a bucket, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) in the *Amazon Simple Storage Service User Guide*.

   1. Sign in to the AWS Management Console and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

   1. Choose **Create bucket**.

   1. Choose an AWS Region. 

      Create the bucket in the same Region as your cluster. If your cluster is in the US West (Oregon) Region, choose **US West (Oregon) Region (us-west-2)**.

   1. In the **Bucket Name** box of the **Create bucket** dialog box, enter a bucket name. 

      The bucket name you choose must be unique among all existing bucket names in Amazon S3. One way to help ensure uniqueness is to prefix your bucket names with the name of your organization. Bucket names must comply with certain rules. For more information, go to [Bucket restrictions and limitations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/BucketRestrictions.html) in the *Amazon Simple Storage Service User Guide.* 

   1. Choose the recommended defaults for the rest of the options.

   1. Choose **Create bucket**. 

      When Amazon S3 successfully creates your bucket, the console displays your empty bucket in the **Buckets** panel. 

1. Create a folder.

   1. Choose the name of the new bucket.

   1. Choose the **Create Folder** button.

   1. Name the new folder **load**.
**Note**  
The bucket that you created is not in a sandbox. In this exercise, you add objects to a real bucket. You're charged a nominal amount for the time that you store the objects in the bucket. For more information about Amazon S3 pricing, go to the [Amazon S3 pricing](https://aws.amazon.com/s3/pricing/) page.

1. Upload the data files to the new Amazon S3 bucket.

   1. Choose the name of the data folder.

   1. In the Upload wizard, choose **Add files**.

      Follow the Amazon S3 console instructions to upload all of the files you downloaded and extracted.

   1. Choose **Upload**.
<a name="tutorial-loading-user-credentials"></a>
**User Credentials**  
The Amazon Redshift COPY command must have access to read the file objects in the Amazon S3 bucket. If you use the same user credentials to create the Amazon S3 bucket and to run the Amazon Redshift COPY command, the COPY command has all necessary permissions. If you want to use different user credentials, you can grant access by using the Amazon S3 access controls. The Amazon Redshift COPY command requires at least ListBucket and GetObject permissions to access the file objects in the Amazon S3 bucket. For more information about controlling access to Amazon S3 resources, go to [Managing access permissions to your Amazon S3 resources](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-access-control.html).
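
As a sketch, an IAM policy granting those two permissions might look like the following; `amzn-s3-demo-bucket` is a placeholder bucket name:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket",
        "arn:aws:s3:::amzn-s3-demo-bucket/*"
      ]
    }
  ]
}
```

Note that `s3:ListBucket` applies to the bucket itself and `s3:GetObject` applies to the objects in it, which is why both Resource entries are needed.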

## Step 4: Create the sample tables
<a name="tutorial-loading-data-create-tables"></a>

For this tutorial, you use a set of tables based on the Star Schema Benchmark (SSB) schema. The following diagram shows the SSB data model. 

![\[The SSB schema's five tables and their relationships to one another.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/tutorial-optimize-tables-ssb-data-model.png)


The SSB tables might already exist in the current database. If so, drop the tables to remove them from the database before you create them using the CREATE TABLE commands in the next step. The tables used in this tutorial might have different attributes than the existing tables.

**To create the sample tables**

1. To drop the SSB tables, run the following commands in your SQL client.

   ```
   drop table part cascade;
   drop table supplier;
   drop table customer;
   drop table dwdate;
   drop table lineorder;
   ```

1. Run the following CREATE TABLE commands in your SQL client. 

   ```
   CREATE TABLE part 
   (
     p_partkey     INTEGER NOT NULL,
     p_name        VARCHAR(22) NOT NULL,
     p_mfgr        VARCHAR(6),
     p_category    VARCHAR(7) NOT NULL,
     p_brand1      VARCHAR(9) NOT NULL,
     p_color       VARCHAR(11) NOT NULL,
     p_type        VARCHAR(25) NOT NULL,
     p_size        INTEGER NOT NULL,
     p_container   VARCHAR(10) NOT NULL
   );
   
   CREATE TABLE supplier 
   (
     s_suppkey   INTEGER NOT NULL,
     s_name      VARCHAR(25) NOT NULL,
     s_address   VARCHAR(25) NOT NULL,
     s_city      VARCHAR(10) NOT NULL,
     s_nation    VARCHAR(15) NOT NULL,
     s_region    VARCHAR(12) NOT NULL,
     s_phone     VARCHAR(15) NOT NULL
   );
   
   CREATE TABLE customer 
   (
     c_custkey      INTEGER NOT NULL,
     c_name         VARCHAR(25) NOT NULL,
     c_address      VARCHAR(25) NOT NULL,
     c_city         VARCHAR(10) NOT NULL,
     c_nation       VARCHAR(15) NOT NULL,
     c_region       VARCHAR(12) NOT NULL,
     c_phone        VARCHAR(15) NOT NULL,
     c_mktsegment   VARCHAR(10) NOT NULL
   );
   
   CREATE TABLE dwdate 
   (
     d_datekey            INTEGER NOT NULL,
     d_date               VARCHAR(19) NOT NULL,
     d_dayofweek          VARCHAR(10) NOT NULL,
     d_month              VARCHAR(10) NOT NULL,
     d_year               INTEGER NOT NULL,
     d_yearmonthnum       INTEGER NOT NULL,
     d_yearmonth          VARCHAR(8) NOT NULL,
     d_daynuminweek       INTEGER NOT NULL,
     d_daynuminmonth      INTEGER NOT NULL,
     d_daynuminyear       INTEGER NOT NULL,
     d_monthnuminyear     INTEGER NOT NULL,
     d_weeknuminyear      INTEGER NOT NULL,
     d_sellingseason      VARCHAR(13) NOT NULL,
     d_lastdayinweekfl    VARCHAR(1) NOT NULL,
     d_lastdayinmonthfl   VARCHAR(1) NOT NULL,
     d_holidayfl          VARCHAR(1) NOT NULL,
     d_weekdayfl          VARCHAR(1) NOT NULL
   );
   CREATE TABLE lineorder 
   (
     lo_orderkey          INTEGER NOT NULL,
     lo_linenumber        INTEGER NOT NULL,
     lo_custkey           INTEGER NOT NULL,
     lo_partkey           INTEGER NOT NULL,
     lo_suppkey           INTEGER NOT NULL,
     lo_orderdate         INTEGER NOT NULL,
     lo_orderpriority     VARCHAR(15) NOT NULL,
     lo_shippriority      VARCHAR(1) NOT NULL,
     lo_quantity          INTEGER NOT NULL,
     lo_extendedprice     INTEGER NOT NULL,
     lo_ordertotalprice   INTEGER NOT NULL,
     lo_discount          INTEGER NOT NULL,
     lo_revenue           INTEGER NOT NULL,
     lo_supplycost        INTEGER NOT NULL,
     lo_tax               INTEGER NOT NULL,
     lo_commitdate        INTEGER NOT NULL,
     lo_shipmode          VARCHAR(10) NOT NULL
   );
   ```

## Step 5: Run the COPY commands
<a name="tutorial-loading-run-copy"></a>

You run COPY commands to load each of the tables in the SSB schema. The COPY command examples demonstrate loading from different file formats, using several COPY command options, and troubleshooting load errors.

### COPY command syntax
<a name="tutorial-loading-data-copy-syntax"></a>

The basic [COPY](r_COPY.md) command syntax is as follows. 

```
COPY table_name [ column_list ] FROM data_source CREDENTIALS access_credentials [options] 
```

To run a COPY command, you provide the following values. 
<a name="tutorial-loading-syntax-table-name"></a>
**Table name**  
The target table for the COPY command. The table must already exist in the database. The table can be temporary or persistent. The COPY command appends the new input data to any existing rows in the table. 
<a name="tutorial-loading-syntax-column-list"></a>
**Column list**  
By default, COPY loads fields from the source data to the table columns in order. You can optionally specify a *column list*, that is, a comma-separated list of column names, to map data fields to specific columns. You don't use column lists in this tutorial. For more information, see [Column List](copy-parameters-column-mapping.md#copy-column-list) in the COPY command reference.
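
For illustration only (this tutorial doesn't use one), a column list follows the table name, as in this sketch that maps the first two fields of each input record to two columns of the PART table:

```
copy part (p_partkey, p_name) from ...
```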

<a name="tutorial-loading-syntax-data-source.title"></a>Data source

You can use the COPY command to load data from an Amazon S3 bucket, an Amazon EMR cluster, a remote host using an SSH connection, or an Amazon DynamoDB table. For this tutorial, you load from data files in an Amazon S3 bucket. When loading from Amazon S3, you must provide the name of the bucket and the location of the data files. To do this, provide either an object path for the data files or the location of a manifest file that explicitly lists each data file and its location. 
+ Key prefix 

  An object stored in Amazon S3 is uniquely identified by an object key, which includes the bucket name, folder names, if any, and the object name. A *key prefix* refers to a set of objects with the same prefix. The object path is a key prefix that the COPY command uses to load all objects that share the key prefix. For example, the key prefix `custdata.txt` can refer to a single file or to a set of files, including `custdata.txt.001`, `custdata.txt.002`, and so on. 
+ Manifest file

  In some cases, you might need to load files with different prefixes, for example from multiple buckets or folders. In others, you might need to exclude files that share a prefix. In these cases, you can use a manifest file. A *manifest file* explicitly lists each load file and its unique object key. You use a manifest file to load the PART table later in this tutorial. 
<a name="tutorial-loading-syntax-credentials"></a>
**Credentials**  
To access the AWS resources that contain the data to load, you must provide AWS access credentials for a user with sufficient privileges. These credentials include an IAM role Amazon Resource Name (ARN). To load data from Amazon S3, the credentials must include ListBucket and GetObject permissions. Additional credentials are required if your data is encrypted. For more information, see [Authorization parameters](copy-parameters-authorization.md) in the COPY command reference. For more information about managing access, go to [Managing access permissions to your Amazon S3 resources](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-access-control.html). 

<a name="tutorial-loading-syntax-options.title"></a>Options

You can specify a number of parameters with the COPY command to specify file formats, manage data formats, manage errors, and control other features. In this tutorial, you use the following COPY command options and features: 
+ Key prefix

  For information on how to load from multiple files by specifying a key prefix, see [Load the PART table using NULL AS](#tutorial-loading-load-part).
+ CSV format

  For information on how to load data that is in CSV format, see [Load the PART table using NULL AS](#tutorial-loading-load-part).
+ NULL AS

  For information on how to load PART using the NULL AS option, see [Load the PART table using NULL AS](#tutorial-loading-load-part).
+ Character-delimited format

  For information on how to use the DELIMITER option, see [The DELIMITER and REGION options](#tutorial-loading-load-supplier).
+ REGION

  For information on how to use the REGION option, see [The DELIMITER and REGION options](#tutorial-loading-load-supplier).
+ Fixed-format width

  For information on how to load the CUSTOMER table from fixed-width data, see [Load the CUSTOMER table using MANIFEST](#tutorial-loading-load-customer).
+ MAXERROR

  For information on how to use the MAXERROR option, see [Load the CUSTOMER table using MANIFEST](#tutorial-loading-load-customer).
+ ACCEPTINVCHARS

  For information on how to use the ACCEPTINVCHARS option, see [Load the CUSTOMER table using MANIFEST](#tutorial-loading-load-customer).
+ MANIFEST

  For information on how to use the MANIFEST option, see [Load the CUSTOMER table using MANIFEST](#tutorial-loading-load-customer).
+ DATEFORMAT

  For information on how to use the DATEFORMAT option, see [Load the DWDATE table using DATEFORMAT](#tutorial-loading-load-dwdate).
+ GZIP, LZOP and BZIP2

  For information on how to compress your files, see [Load multiple data files](#tutorial-loading-load-lineorder).
+ COMPUPDATE

  For information on how to use the COMPUPDATE option, see [Load multiple data files](#tutorial-loading-load-lineorder).
+ Multiple files

  For information on how to load multiple files, see [Load multiple data files](#tutorial-loading-load-lineorder).

### Loading the SSB tables
<a name="tutorial-loading-run-copy-load-tables"></a>

You use the following COPY commands to load each of the tables in the SSB schema. The command for each table demonstrates different COPY options and troubleshooting techniques.

To load the SSB tables, follow these steps: 

1. [Replace the bucket name and AWS credentials](#tutorial-loading-run-copy-replaceables)

1. [Load the PART table using NULL AS](#tutorial-loading-load-part)

1. [Load the CUSTOMER table using MANIFEST](#tutorial-loading-load-customer)

1. [Load the DWDATE table using DATEFORMAT](#tutorial-loading-load-dwdate)

#### Replace the bucket name and AWS credentials
<a name="tutorial-loading-run-copy-replaceables"></a>

The COPY commands in this tutorial are presented in the following format.

```
copy table from 's3://<your-bucket-name>/load/key_prefix' 
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
options;
```

For each COPY command, do the following:

1. Replace *<your-bucket-name>* with the name of a bucket in the same region as your cluster. 

   This step assumes the bucket and the cluster are in the same region. Alternatively, you can specify the region using the [REGION](copy-parameters-data-source-s3.md#copy-region) option with the COPY command. 

1. Replace *<aws-account-id>* and *<role-name>* with your own AWS account and IAM role. The segment of the credentials string that is enclosed in single quotation marks must not contain any spaces or line breaks. Note that the ARN might differ slightly in format from the sample. It's best to copy the role's ARN from the IAM console to ensure that it's accurate when you run the COPY commands. 

#### Load the PART table using NULL AS
<a name="tutorial-loading-load-part"></a>

In this step, you use the CSV and NULL AS options to load the PART table. 

The COPY command can load data from multiple files in parallel, which is much faster than loading from a single file. To demonstrate this principle, the data for each table in this tutorial is split into eight files, even though the files are very small. In a later step, you compare the time difference between loading from a single file and loading from multiple files. For more information, see [Loading data files](c_best-practices-use-multiple-files.md). 
<a name="tutorial-loading-key-prefix"></a>
**Key prefix**  
You can load from multiple files by specifying a key prefix for the file set, or by explicitly listing the files in a manifest file. In this step, you use a key prefix. In a later step, you use a manifest file. The key prefix `'s3://amzn-s3-demo-bucket/load/part-csv.tbl'` loads the following set of the files in the `load` folder. 

```
part-csv.tbl-000
part-csv.tbl-001
part-csv.tbl-002
part-csv.tbl-003
part-csv.tbl-004
part-csv.tbl-005
part-csv.tbl-006
part-csv.tbl-007
```
<a name="tutorial-loading-csv-format"></a>
**CSV format**  
CSV, which stands for comma-separated values, is a common format used for importing and exporting spreadsheet data. CSV is more flexible than comma-delimited format because it enables you to include quoted strings within fields. The default quotation mark character for COPY from CSV format is a double quotation mark ( " ), but you can specify another quotation mark character by using the QUOTE AS option. When you use the quotation mark character within a field, escape it with an additional quotation mark character.

The following excerpt from a CSV-formatted data file for the PART table shows strings enclosed in double quotation marks (`"LARGE ANODIZED BRASS"`). It also shows a string enclosed in two double quotation marks within a quoted string (`"MEDIUM ""BURNISHED"" TIN"`).

```
15,dark sky,MFGR#3,MFGR#47,MFGR#3438,indigo,"LARGE ANODIZED BRASS",45,LG CASE
22,floral beige,MFGR#4,MFGR#44,MFGR#4421,medium,"PROMO, POLISHED BRASS",19,LG DRUM
23,bisque slate,MFGR#4,MFGR#41,MFGR#4137,firebrick,"MEDIUM ""BURNISHED"" TIN",42,JUMBO JAR
```
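COPY's CSV quoting rules follow the common convention implemented by most CSV parsers. As an illustrative sketch (using Python's standard `csv` module, not COPY itself), parsing the third record above yields the unescaped string:

```python
import csv
import io

# Third record from the PART excerpt, with a doubled-quote escape inside a quoted field.
line = ('23,bisque slate,MFGR#4,MFGR#41,MFGR#4137,firebrick,'
        '"MEDIUM ""BURNISHED"" TIN",42,JUMBO JAR')

row = next(csv.reader(io.StringIO(line)))
print(row[6])  # MEDIUM "BURNISHED" TIN
```

The doubled quotation marks collapse into a single embedded quotation mark, exactly as COPY interprets them.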

The data for the PART table contains characters that cause COPY to fail. In this exercise, you troubleshoot the errors and correct them. 

To load data that is in CSV format, add `csv` to your COPY command. Run the following command to load the PART table. 

```
copy part from 's3://<your-bucket-name>/load/part-csv.tbl' 
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>'
csv;
```

You might get an error message similar to the following.

```
An error occurred when executing the SQL command:
copy part from 's3://amzn-s3-demo-bucket/load/part-csv.tbl' 
credentials' ...

ERROR: Load into table 'part' failed.  Check 'stl_load_errors' system table for details. [SQL State=XX000] 

Execution time: 1.46s

1 statement(s) failed.
```

To get more information about the error, query the STL\_LOAD\_ERRORS table. The following query uses the SUBSTRING function to shorten columns for readability and uses LIMIT 10 to reduce the number of rows returned. You can adjust the values in `substring(filename,22,25)` to allow for the length of your bucket name.

```
select query, substring(filename,22,25) as filename,line_number as line, 
substring(colname,0,12) as column, type, position as pos, substring(raw_line,0,30) as line_text,
substring(raw_field_value,0,15) as field_text, 
substring(err_reason,0,45) as reason
from stl_load_errors 
order by query desc
limit 10;
```

```
 query  |    filename      | line |  column   |    type    | pos |      
--------+-------------------------+-----------+------------+------------+-----+----
 333765 | part-csv.tbl-000 |    1 |           |            |   0 |

 line_text        | field_text |                    reason
------------------+------------+----------------------------------------------
 15,NUL next,     |            | Missing newline: Unexpected character 0x2c f
```
<a name="tutorial-loading-null-as"></a>
**NULL AS**  
The `part-csv.tbl` data files use the NUL terminator character (`\x000` or `\x0`) to indicate NULL values.

**Note**  
Despite very similar spelling, NUL and NULL are not the same. NUL is a UTF-8 character with codepoint `x000` that is often used to indicate end of record (EOR). NULL is a SQL value that represents an absence of data. 

By default, COPY treats a NUL terminator character as an EOR character and terminates the record, which often results in unexpected results or an error. There is no single standard method of indicating NULL in text data. Thus, the NULL AS COPY command option enables you to specify which character to substitute with NULL when loading the table. In this example, you want COPY to treat the NUL terminator character as a NULL value.

**Note**  
The table column that receives the NULL value must be configured as *nullable.* That is, it must not include the NOT NULL constraint in the CREATE TABLE specification.
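The NULL AS substitution can be pictured with a small sketch (a hypothetical Python helper mirroring the `null as '\000'` semantics described above, with Python's `None` standing in for SQL NULL):

```python
NUL = '\x00'  # the NUL terminator character used in the part-csv.tbl files

def apply_null_as(field, null_marker=NUL):
    # Mimic COPY's NULL AS option: a field equal to the marker loads as NULL.
    return None if field == null_marker else field

# Fields from a record like the one in the error output above.
fields = ['15', 'NUL next', NUL, 'MFGR#47']
loaded = [apply_null_as(f) for f in fields]
print(loaded)  # ['15', 'NUL next', None, 'MFGR#47']
```

Note that the literal string `'NUL next'` loads unchanged; only the actual NUL character becomes NULL.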

To load PART using the NULL AS option, run the following COPY command.

```
copy part from 's3://<your-bucket-name>/load/part-csv.tbl' 
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' 
csv
null as '\000';
```

To verify that COPY loaded NULL values, run the following command to select only the rows that contain NULL.

```
select p_partkey, p_name, p_mfgr, p_category from part where p_mfgr is null;
```

```
 p_partkey |  p_name  | p_mfgr | p_category
-----------+----------+--------+------------
        15 | NUL next |        | MFGR#47
        81 | NUL next |        | MFGR#23
       133 | NUL next |        | MFGR#44 
(3 rows)
```

#### The DELIMITER and REGION options
<a name="tutorial-loading-load-supplier"></a>

The DELIMITER and REGION options are important to understand when you load data.
<a name="tutorial-loading-character-delimited-format"></a>
**Character-delimited format**  
The fields in a character-delimited file are separated by a specific character, such as a pipe character ( | ), a comma ( , ), or a tab ( \t ). Character-delimited files can use any single ASCII character, including one of the nonprinting ASCII characters, as the delimiter. You specify the delimiter character by using the DELIMITER option. The default delimiter is a pipe character ( | ). 

The following excerpt from the data for the SUPPLIER table uses pipe-delimited format. 

```
1|1|257368|465569|41365|19950218|2-HIGH|0|17|2608718|9783671|4|2504369|92072|2|19950331|TRUCK
1|2|257368|201928|8146|19950218|2-HIGH|0|36|6587676|9783671|9|5994785|109794|6|19950416|MAIL
```
<a name="tutorial-loading-region"></a>
**REGION**  
Whenever possible, you should locate your load data in the same AWS region as your Amazon Redshift cluster. If your data and your cluster are in the same region, you reduce latency and avoid cross-region data transfer costs. For more information, see [Amazon Redshift best practices for loading data](c_loading-data-best-practices.md). 

If you must load data from a different AWS region, use the REGION option to specify the AWS region in which the load data is located. If you specify a region, all of the load data, including manifest files, must be in the named region. For more information, see [REGION](copy-parameters-data-source-s3.md#copy-region). 

For example, if your cluster is in the US East (N. Virginia) Region, and your Amazon S3 bucket is located in the US West (Oregon) Region, the following COPY command shows how to load the SUPPLIER table from pipe-delimited data. 

```
copy supplier from 's3://amzn-s3-demo-bucket/ssb/supplier.tbl' 
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' 
delimiter '|' 
gzip
region 'us-west-2';
```

#### Load the CUSTOMER table using MANIFEST
<a name="tutorial-loading-load-customer"></a>

In this step, you use the FIXEDWIDTH, MAXERROR, ACCEPTINVCHARS, and MANIFEST options to load the CUSTOMER table.

The sample data for this exercise contains characters that cause errors when COPY attempts to load them. You use the MAXERROR option and the STL\_LOAD\_ERRORS system table to troubleshoot the load errors, and then use the ACCEPTINVCHARS and MANIFEST options to eliminate the errors.
<a name="tutorial-loading-fixed-width"></a>
**Fixed-Width Format**  
Fixed-width format defines each field as a fixed number of characters, rather than separating fields with a delimiter. The following excerpt from the data for the CUSTOMER table uses fixed-width format.

```
1   Customer#000000001   IVhzIApeRb           MOROCCO  0MOROCCO  AFRICA      25-705 
2   Customer#000000002   XSTf4,NCwDVaWNe6tE   JORDAN   6JORDAN   MIDDLE EAST 23-453
3   Customer#000000003   MG9kdTD              ARGENTINA5ARGENTINAAMERICA     11-783
```

The order of the label/width pairs must match the order of the table columns exactly. For more information, see [FIXEDWIDTH](copy-parameters-data-format.md#copy-fixedwidth).

The fixed-width specification string for the CUSTOMER table data is as follows.

```
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, 
c_region:12, c_phone:15, c_mktsegment:10'
```
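To see how a fixed-width specification maps onto a record, here is a hypothetical parser sketch that applies the label:width pairs above to a padded line. The sample field values, including the `BUILDING` market segment, are invented for illustration:

```python
SPEC = ('c_custkey:10, c_name:25, c_address:25, c_city:10, '
        'c_nation:15, c_region:12, c_phone:15, c_mktsegment:10')

def parse_spec(spec):
    # Turn 'name:width, name:width, ...' into an ordered list of (name, width) pairs.
    pairs = []
    for part in spec.split(','):
        name, width = part.strip().rsplit(':', 1)
        pairs.append((name.strip(), int(width)))
    return pairs

def split_fixed_width(line, pairs):
    # Slice the line at the cumulative column boundaries and strip the padding.
    fields, pos = {}, 0
    for name, width in pairs:
        fields[name] = line[pos:pos + width].rstrip()
        pos += width
    return fields

pairs = parse_spec(SPEC)
# Build a sample record by padding each value to its declared width.
values = ['1', 'Customer#000000001', 'IVhzIApeRb', 'MOROCCO',
          'MOROCCO', 'AFRICA', '25-705', 'BUILDING']
line = ''.join(v.ljust(w) for v, (_, w) in zip(values, pairs))
row = split_fixed_width(line, pairs)
print(row['c_custkey'], row['c_mktsegment'])  # 1 BUILDING
```

The declared widths sum to 122 characters, so every record in the file must be exactly that long.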

To load the CUSTOMER table from fixed-width data, run the following command.

```
copy customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' 
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region:12, c_phone:15, c_mktsegment:10';
```

You should get an error message similar to the following.

```
An error occurred when executing the SQL command:
copy customer
from 's3://amzn-s3-demo-bucket/load/customer-fw.tbl'
credentials'...

ERROR: Load into table 'customer' failed.  Check 'stl_load_errors' system table for details. [SQL State=XX000] 

Execution time: 2.95s

1 statement(s) failed.
```
<a name="tutorial-loading-maxerror"></a>
**MAXERROR**  
By default, the first time COPY encounters an error, the command fails and returns an error message. To save time during testing, you can use the MAXERROR option to instruct COPY to skip a specified number of errors before it fails. Because we expect errors the first time we test loading the CUSTOMER table data, add `maxerror 10` to the COPY command. 

To test using the FIXEDWIDTH and MAXERROR options, run the following command.

```
copy customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' 
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region:12, c_phone:15, c_mktsegment:10'
maxerror 10;
```

This time, instead of an error message, you get a warning message similar to the following.

```
Warnings:
Load into table 'customer' completed, 112497 record(s) loaded successfully.
Load into table 'customer' completed, 7 record(s) could not be loaded.  Check 'stl_load_errors' system table for details.
```

The warning indicates that COPY encountered seven errors. To check the errors, query the STL\_LOAD\_ERRORS table, as shown in the following example.

```
select query, substring(filename,22,25) as filename,line_number as line, 
substring(colname,0,12) as column, type, position as pos, substring(raw_line,0,30) as line_text,
substring(raw_field_value,0,15) as field_text, 
substring(err_reason,0,45) as error_reason
from stl_load_errors 
order by query desc, filename 
limit 7;
```

The results of the STL\_LOAD\_ERRORS query should look similar to the following.

```
 query  |         filename          | line |  column   |    type    | pos |           line_text           | field_text |              error_reason
--------+---------------------------+------+-----------+------------+-----+-------------------------------+------------+----------------------------------------------
 334489 | customer-fw.tbl.log       |    2 | c_custkey | int4       |  -1 | customer-fw.tbl               | customer-f | Invalid digit, Value 'c', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log       |    6 | c_custkey | int4       |  -1 | Complete                      | Complete   | Invalid digit, Value 'C', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log       |    3 | c_custkey | int4       |  -1 | #Total rows                   | #Total row | Invalid digit, Value '#', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log       |    5 | c_custkey | int4       |  -1 | #Status                       | #Status    | Invalid digit, Value '#', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log       |    1 | c_custkey | int4       |  -1 | #Load file                    | #Load file | Invalid digit, Value '#', Pos 0, Type: Integ
 334489 | customer-fw.tbl000        |    1 | c_address | varchar    |  34 | 1         Customer#000000001  | .Mayag.ezR | String contains invalid or unsupported UTF8
 334489 | customer-fw.tbl000        |    1 | c_address | varchar    |  34 | 1         Customer#000000001  | .Mayag.ezR | String contains invalid or unsupported UTF8
(7 rows)
```

By examining the results, you can see that there are two messages in the `error_reason` column:
+ 

  ```
  Invalid digit, Value '#', Pos 0, Type: Integ 
  ```

  These errors are caused by the `customer-fw.tbl.log` file. The problem is that it is a log file, not a data file, and should not be loaded. You can use a manifest file to avoid loading the wrong file. 
+ 

  ```
  String contains invalid or unsupported UTF8 
  ```

  The VARCHAR data type supports multibyte UTF-8 characters up to three bytes. If the load data contains unsupported or invalid characters, you can use the ACCEPTINVCHARS option to replace each invalid character with a specified alternative character.

Another problem with the load is more difficult to detect—the load produced unexpected results. To investigate this problem, run the following command to query the CUSTOMER table.

```
select c_custkey, c_name, c_address        
from customer
order by c_custkey
limit 10;
```

```
 c_custkey |          c_name           |         c_address
-----------+---------------------------+---------------------------
         2 | Customer#000000002        | XSTf4,NCwDVaWNe6tE
         2 | Customer#000000002        | XSTf4,NCwDVaWNe6tE
         3 | Customer#000000003        | MG9kdTD
         3 | Customer#000000003        | MG9kdTD
         4 | Customer#000000004        | XxVSJsL
         4 | Customer#000000004        | XxVSJsL
         5 | Customer#000000005        | KvpyuHCplrB84WgAi
         5 | Customer#000000005        | KvpyuHCplrB84WgAi
         6 | Customer#000000006        | sKZz0CsnMD7mp4Xd0YrBvx
         6 | Customer#000000006        | sKZz0CsnMD7mp4Xd0YrBvx
(10 rows)
```

The rows should be unique, but there are duplicates. 

Another way to check for unexpected results is to verify the number of rows that were loaded. In our case, 100000 rows should have been loaded, but the load message reported loading 112497 records. The extra rows were loaded because COPY loaded an extraneous file, `customer-fw.tbl0000.bak`. 

In this exercise, you use a manifest file to avoid loading the wrong files. 
<a name="tutorial-loading-acceptinvchars"></a>
**ACCEPTINVCHARS**  
By default, when COPY encounters a character that is not supported by the column's data type, it skips the row and returns an error. For information about invalid UTF-8 characters, see [Multibyte character load errors](multi-byte-character-load-errors.md). 

You could use the MAXERROR option to ignore errors and continue loading, query STL\_LOAD\_ERRORS to locate the invalid characters, and then fix the data files. However, MAXERROR is best used for troubleshooting load problems and should generally not be used in a production environment. 

The ACCEPTINVCHARS option is usually a better choice for managing invalid characters. ACCEPTINVCHARS instructs COPY to replace each invalid character with a specified valid character and continue with the load operation. You can specify any valid ASCII character, except NULL, as the replacement character. The default replacement character is a question mark ( ? ). COPY replaces multibyte characters with a replacement string of equal length. For example, a 4-byte character would be replaced with `'????'`. 

COPY returns the number of rows that contained invalid UTF-8 characters. It also adds an entry to the STL\_REPLACEMENTS system table for each affected row, up to a maximum of 100 rows per node slice. Additional invalid UTF-8 characters are also replaced, but those replacement events are not recorded. 

ACCEPTINVCHARS is valid only for VARCHAR columns. 

For this step, you add the ACCEPTINVCHARS option with the replacement character `'^'`. 
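The replacement behavior can be approximated client-side. The following sketch registers a hypothetical Python error handler (not COPY itself) that substitutes a caret for each undecodable byte; note that COPY actually replaces each invalid multibyte character with a replacement string of equal length:

```python
import codecs

def caret_handler(err):
    # Replace each undecodable byte with '^', analogous to ACCEPTINVCHARS AS '^'
    # (assumption: one caret per bad byte; COPY matches the character length instead).
    return ('^' * (err.end - err.start), err.end)

codecs.register_error('caret', caret_handler)

raw = b'1\tMayag\xe9zR'  # \xe9 is not valid standalone UTF-8
clean = raw.decode('utf-8', errors='caret')
print(clean)  # 1	Mayag^zR
```

Running a cleanup pass like this over source files is an alternative to letting COPY perform the substitution during the load.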
<a name="tutorial-loading-manifest"></a>
**MANIFEST**  
When you COPY from Amazon S3 using a key prefix, there is a risk that you might load unwanted files. For example, the `s3://amzn-s3-demo-bucket/load/` folder contains eight data files that share the key prefix `customer-fw.tbl`: `customer-fw.tbl0000`, `customer-fw.tbl0001`, and so on. However, the same folder also contains the extraneous files `customer-fw.tbl.log` and `customer-fw.tbl-0001.bak`. 

To ensure that you load all of the correct files, and only the correct files, use a manifest file. The manifest is a text file in JSON format that explicitly lists the unique object key for each source file to be loaded. The file objects can be in different folders or different buckets, but they must be in the same region. For more information, see [MANIFEST](copy-parameters-data-source-s3.md#copy-manifest).

The following shows the `customer-fw-manifest` text. 

```
{
  "entries": [
    {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-000"},
    {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-001"},
    {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-002"},
    {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-003"},
    {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-004"},    
    {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-005"},
    {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-006"}, 
    {"url":"s3://<your-bucket-name>/load/customer-fw.tbl-007"} 
    ]
}
```
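If you generate manifests as part of a pipeline, a small script can build the JSON. The following sketch assumes the demonstration bucket name and the eight customer file keys:

```python
import json

bucket = 'amzn-s3-demo-bucket'  # assumption: substitute your own bucket name
keys = [f'load/customer-fw.tbl-{i:03d}' for i in range(8)]

# Build the manifest structure that COPY's MANIFEST option expects:
# a JSON object with an "entries" array of object URLs.
manifest = {'entries': [{'url': f's3://{bucket}/{k}'} for k in keys]}
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Upload the resulting file to Amazon S3 and reference it in the COPY command's FROM clause with the MANIFEST option.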

**To load the data for the CUSTOMER table using the manifest file**

1. Open the file `customer-fw-manifest` in a text editor.

1. Replace *<your-bucket-name>* with the name of your bucket.

1. Save the file.

1. Upload the file to the load folder on your bucket.

1. Run the following COPY command.

   ```
   copy customer from 's3://<your-bucket-name>/load/customer-fw-manifest'
   credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' 
   fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region:12, c_phone:15, c_mktsegment:10'
   maxerror 10 
   acceptinvchars as '^'
   manifest;
   ```

#### Load the DWDATE table using DATEFORMAT
<a name="tutorial-loading-load-dwdate"></a>

In this step, you use the DELIMITER and DATEFORMAT options to load the DWDATE table.

When loading DATE and TIMESTAMP columns, COPY expects the default format, which is YYYY-MM-DD for dates and YYYY-MM-DD HH:MI:SS for timestamps. If the load data does not use a default format, you can use DATEFORMAT and TIMEFORMAT to specify the format. 

The following excerpt shows date formats in the DWDATE table. Notice that the date formats in column two are inconsistent.

```
19920104	1992-01-04          Sunday		January	1992	199201	Jan1992	1	4	4	1...
19920112	January 12, 1992	Monday		January	1992	199201	Jan1992	2	12	12	1...
19920120	January 20, 1992	Tuesday	    January	1992	199201	Jan1992	3	20	20	1...
```
<a name="tutorial-loading-dateformat"></a>
**DATEFORMAT**  
You can specify only one date format. If the load data contains inconsistent formats, possibly in different columns, or if the format is not known at load time, use DATEFORMAT with the `'auto'` argument. When `'auto'` is specified, COPY recognizes any valid date or time format and converts it to the default format. The `'auto'` option recognizes several formats that are not supported when using a DATEFORMAT and TIMEFORMAT string. For more information, see [Using automatic recognition with DATEFORMAT and TIMEFORMAT](automatic-recognition.md). 

To load the DWDATE table, run the following COPY command.

```
copy dwdate from 's3://<your-bucket-name>/load/dwdate-tab.tbl'
credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' 
delimiter '\t' 
dateformat 'auto';
```
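Client-side, the effect of `dateformat 'auto'` can be approximated by trying a list of candidate formats in turn; the formats below are an assumed subset of what COPY actually recognizes:

```python
from datetime import datetime

# Assumed subset of formats matching the DWDATE excerpt above;
# COPY's 'auto' recognition covers many more.
CANDIDATE_FORMATS = ('%Y-%m-%d', '%B %d, %Y', '%Y%m%d')

def normalize_date(text):
    # Try each candidate format and convert the first match
    # to the default YYYY-MM-DD representation.
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime('%Y-%m-%d')
        except ValueError:
            continue
    raise ValueError(f'unrecognized date format: {text!r}')

print(normalize_date('January 12, 1992'))  # 1992-01-12
print(normalize_date('1992-01-04'))        # 1992-01-04
```

Both inconsistent forms from the excerpt normalize to the same default format, which is what `'auto'` accomplishes during the load.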

#### Load multiple data files
<a name="tutorial-loading-load-lineorder"></a>

In this step, you use the GZIP and COMPUPDATE options to load the LINEORDER table.

You can load a table from a single data file or from multiple files. In this step, you load the table both ways so that you can compare the load times for the two methods. 
<a name="tutorial-loading-gzip-lzop"></a>
**GZIP, LZOP and BZIP2**  
You can compress your files using either gzip, lzop, or bzip2 compression formats. When loading from compressed files, COPY uncompresses the files during the load process. Compressing your files saves storage space and shortens upload times. 
<a name="tutorial-loading-compupdate"></a>
**COMPUPDATE**  
When COPY loads an empty table with no compression encodings, it analyzes the load data to determine the optimal encodings. It then alters the table to use those encodings before beginning the load. This analysis process takes time, but it occurs, at most, once per table. To save time, you can skip this step by turning COMPUPDATE off. To enable an accurate evaluation of COPY times, you turn COMPUPDATE off for this step.
<a name="tutorial-loading-multiple-files"></a>
**Multiple Files**  
The COPY command can load data very efficiently when it loads from multiple files in parallel instead of from a single file. You can split your data into files so that the number of files is a multiple of the number of slices in your cluster. If you do, Amazon Redshift divides the workload and distributes the data evenly among the slices. The number of slices per node depends on the node size of the cluster. For more information about the number of slices that each node size has, see [About clusters and nodes](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-about-clusters-and-nodes) in the *Amazon Redshift Management Guide*.

For example, the compute nodes in your cluster in this tutorial can have two slices each, so a four-node cluster has eight slices. In previous steps, the load data was contained in eight files, even though the files are very small. You can compare the time difference between loading from a single large file and loading from multiple files. 

The files contain roughly 15 million records and occupy about 1.2 GB. That is very small at Amazon Redshift scale, but it is sufficient to demonstrate the performance advantage of loading from multiple files. 
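The guidance to make the file count a multiple of the slice count can be sketched as a simple round-robin split (a hypothetical client-side helper, not part of COPY):

```python
def split_for_slices(records, num_slices):
    # Distribute records round-robin into num_slices output files
    # so each slice receives a roughly equal share of the work.
    files = [[] for _ in range(num_slices)]
    for i, rec in enumerate(records):
        files[i % num_slices].append(rec)
    return files

# Example: 100 records split for an eight-slice (four-node) cluster.
parts = split_for_slices(list(range(100)), 8)
print([len(p) for p in parts])  # [13, 13, 13, 13, 12, 12, 12, 12]
```

Each part would then be written (and optionally compressed) as a separate Amazon S3 object sharing a common key prefix.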

The following image shows the data files for LINEORDER.

![\[The data in the LINEORDER table split into nine files.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/tutorial-load-lineorder-files.png)


**To evaluate the performance of COPY with multiple files**

1. In a lab test, the following command was run to COPY from a single file. This command shows a fictitious bucket.

   ```
   copy lineorder from 's3://amzn-s3-demo-bucket/load/lo/lineorder-single.tbl' 
   credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' 
   gzip
   compupdate off
   region 'us-east-1';
   ```

1. The results were as follows. Note the execution time.

   ```
   Warnings:
   Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.
   
   0 row(s) affected.
   copy executed successfully
   
   Execution time: 51.56s
   ```

1. Then the following command to COPY from multiple files was run.

   ```
   copy lineorder from 's3://amzn-s3-demo-bucket/load/lo/lineorder-multi.tbl' 
   credentials 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' 
   gzip
   compupdate off
   region 'us-east-1';
   ```

1. The results were as follows. Note the execution time.

   ```
   Warnings:
   Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.
   
   0 row(s) affected.
   copy executed successfully
   
   Execution time: 17.7s
   ```

1. Compare execution times.

   In our experiment, the time to load 15 million records decreased from 51.56 seconds to 17.7 seconds, a reduction of 65.7 percent. 

These results are based on using a four-node cluster. If your cluster has more nodes, the time savings are multiplied. For typical Amazon Redshift clusters, with tens to hundreds of nodes, the difference is even more dramatic. If you have a single-node cluster, there is little difference between the execution times. 
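The reported reduction is straightforward to verify from the two execution times:

```python
single_file_s = 51.56  # execution time for the single-file load
multi_file_s = 17.7    # execution time for the multi-file load

# Percentage reduction in load time when loading from multiple files.
reduction_pct = (single_file_s - multi_file_s) / single_file_s * 100
print(f'{reduction_pct:.1f}%')  # 65.7%
```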

## Step 6: Vacuum and analyze the database
<a name="tutorial-loading-data-vacuum"></a>

Whenever you add, delete, or modify a significant number of rows, you should run a VACUUM command and then an ANALYZE command. A *vacuum* recovers the space from deleted rows and restores the sort order. The ANALYZE command updates the statistics metadata, which enables the query optimizer to generate more accurate query plans. For more information, see [Vacuuming tables](t_Reclaiming_storage_space202.md). 

If you load the data in sort key order, a vacuum is fast. In this tutorial, you added a significant number of rows, but you added them to empty tables. That being the case, there is no need to re-sort, and you didn't delete any rows. COPY automatically updates statistics after loading an empty table, so your statistics should be up-to-date. However, as a matter of good housekeeping, you complete this tutorial by vacuuming and analyzing your database.

To vacuum and analyze the database, run the following commands.

```
vacuum;
analyze;
```

## Step 7: Clean up your resources
<a name="tutorial-loading-data-clean-up"></a>

Your cluster continues to accrue charges as long as it is running. When you have completed this tutorial, you should return your environment to the previous state by following the steps in [Step 5: Revoke access and delete your sample cluster](https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-clean-up-tasks.html) in the *Amazon Redshift Getting Started Guide*.

If you want to keep the cluster, but recover the storage used by the SSB tables, run the following commands.

```
drop table part;
drop table supplier;
drop table customer;
drop table dwdate;
drop table lineorder;
```

### Next
<a name="tutorial-loading-next-summary"></a>

[Summary](#tutorial-loading-data-summary)

## Summary
<a name="tutorial-loading-data-summary"></a>

In this tutorial, you uploaded data files to Amazon S3 and then used COPY commands to load the data from the files into Amazon Redshift tables.

You loaded data using the following formats:
+ Character-delimited
+ CSV
+ Fixed-width

You used the STL\_LOAD\_ERRORS system table to troubleshoot load errors, and then used the REGION, MANIFEST, MAXERROR, ACCEPTINVCHARS, DATEFORMAT, and NULL AS options to resolve the errors.

You applied the following best practices for loading data: 
+ [Use a COPY command to load data](c_best-practices-use-copy.md)
+ [Loading data files](c_best-practices-use-multiple-files.md)
+ [Use a single COPY command to load from multiple files](c_best-practices-single-copy-command.md)
+ [Compressing your data files](c_best-practices-compress-data-files.md)
+ [Verify data files before and after a load](c_best-practices-verifying-data-files.md)

For more information about Amazon Redshift best practices, see the following links: 
+ [Amazon Redshift best practices for loading data](c_loading-data-best-practices.md)
+ [Amazon Redshift best practices for designing tables](c_designing-tables-best-practices.md) 
+ [Amazon Redshift best practices for designing queries](c_designing-queries-best-practices.md) 