COPY examples - Amazon Redshift

COPY examples

Note

These examples contain line breaks for readability. Do not include line breaks or spaces in your credentials-args string.

Load FAVORITEMOVIES from an DynamoDB table

The AWS SDKs include a simple example of creating a DynamoDB table called Movies. (For this example, see Getting Started with DynamoDB.) The following example loads the Amazon Redshift MOVIES table with data from the DynamoDB table. The Amazon Redshift table must already exist in the database.

copy favoritemovies from 'dynamodb://Movies' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' readratio 50;

Load LISTING from an Amazon S3 bucket

The following example loads LISTING from an Amazon S3 bucket. The COPY command loads all of the files in the /data/listing/ folder.

copy listing from 's3://amzn-s3-demo-bucket/data/listing/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

Load LISTING from an Amazon EMR cluster

The following example loads the SALES table with tab-delimited data from lzop-compressed files in an Amazon EMR cluster. COPY loads every file in the myoutput/ folder that begins with part-.

copy sales from 'emr://j-SAMPLE2B500FC/myoutput/part-*' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' delimiter '\t' lzop;

The following example loads the SALES table with JSON formatted data in an Amazon EMR cluster. COPY loads every file in the myoutput/json/ folder.

copy sales from 'emr://j-SAMPLE2B500FC/myoutput/json/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' JSON 's3://amzn-s3-demo-bucket/jsonpaths.txt';

Using a manifest to specify data files

You can use a manifest to ensure that your COPY command loads all of the required files, and only the required files, from Amazon S3. You can also use a manifest when you need to load multiple files from different buckets or files that don't share the same prefix.

For example, suppose that you need to load the following three files: custdata1.txt, custdata2.txt, and custdata3.txt. You could use the following command to load all of the files in amzn-s3-demo-bucket that begin with custdata by specifying a prefix:

copy category from 's3://amzn-s3-demo-bucket/custdata' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

If only two of the files exist because of an error, COPY loads only those two files and finishes successfully, resulting in an incomplete data load. If the bucket also contains an unwanted file that happens to use the same prefix, such as a file named custdata.backup for example, COPY loads that file as well, resulting in unwanted data being loaded.

To ensure that all of the required files are loaded and to prevent unwanted files from being loaded, you can use a manifest file. The manifest is a JSON-formatted text file that lists the files to be processed by the COPY command. For example, the following manifest loads the three files in the previous example.

{ "entries":[ { "url":"s3://amzn-s3-demo-bucket/custdata.1", "mandatory":true }, { "url":"s3://amzn-s3-demo-bucket/custdata.2", "mandatory":true }, { "url":"s3://amzn-s3-demo-bucket/custdata.3", "mandatory":true } ] }

The optional mandatory flag indicates whether COPY should terminate if the file doesn't exist. The default is false. Regardless of any mandatory settings, COPY terminates if no files are found. In this example, COPY returns an error if any of the files isn't found. Unwanted files that might have been picked up if you specified only a key prefix, such as custdata.backup, are ignored, because they aren't on the manifest.

When loading from data files in ORC or Parquet format, a meta field is required, as shown in the following example.

{ "entries":[ { "url":"s3://amzn-s3-demo-bucket1/orc/2013-10-04-custdata", "mandatory":true, "meta":{ "content_length":99 } }, { "url":"s3://amzn-s3-demo-bucket2/orc/2013-10-05-custdata", "mandatory":true, "meta":{ "content_length":99 } } ] }

The following example uses a manifest named cust.manifest.

copy customer from 's3://amzn-s3-demo-bucket/cust.manifest' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' format as orc manifest;

You can use a manifest to load files from different buckets or files that don't share the same prefix. The following example shows the JSON to load data with files whose names begin with a date stamp.

{ "entries": [ {"url":"s3://amzn-s3-demo-bucket/2013-10-04-custdata.txt","mandatory":true}, {"url":"s3://amzn-s3-demo-bucket/2013-10-05-custdata.txt","mandatory":true}, {"url":"s3://amzn-s3-demo-bucket/2013-10-06-custdata.txt","mandatory":true}, {"url":"s3://amzn-s3-demo-bucket/2013-10-07-custdata.txt","mandatory":true} ] }

The manifest can list files that are in different buckets, as long as the buckets are in the same AWS Region as the cluster.

{ "entries": [ {"url":"s3://amzn-s3-demo-bucket1/custdata1.txt","mandatory":false}, {"url":"s3://amzn-s3-demo-bucket2/custdata1.txt","mandatory":false}, {"url":"s3://amzn-s3-demo-bucket2/custdata2.txt","mandatory":false} ] }

Load LISTING from a pipe-delimited file (default delimiter)

The following example is a very simple case in which no options are specified and the input file contains the default delimiter, a pipe character ('|').

copy listing from 's3://amzn-s3-demo-bucket/data/listings_pipe.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

Load LISTING using columnar data in Parquet format

The following example loads data from a folder on Amazon S3 named parquet.

copy listing from 's3://amzn-s3-demo-bucket/data/listings/parquet/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' format as parquet;

Load LISTING using columnar data in ORC format

The following example loads data from a folder on Amazon S3 named orc.

copy listing from 's3://amzn-s3-demo-bucket/data/listings/orc/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' format as orc;

Load EVENT with options

The following example loads pipe-delimited data into the EVENT table and applies the following rules:

  • If pairs of quotation marks are used to surround any character strings, they are removed.

  • Both empty strings and strings that contain blanks are loaded as NULL values.

  • The load fails if more than 5 errors are returned.

  • Timestamp values must comply with the specified format; for example, a valid timestamp is 2008-09-26 05:43:12.

copy event from 's3://amzn-s3-demo-bucket/data/allevents_pipe.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' removequotes emptyasnull blanksasnull maxerror 5 delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS';

Load VENUE from a fixed-width data file

copy venue from 's3://amzn-s3-demo-bucket/data/venue_fw.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' fixedwidth 'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6';

The preceding example assumes a data file formatted in the same way as the sample data shown. In the sample following, spaces act as placeholders so that all of the columns are the same width as noted in the specification:

1 Toyota Park Bridgeview IL0 2 Columbus Crew Stadium Columbus OH0 3 RFK Stadium Washington DC0 4 CommunityAmerica BallparkKansas City KS0 5 Gillette Stadium Foxborough MA68756

Load CATEGORY from a CSV file

Suppose you want to load the CATEGORY with the values shown in the following table.

catid catgroup catname catdesc
12 Shows Musicals Musical theatre
13 Shows Plays All "non-musical" theatre
14 Shows Opera All opera, light, and "rock" opera
15 Concerts Classical All symphony, concerto, and choir concerts

The following example shows the contents of a text file with the field values separated by commas.

12,Shows,Musicals,Musical theatre 13,Shows,Plays,All "non-musical" theatre 14,Shows,Opera,All opera, light, and "rock" opera 15,Concerts,Classical,All symphony, concerto, and choir concerts

If you load the file using the DELIMITER parameter to specify comma-delimited input, the COPY command fails because some input fields contain commas. You can avoid that problem by using the CSV parameter and enclosing the fields that contain commas in quotation mark characters. If the quotation mark character appears within a quoted string, you need to escape it by doubling the quotation mark character. The default quotation mark character is a double quotation mark, so you need to escape each double quotation mark with an additional double quotation mark. Your new input file looks something like this.

12,Shows,Musicals,Musical theatre 13,Shows,Plays,"All ""non-musical"" theatre" 14,Shows,Opera,"All opera, light, and ""rock"" opera" 15,Concerts,Classical,"All symphony, concerto, and choir concerts"

Assuming the file name is category_csv.txt, you can load the file by using the following COPY command:

copy category from 's3://amzn-s3-demo-bucket/data/category_csv.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' csv;

Alternatively, to avoid the need to escape the double quotation marks in your input, you can specify a different quotation mark character by using the QUOTE AS parameter. For example, the following version of category_csv.txt uses '%' as the quotation mark character.

12,Shows,Musicals,Musical theatre 13,Shows,Plays,%All "non-musical" theatre% 14,Shows,Opera,%All opera, light, and "rock" opera% 15,Concerts,Classical,%All symphony, concerto, and choir concerts%

The following COPY command uses QUOTE AS to load category_csv.txt:

copy category from 's3://amzn-s3-demo-bucket/data/category_csv.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' csv quote as '%';

Load VENUE with explicit values for an IDENTITY column

The following example assumes that when the VENUE table was created that at least one column (such as the venueid column) was specified to be an IDENTITY column. This command overrides the default IDENTITY behavior of autogenerating values for an IDENTITY column and instead loads the explicit values from the venue.txt file. Amazon Redshift does not check if duplicate IDENTITY values are loaded into the table when using the EXLICIT_IDS option.

copy venue from 's3://amzn-s3-demo-bucket/data/venue.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' explicit_ids;

Load TIME from a pipe-delimited GZIP file

The following example loads the TIME table from a pipe-delimited GZIP file:

copy time from 's3://amzn-s3-demo-bucket/data/timerows.gz' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' gzip delimiter '|';

Load a timestamp or datestamp

The following example loads data with a formatted timestamp.

Note

The TIMEFORMAT of HH:MI:SS can also support fractional seconds beyond the SS to a microsecond level of detail. The file time.txt used in this example contains one row, 2009-01-12 14:15:57.119568.

copy timestamp1 from 's3://amzn-s3-demo-bucket/data/time.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' timeformat 'YYYY-MM-DD HH:MI:SS';

The result of this copy is as follows:

select * from timestamp1; c1 ---------------------------- 2009-01-12 14:15:57.119568 (1 row)

Load data from a file with default values

The following example uses a variation of the VENUE table in the TICKIT database. Consider a VENUE_NEW table defined with the following statement:

create table venue_new( venueid smallint not null, venuename varchar(100) not null, venuecity varchar(30), venuestate char(2), venueseats integer not null default '1000');

Consider a venue_noseats.txt data file that contains no values for the VENUESEATS column, as shown in the following example:

1|Toyota Park|Bridgeview|IL| 2|Columbus Crew Stadium|Columbus|OH| 3|RFK Stadium|Washington|DC| 4|CommunityAmerica Ballpark|Kansas City|KS| 5|Gillette Stadium|Foxborough|MA| 6|New York Giants Stadium|East Rutherford|NJ| 7|BMO Field|Toronto|ON| 8|The Home Depot Center|Carson|CA| 9|Dick's Sporting Goods Park|Commerce City|CO| 10|Pizza Hut Park|Frisco|TX|

The following COPY statement will successfully load the table from the file and apply the DEFAULT value ('1000') to the omitted column:

copy venue_new(venueid, venuename, venuecity, venuestate) from 's3://amzn-s3-demo-bucket/data/venue_noseats.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' delimiter '|';

Now view the loaded table:

select * from venue_new order by venueid; venueid | venuename | venuecity | venuestate | venueseats ---------+----------------------------+-----------------+------------+------------ 1 | Toyota Park | Bridgeview | IL | 1000 2 | Columbus Crew Stadium | Columbus | OH | 1000 3 | RFK Stadium | Washington | DC | 1000 4 | CommunityAmerica Ballpark | Kansas City | KS | 1000 5 | Gillette Stadium | Foxborough | MA | 1000 6 | New York Giants Stadium | East Rutherford | NJ | 1000 7 | BMO Field | Toronto | ON | 1000 8 | The Home Depot Center | Carson | CA | 1000 9 | Dick's Sporting Goods Park | Commerce City | CO | 1000 10 | Pizza Hut Park | Frisco | TX | 1000 (10 rows)

For the following example, in addition to assuming that no VENUESEATS data is included in the file, also assume that no VENUENAME data is included:

1||Bridgeview|IL| 2||Columbus|OH| 3||Washington|DC| 4||Kansas City|KS| 5||Foxborough|MA| 6||East Rutherford|NJ| 7||Toronto|ON| 8||Carson|CA| 9||Commerce City|CO| 10||Frisco|TX|

Using the same table definition, the following COPY statement fails because no DEFAULT value was specified for VENUENAME, and VENUENAME is a NOT NULL column:

copy venue(venueid, venuecity, venuestate) from 's3://amzn-s3-demo-bucket/data/venue_pipe.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' delimiter '|';

Now consider a variation of the VENUE table that uses an IDENTITY column:

create table venue_identity( venueid int identity(1,1), venuename varchar(100) not null, venuecity varchar(30), venuestate char(2), venueseats integer not null default '1000');

As with the previous example, assume that the VENUESEATS column has no corresponding values in the source file. The following COPY statement successfully loads the table, including the predefined IDENTITY data values instead of autogenerating those values:

copy venue(venueid, venuename, venuecity, venuestate) from 's3://amzn-s3-demo-bucket/data/venue_pipe.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' delimiter '|' explicit_ids;

This statement fails because it doesn't include the IDENTITY column (VENUEID is missing from the column list) yet includes an EXPLICIT_IDS parameter:

copy venue(venuename, venuecity, venuestate) from 's3://amzn-s3-demo-bucket/data/venue_pipe.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' delimiter '|' explicit_ids;

This statement fails because it doesn't include an EXPLICIT_IDS parameter:

copy venue(venueid, venuename, venuecity, venuestate) from 's3://amzn-s3-demo-bucket/data/venue_pipe.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' delimiter '|';

COPY data with the ESCAPE option

The following example shows how to load characters that match the delimiter character (in this case, the pipe character). In the input file, make sure that all of the pipe characters (|) that you want to load are escaped with the backslash character (\). Then load the file with the ESCAPE parameter.

$ more redshiftinfo.txt 1|public\|event\|dwuser 2|public\|sales\|dwuser create table redshiftinfo(infoid int,tableinfo varchar(50)); copy redshiftinfo from 's3://amzn-s3-demo-bucket/data/redshiftinfo.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' delimiter '|' escape; select * from redshiftinfo order by 1; infoid | tableinfo -------+-------------------- 1 | public|event|dwuser 2 | public|sales|dwuser (2 rows)

Without the ESCAPE parameter, this COPY command fails with an Extra column(s) found error.

Important

If you load your data using a COPY with the ESCAPE parameter, you must also specify the ESCAPE parameter with your UNLOAD command to generate the reciprocal output file. Similarly, if you UNLOAD using the ESCAPE parameter, you need to use ESCAPE when you COPY the same data.

Copy from JSON examples

In the following examples, you load the CATEGORY table with the following data.

CATID CATGROUP CATNAME CATDESC
1 Sports MLB Major League Baseball
2 Sports NHL National Hockey League
3 Sports NFL National Football League
4 Sports NBA National Basketball Association
5 Concerts Classical All symphony, concerto, and choir concerts

Load from JSON data using the 'auto' option

To load from JSON data using the 'auto' option, the JSON data must consist of a set of objects. The key names must match the column names, but the order doesn't matter. The following shows the contents of a file named category_object_auto.json.

{ "catdesc": "Major League Baseball", "catid": 1, "catgroup": "Sports", "catname": "MLB" } { "catgroup": "Sports", "catid": 2, "catname": "NHL", "catdesc": "National Hockey League" }{ "catid": 3, "catname": "NFL", "catgroup": "Sports", "catdesc": "National Football League" } { "bogus": "Bogus Sports LLC", "catid": 4, "catgroup": "Sports", "catname": "NBA", "catdesc": "National Basketball Association" } { "catid": 5, "catgroup": "Shows", "catname": "Musicals", "catdesc": "All symphony, concerto, and choir concerts" }

To load from the JSON data file in the previous example, run the following COPY command.

copy category from 's3://amzn-s3-demo-bucket/category_object_auto.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' json 'auto';

Load from JSON data using the 'auto ignorecase' option

To load from JSON data using the 'auto ignorecase' option, the JSON data must consist of a set of objects. The case of the key names doesn't have to match the column names and the order doesn't matter. The following shows the contents of a file named category_object_auto-ignorecase.json.

{ "CatDesc": "Major League Baseball", "CatID": 1, "CatGroup": "Sports", "CatName": "MLB" } { "CatGroup": "Sports", "CatID": 2, "CatName": "NHL", "CatDesc": "National Hockey League" }{ "CatID": 3, "CatName": "NFL", "CatGroup": "Sports", "CatDesc": "National Football League" } { "bogus": "Bogus Sports LLC", "CatID": 4, "CatGroup": "Sports", "CatName": "NBA", "CatDesc": "National Basketball Association" } { "CatID": 5, "CatGroup": "Shows", "CatName": "Musicals", "CatDesc": "All symphony, concerto, and choir concerts" }

To load from the JSON data file in the previous example, run the following COPY command.

copy category from 's3://amzn-s3-demo-bucket/category_object_auto ignorecase.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' json 'auto ignorecase';

Load from JSON data using a JSONPaths file

If the JSON data objects don't correspond directly to column names, you can use a JSONPaths file to map the JSON elements to columns. The order doesn't matter in the JSON source data, but the order of the JSONPaths file expressions must match the column order. Suppose that you have the following data file, named category_object_paths.json.

{ "one": 1, "two": "Sports", "three": "MLB", "four": "Major League Baseball" } { "three": "NHL", "four": "National Hockey League", "one": 2, "two": "Sports" } { "two": "Sports", "three": "NFL", "one": 3, "four": "National Football League" } { "one": 4, "two": "Sports", "three": "NBA", "four": "National Basketball Association" } { "one": 6, "two": "Shows", "three": "Musicals", "four": "All symphony, concerto, and choir concerts" }

The following JSONPaths file, named category_jsonpath.json, maps the source data to the table columns.

{ "jsonpaths": [ "$['one']", "$['two']", "$['three']", "$['four']" ] }

To load from the JSON data file in the previous example, run the following COPY command.

copy category from 's3://amzn-s3-demo-bucket/category_object_paths.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' json 's3://amzn-s3-demo-bucket/category_jsonpath.json';

Load from JSON arrays using a JSONPaths file

To load from JSON data that consists of a set of arrays, you must use a JSONPaths file to map the array elements to columns. Suppose that you have the following data file, named category_array_data.json.

[1,"Sports","MLB","Major League Baseball"] [2,"Sports","NHL","National Hockey League"] [3,"Sports","NFL","National Football League"] [4,"Sports","NBA","National Basketball Association"] [5,"Concerts","Classical","All symphony, concerto, and choir concerts"]

The following JSONPaths file, named category_array_jsonpath.json, maps the source data to the table columns.

{ "jsonpaths": [ "$[0]", "$[1]", "$[2]", "$[3]" ] }

To load from the JSON data file in the previous example, run the following COPY command.

copy category from 's3://amzn-s3-demo-bucket/category_array_data.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' json 's3://amzn-s3-demo-bucket/category_array_jsonpath.json';

Copy from Avro examples

In the following examples, you load the CATEGORY table with the following data.

CATID CATGROUP CATNAME CATDESC
1 Sports MLB Major League Baseball
2 Sports NHL National Hockey League
3 Sports NFL National Football League
4 Sports NBA National Basketball Association
5 Concerts Classical All symphony, concerto, and choir concerts

Load from Avro data using the 'auto' option

To load from Avro data using the 'auto' argument, field names in the Avro schema must match the column names. When using the 'auto' argument, order doesn't matter. The following shows the schema for a file named category_auto.avro.

{
    "name": "category",
    "type": "record",
    "fields": [
        {"name": "catid", "type": "int"},
        {"name": "catdesc", "type": "string"},
        {"name": "catname", "type": "string"},
        {"name": "catgroup", "type": "string"},
}

The data in an Avro file is in binary format, so it isn't human-readable. The following shows a JSON representation of the data in the category_auto.avro file.

{
   "catid": 1,
   "catdesc": "Major League Baseball",
   "catname": "MLB",
   "catgroup": "Sports"
}
{
   "catid": 2,
   "catdesc": "National Hockey League",
   "catname": "NHL",
   "catgroup": "Sports"
}
{
   "catid": 3,
   "catdesc": "National Basketball Association",
   "catname": "NBA",
   "catgroup": "Sports"
}
{
   "catid": 4,
   "catdesc": "All symphony, concerto, and choir concerts",
   "catname": "Classical",
   "catgroup": "Concerts"
}

To load from the Avro data file in the previous example, run the following COPY command.

copy category from 's3://amzn-s3-demo-bucket/category_auto.avro' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' format as avro 'auto';

Load from Avro data using the 'auto ignorecase' option

To load from Avro data using the 'auto ignorecase' argument, the case of the field names in the Avro schema does not have to match the case of column names. When using the 'auto ignorecase' argument, order doesn't matter. The following shows the schema for a file named category_auto-ignorecase.avro.

{
    "name": "category",
    "type": "record",
    "fields": [
        {"name": "CatID", "type": "int"},
        {"name": "CatDesc", "type": "string"},
        {"name": "CatName", "type": "string"},
        {"name": "CatGroup", "type": "string"},
}

The data in an Avro file is in binary format, so it isn't human-readable. The following shows a JSON representation of the data in the category_auto-ignorecase.avro file.

{
   "CatID": 1,
   "CatDesc": "Major League Baseball",
   "CatName": "MLB",
   "CatGroup": "Sports"
}
{
   "CatID": 2,
   "CatDesc": "National Hockey League",
   "CatName": "NHL",
   "CatGroup": "Sports"
}
{
   "CatID": 3,
   "CatDesc": "National Basketball Association",
   "CatName": "NBA",
   "CatGroup": "Sports"
}
{
   "CatID": 4,
   "CatDesc": "All symphony, concerto, and choir concerts",
   "CatName": "Classical",
   "CatGroup": "Concerts"
}

To load from the Avro data file in the previous example, run the following COPY command.

copy category from 's3://amzn-s3-demo-bucket/category_auto-ignorecase.avro' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' format as avro 'auto ignorecase';

Load from Avro data using a JSONPaths file

If the field names in the Avro schema don't correspond directly to column names, you can use a JSONPaths file to map the schema elements to columns. The order of the JSONPaths file expressions must match the column order.

Suppose that you have a data file named category_paths.avro that contains the same data as in the previous example, but with the following schema.

{
    "name": "category",
    "type": "record",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "desc", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "group", "type": "string"},
        {"name": "region", "type": "string"} 
     ]
}

The following JSONPaths file, named category_path.avropath, maps the source data to the table columns.

{
    "jsonpaths": [
        "$['id']",
        "$['group']",
        "$['name']",
        "$['desc']"
    ]
}

To load from the Avro data file in the previous example, run the following COPY command.

copy category from 's3://amzn-s3-demo-bucket/category_object_paths.avro' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' format avro 's3://amzn-s3-demo-bucket/category_path.avropath ';

Preparing files for COPY with the ESCAPE option

The following example describes how you might prepare data to "escape" newline characters before importing the data into an Amazon Redshift table using the COPY command with the ESCAPE parameter. Without preparing the data to delimit the newline characters, Amazon Redshift returns load errors when you run the COPY command, because the newline character is normally used as a record separator.

For example, consider a file or a column in an external table that you want to copy into an Amazon Redshift table. If the file or column contains XML-formatted content or similar data, you need to make sure that all of the newline characters (\n) that are part of the content are escaped with the backslash character (\).

A file or table containing embedded newlines characters provides a relatively easy pattern to match. Each embedded newline character most likely always follows a > character with potentially some white space characters (' ' or tab) in between, as you can see in the following example of a text file named nlTest1.txt.

$ cat nlTest1.txt <xml start> <newline characters provide> <line breaks at the end of each> <line in content> </xml>|1000 <xml> </xml>|2000

With the following example, you can run a text-processing utility to pre-process the source file and insert escape characters where needed. (The | character is intended to be used as delimiter to separate column data when copied into an Amazon Redshift table.)

$ sed -e ':a;N;$!ba;s/>[[:space:]]*\n/>\\\n/g' nlTest1.txt > nlTest2.txt

Similarly, you can use Perl to perform a similar operation:

cat nlTest1.txt | perl -p -e 's/>\s*\n/>\\\n/g' > nlTest2.txt

To accommodate loading the data from the nlTest2.txt file into Amazon Redshift, we created a two-column table in Amazon Redshift. The first column c1, is a character column that holds XML-formatted content from the nlTest2.txt file. The second column c2 holds integer values loaded from the same file.

After running the sed command, you can correctly load data from the nlTest2.txt file into an Amazon Redshift table using the ESCAPE parameter.

Note

When you include the ESCAPE parameter with the COPY command, it escapes a number of special characters that include the backslash character (including newline).

copy t2 from 's3://amzn-s3-demo-bucket/data/nlTest2.txt' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' escape delimiter as '|'; select * from t2 order by 2; c1 | c2 -------------+------ <xml start> <newline characters provide> <line breaks at the end of each> <line in content> </xml> | 1000 <xml> </xml> | 2000 (2 rows)

You can prepare data files exported from external databases in a similar way. For example, with an Oracle database, you can use the REPLACE function on each affected column in a table that you want to copy into Amazon Redshift.

SELECT c1, REPLACE(c2, \n',\\n' ) as c2 from my_table_with_xml

In addition, many database export and extract, transform, load (ETL) tools that routinely process large amounts of data provide options to specify escape and delimiter characters.

Loading a shapefile into Amazon Redshift

The following examples demonstrate how to load an Esri shapefile using COPY. For more information about loading shapefiles, see Loading a shapefile into Amazon Redshift.

Loading a shapefile

The following steps show how to ingest OpenStreetMap data from Amazon S3 using the COPY command. This example assumes that the Norway shapefile archive from the download site of Geofabrik has been uploaded to a private Amazon S3 bucket in your AWS Region. The .shp, .shx, and .dbf files must share the same Amazon S3 prefix and file name.

Ingesting data without simplification

The following commands create tables and ingest data that can fit in the maximum geometry size without any simplification. Open the gis_osm_natural_free_1.shp in your preferred GIS software and inspect the columns in this layer. By default, either IDENTITY or GEOMETRY columns are first. When a GEOMETRY column is first, you can create the table as shown following.

CREATE TABLE norway_natural ( wkb_geometry GEOMETRY, osm_id BIGINT, code INT, fclass VARCHAR, name VARCHAR);

Or, when an IDENTITY column is first, you can create the table as shown following.

CREATE TABLE norway_natural_with_id ( fid INT IDENTITY(1,1), wkb_geometry GEOMETRY, osm_id BIGINT, code INT, fclass VARCHAR, name VARCHAR);

Now you can ingest the data using COPY.

COPY norway_natural FROM 's3://bucket_name/shapefiles/norway/gis_osm_natural_free_1.shp' FORMAT SHAPEFILE CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRoleName'; INFO: Load into table 'norway_natural' completed, 83891 record(s) loaded successfully

Or you can ingest the data as shown following.

COPY norway_natural_with_id FROM 's3://bucket_name/shapefiles/norway/gis_osm_natural_free_1.shp' FORMAT SHAPEFILE CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRoleName'; INFO: Load into table 'norway_natural_with_id' completed, 83891 record(s) loaded successfully.

Ingesting data with simplification

The following commands create a table and try to ingest data that can't fit in the maximum geometry size without any simplification. Inspect the gis_osm_water_a_free_1.shp shapefile and create the appropriate table as shown following.

CREATE TABLE norway_water ( wkb_geometry GEOMETRY, osm_id BIGINT, code INT, fclass VARCHAR, name VARCHAR);

When the COPY command runs, it results in an error.

COPY norway_water FROM 's3://bucket_name/shapefiles/norway/gis_osm_water_a_free_1.shp' FORMAT SHAPEFILE CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRoleName'; ERROR: Load into table 'norway_water' failed. Check 'stl_load_errors' system table for details.

Querying STL_LOAD_ERRORS shows that the geometry is too large.

SELECT line_number, btrim(colname), btrim(err_reason) FROM stl_load_errors WHERE query = pg_last_copy_id(); line_number | btrim | btrim -------------+--------------+----------------------------------------------------------------------- 1184705 | wkb_geometry | Geometry size: 1513736 is larger than maximum supported size: 1048447

To overcome this, the SIMPLIFY AUTO parameter is added to the COPY command to simplify geometries.

COPY norway_water FROM 's3://bucket_name/shapefiles/norway/gis_osm_water_a_free_1.shp' FORMAT SHAPEFILE SIMPLIFY AUTO CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRoleName'; INFO: Load into table 'norway_water' completed, 1989196 record(s) loaded successfully.

To view the rows and geometries that were simplified, query SVL_SPATIAL_SIMPLIFY.

SELECT * FROM svl_spatial_simplify WHERE query = pg_last_copy_id(); query | line_number | maximum_tolerance | initial_size | simplified | final_size | final_tolerance -------+-------------+-------------------+--------------+------------+------------+---------------------- 20 | 1184704 | -1 | 1513736 | t | 1008808 | 1.276386653895e-05 20 | 1664115 | -1 | 1233456 | t | 1023584 | 6.11707814796635e-06

Using SIMPLIFY AUTO max_tolerance with the tolerance lower than the automatically calculated ones probably results in an ingestion error. In this case, use MAXERROR to ignore errors.

COPY norway_water FROM 's3://bucket_name/shapefiles/norway/gis_osm_water_a_free_1.shp' FORMAT SHAPEFILE SIMPLIFY AUTO 1.1E-05 MAXERROR 2 CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRoleName'; INFO: Load into table 'norway_water' completed, 1989195 record(s) loaded successfully. INFO: Load into table 'norway_water' completed, 1 record(s) could not be loaded. Check 'stl_load_errors' system table for details.

Query SVL_SPATIAL_SIMPLIFY again to identify the record that COPY didn't manage to load.

SELECT * FROM svl_spatial_simplify WHERE query = pg_last_copy_id(); query | line_number | maximum_tolerance | initial_size | simplified | final_size | final_tolerance -------+-------------+-------------------+--------------+------------+------------+----------------- 29 | 1184704 | 1.1e-05 | 1513736 | f | 0 | 0 29 | 1664115 | 1.1e-05 | 1233456 | t | 794432 | 1.1e-05

In this example, the first record didn’t manage to fit, so the simplified column is showing false. The second record was loaded within the given tolerance. However, the final size is larger than using the automatically calculated tolerance without specifying the maximum tolerance.

Loading from a compressed shapefile

Amazon Redshift COPY supports ingesting data from a compressed shapefile. All shapefile components must have the same Amazon S3 prefix and the same compression suffix. As an example, suppose that you want to load the data from the previous example. In this case, the files gis_osm_water_a_free_1.shp.gz, gis_osm_water_a_free_1.dbf.gz, and gis_osm_water_a_free_1.shx.gz must share the same Amazon S3 directory. The COPY command requires the GZIP option, and the FROM clause must specify the correct compressed file, as shown following.

COPY norway_natural FROM 's3://bucket_name/shapefiles/norway/compressed/gis_osm_natural_free_1.shp.gz' FORMAT SHAPEFILE GZIP CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRoleName'; INFO: Load into table 'norway_natural' completed, 83891 record(s) loaded successfully.

Loading data into a table with a different column order

If you have a table that doesn't have GEOMETRY as the first column, you can use column mapping to map columns to the target table. For example, create a table with osm_id specified as a first column.

CREATE TABLE norway_natural_order ( osm_id BIGINT, wkb_geometry GEOMETRY, code INT, fclass VARCHAR, name VARCHAR);

Then ingest a shapefile using column mapping.

COPY norway_natural_order(wkb_geometry, osm_id, code, fclass, name) FROM 's3://bucket_name/shapefiles/norway/gis_osm_natural_free_1.shp' FORMAT SHAPEFILE CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRoleName'; INFO: Load into table 'norway_natural_order' completed, 83891 record(s) loaded successfully.

Loading data into a table with a geography column

If you have a table that has a GEOGRAPHY column, you first ingest into a GEOMETRY column and then cast the objects to GEOGRAPHY objects. For example, after you copy your shapefile into a GEOMETRY column, alter the table to add a column of the GEOGRAPHY data type.

ALTER TABLE norway_natural ADD COLUMN wkb_geography GEOGRAPHY;

Then convert geometries to geographies.

UPDATE norway_natural SET wkb_geography = wkb_geometry::geography;

Optionally, you can drop the GEOMETRY column.

ALTER TABLE norway_natural DROP COLUMN wkb_geometry;

COPY command with the NOLOAD option

To validate data files before you actually load the data, use the NOLOAD option with the COPY command. Amazon Redshift parses the input file and displays any errors that occur. The following example uses the NOLOAD option and no rows are actually loaded into the table.

COPY public.zipcode1 FROM 's3://amzn-s3-demo-bucket/mydata/zipcode.csv' DELIMITER ';' IGNOREHEADER 1 REGION 'us-east-1' NOLOAD CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/myRedshiftRole'; Warnings: Load into table 'zipcode1' completed, 0 record(s) loaded successfully.

COPY command with a multibyte delimiter and the ENCODING option

The following example loads LATIN1 from an Amazon S3 file that contains multibyte data. The COPY command specifies the delimiter in octal form \302\246\303\254 to separate the fields in the input file which is encoded as ISO-8859-1. To specify the same delimiter in UTF-8, specify DELIMITER '¦ì'.

COPY latin1 FROM 's3://amzn-s3-demo-bucket/multibyte/myfile' IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole' DELIMITER '\302\246\303\254' ENCODING ISO88591