예시 - Amazon Redshift

예시

다음 예에서는 spectrum이라는 Amazon Redshift 외부 스키마에서 SALES로 명명된 테이블을 생성합니다. 데이터는 탭으로 구분된 텍스트 파일입니다. TABLE PROPERTIES 절은 numRows 속성을 170,000개 행으로 설정합니다.

CREATE EXTERNAL TABLE을 실행할 때 사용하는 자격 증명에 따라 구성해야 하는 IAM 권한이 있을 수 있습니다. 가장 좋은 방법은 권한 정책을 IAM 역할에 연결한 다음 필요에 따라 사용자 및 그룹에 할당하는 것입니다. 자세한 내용은 Amazon Redshift의 Identity and Access Management를 참조하세요.

create external table spectrum.sales( salesid integer, listid integer, sellerid integer, buyerid integer, eventid integer, saledate date, qtysold smallint, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp) row format delimited fields terminated by '\t' stored as textfile location 's3://redshift-downloads/tickit/spectrum/sales/' table properties ('numRows'='170000');

다음 예에서는 JsonSerDe를 사용하여 JSON 형식의 데이터를 참조하는 테이블을 생성합니다.

create external table spectrum.cloudtrail_json ( event_version int, event_id bigint, event_time timestamp, event_type varchar(10), awsregion varchar(20), event_name varchar(max), event_source varchar(max), requesttime timestamp, useragent varchar(max), recipientaccountid bigint) row format serde 'org.openx.data.jsonserde.JsonSerDe' with serdeproperties ( 'dots.in.keys' = 'true', 'mapping.requesttime' = 'requesttimestamp' ) location 's3://amzn-s3-demo-bucket/json/cloudtrail';

다음 CREATE EXTERNAL TABLE AS 예는 분할되지 않은 외부 테이블을 생성합니다. 그런 다음 대상 Amazon S3 위치에 Apache Parquet로 SELECT 쿼리의 결과를 작성합니다.

CREATE EXTERNAL TABLE spectrum.lineitem STORED AS parquet LOCATION 'S3://amzn-s3-demo-bucket/cetas/lineitem/' AS SELECT * FROM local_lineitem;

다음 예에서는 분할된 외부 테이블을 생성하고 SELECT 쿼리에 파티션 열을 포함합니다.

CREATE EXTERNAL TABLE spectrum.partitioned_lineitem PARTITIONED BY (l_shipdate, l_shipmode) STORED AS parquet LOCATION 'S3://amzn-s3-demo-bucket/cetas/partitioned_lineitem/' AS SELECT l_orderkey, l_shipmode, l_shipdate, l_partkey FROM local_table;

외부 데이터 카탈로그에 있는 기존 데이터베이스의 목록을 보려면 SVV_EXTERNAL_DATABASES 시스템 뷰를 쿼리하십시오.

select eskind,databasename,esoptions from svv_external_databases order by databasename;
eskind | databasename | esoptions -------+--------------+---------------------------------------------------------------------------------- 1 | default | {"REGION":"us-west-2","IAM_ROLE":"arn:aws:iam::123456789012:role/mySpectrumRole"} 1 | sampledb | {"REGION":"us-west-2","IAM_ROLE":"arn:aws:iam::123456789012:role/mySpectrumRole"} 1 | spectrumdb | {"REGION":"us-west-2","IAM_ROLE":"arn:aws:iam::123456789012:role/mySpectrumRole"}

외부 테이블의 세부 정보를 보려면 SVV_EXTERNAL_TABLESSVV_EXTERNAL_COLUMNS 시스템 뷰를 쿼리하십시오.

다음 예에서는 SVV_EXTERNAL_TABLES 뷰를 쿼리합니다.

select schemaname, tablename, location from svv_external_tables;
schemaname | tablename | location -----------+----------------------+-------------------------------------------------------- spectrum | sales | s3://redshift-downloads/tickit/spectrum/sales spectrum | sales_part | s3://redshift-downloads/tickit/spectrum/sales_partition

다음 예에서는 SVV_EXTERNAL_COLUMNS 뷰를 쿼리합니다.

select * from svv_external_columns where schemaname like 'spectrum%' and tablename ='sales';
schemaname | tablename | columnname | external_type | columnnum | part_key -----------+-----------+------------+---------------+-----------+--------- spectrum | sales | salesid | int | 1 | 0 spectrum | sales | listid | int | 2 | 0 spectrum | sales | sellerid | int | 3 | 0 spectrum | sales | buyerid | int | 4 | 0 spectrum | sales | eventid | int | 5 | 0 spectrum | sales | saledate | date | 6 | 0 spectrum | sales | qtysold | smallint | 7 | 0 spectrum | sales | pricepaid | decimal(8,2) | 8 | 0 spectrum | sales | commission | decimal(8,2) | 9 | 0 spectrum | sales | saletime | timestamp | 10 | 0

테이블 파티션을 보려면 다음 쿼리를 사용하십시오.

select schemaname, tablename, values, location from svv_external_partitions where tablename = 'sales_part';
schemaname | tablename | values | location -----------+------------+----------------+------------------------------------------------------------------------- spectrum | sales_part | ["2008-01-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01 spectrum | sales_part | ["2008-02-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02 spectrum | sales_part | ["2008-03-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03 spectrum | sales_part | ["2008-04-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-04 spectrum | sales_part | ["2008-05-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-05 spectrum | sales_part | ["2008-06-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-06 spectrum | sales_part | ["2008-07-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-07 spectrum | sales_part | ["2008-08-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-08 spectrum | sales_part | ["2008-09-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-09 spectrum | sales_part | ["2008-10-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-10 spectrum | sales_part | ["2008-11-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-11 spectrum | sales_part | ["2008-12-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-12

다음은 외부 테이블의 관련 데이터 파일의 총 크기를 반환하는 예입니다.

select distinct "$path", "$size" from spectrum.sales_part; $path | $size --------------------------------------------------------------------------+------- s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01/ | 1616 s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02/ | 1444 s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02/ | 1444

파티셔닝 예

날짜 기준으로 파티셔닝된 외부 테이블을 만들려면 다음 명령을 실행합니다.

create external table spectrum.sales_part( salesid integer, listid integer, sellerid integer, buyerid integer, eventid integer, dateid smallint, qtysold smallint, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp) partitioned by (saledate date) row format delimited fields terminated by '|' stored as textfile location 's3://redshift-downloads/tickit/spectrum/sales_partition/' table properties ('numRows'='170000');

파티션을 추가하려면 다음 ALTER TABLE 명령을 실행합니다.

alter table spectrum.sales_part add if not exists partition (saledate='2008-01-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-02-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-03-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-04-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-04/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-05-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-05/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-06-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-06/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-07-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-07/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-08-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-08/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-09-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-09/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-10-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-10/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-11-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-11/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-12-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-12/';

파티셔닝된 테이블에서 데이터를 선택하려면 다음 쿼리를 실행합니다.

select top 10 spectrum.sales_part.eventid, sum(spectrum.sales_part.pricepaid) from spectrum.sales_part, event where spectrum.sales_part.eventid = event.eventid and spectrum.sales_part.pricepaid > 30 and saledate = '2008-12-01' group by spectrum.sales_part.eventid order by 2 desc;
eventid | sum --------+--------- 914 | 36173.00 5478 | 27303.00 5061 | 26383.00 4406 | 26252.00 5324 | 24015.00 1829 | 23911.00 3601 | 23616.00 3665 | 23214.00 6069 | 22869.00 5638 | 22551.00

외부 테이블 파티션을 보려면 SVV_EXTERNAL_PARTITIONS 시스템 뷰를 쿼리하십시오.

select schemaname, tablename, values, location from svv_external_partitions where tablename = 'sales_part';
schemaname | tablename | values | location -----------+------------+----------------+-------------------------------------------------- spectrum | sales_part | ["2008-01-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01 spectrum | sales_part | ["2008-02-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02 spectrum | sales_part | ["2008-03-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03 spectrum | sales_part | ["2008-04-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-04 spectrum | sales_part | ["2008-05-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-05 spectrum | sales_part | ["2008-06-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-06 spectrum | sales_part | ["2008-07-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-07 spectrum | sales_part | ["2008-08-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-08 spectrum | sales_part | ["2008-09-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-09 spectrum | sales_part | ["2008-10-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-10 spectrum | sales_part | ["2008-11-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-11 spectrum | sales_part | ["2008-12-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-12

행 형식 예

다음은 AVRO 형식으로 저장된 데이터 파일에 대해 ROW FORMAT SERDE 파라미터를 지정하는 예입니다.

create external table spectrum.sales(salesid int, listid int, sellerid int, buyerid int, eventid int, dateid int, qtysold int, pricepaid decimal(8,2), comment VARCHAR(255)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{\"namespace\": \"dory.sample\",\"name\": \"dory_avro\",\"type\": \"record\", \"fields\": [{\"name\":\"salesid\", \"type\":\"int\"}, {\"name\":\"listid\", \"type\":\"int\"}, {\"name\":\"sellerid\", \"type\":\"int\"}, {\"name\":\"buyerid\", \"type\":\"int\"}, {\"name\":\"eventid\",\"type\":\"int\"}, {\"name\":\"dateid\",\"type\":\"int\"}, {\"name\":\"qtysold\",\"type\":\"int\"}, {\"name\":\"pricepaid\", \"type\": {\"type\": \"bytes\", \"logicalType\": \"decimal\", \"precision\": 8, \"scale\": 2}}, {\"name\":\"comment\",\"type\":\"string\"}]}') STORED AS AVRO location 's3://amzn-s3-demo-bucket/avro/sales' ;

다음은 RegEx를 사용하여 ROW FORMAT SERDE 파라미터를 지정하는 예입니다.

create external table spectrum.types( cbigint bigint, cbigint_null bigint, cint int, cint_null int) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties ('input.regex'='([^\\x01]+)\\x01([^\\x01]+)\\x01([^\\x01]+)\\x01([^\\x01]+)') stored as textfile location 's3://amzn-s3-demo-bucket/regex/types';

다음은 Grok를 사용하여 ROW FORMAT SERDE 파라미터를 지정하는 예입니다.

create external table spectrum.grok_log( timestamp varchar(255), pid varchar(255), loglevel varchar(255), progname varchar(255), message varchar(255)) row format serde 'com.amazonaws.glue.serde.GrokSerDe' with serdeproperties ('input.format'='[DFEWI], \\[%{TIMESTAMP_ISO8601:timestamp} #%{POSINT:pid:int}\\] *(?<loglevel>:DEBUG|FATAL|ERROR|WARN|INFO) -- +%{DATA:progname}: %{GREEDYDATA:message}') stored as textfile location 's3://DOC-EXAMPLE-BUCKET/grok/logs';

다음은 S3 버킷에서 Amazon S3 서버 액세스 로그를 정의하는 예입니다. Redshift Spectrum을 사용하여 Amazon S3 액세스 로그를 쿼리할 수 있습니다.

CREATE EXTERNAL TABLE spectrum.mybucket_s3_logs( bucketowner varchar(255), bucket varchar(255), requestdatetime varchar(2000), remoteip varchar(255), requester varchar(255), requested varchar(255), operation varchar(255), key varchar(255), requesturi_operation varchar(255), requesturi_key varchar(255), requesturi_httpprotoversion varchar(255), httpstatus varchar(255), errorcode varchar(255), bytessent bigint, objectsize bigint, totaltime varchar(255), turnaroundtime varchar(255), referrer varchar(255), useragent varchar(255), versionid varchar(255) ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*)\\s*([^ ]*)\\s*([^ ]*)\" (- |[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*).*$') LOCATION 's3://amzn-s3-demo-bucket/s3logs’;

다음은 ION 형식 데이터에 대해 ROW FORMAT SERDE 파라미터를 지정하는 예입니다.

CREATE EXTERNAL TABLE tbl_name (columns) ROW FORMAT SERDE 'com.amazon.ionhiveserde.IonHiveSerDe' STORED AS INPUTFORMAT 'com.amazon.ionhiveserde.formats.IonInputFormat' OUTPUTFORMAT 'com.amazon.ionhiveserde.formats.IonOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/prefix'

데이터 처리 예

다음 예제에서는 spi_global_rankings.csv 파일에 액세스합니다. 이 예에서 보여주는 것처럼 spi_global_rankings.csv 파일을 Amazon S3 버킷에 업로드할 수 있습니다.

다음 예에서는 외부 스키마 schema_spectrum_uddhspectrum_db_uddh 데이터베이스를 생성합니다. aws-account-id에 AWS 계정 ID, role-name에 Redshift Spectrum 역할 이름을 입력합니다.

create external schema schema_spectrum_uddh from data catalog database 'spectrum_db_uddh' iam_role 'arn:aws:iam::aws-account-id:role/role-name' create external database if not exists;

다음 예에서는 외부 스키마 schema_spectrum_uddh에서 외부 테이블 soccer_league를 생성합니다.

CREATE EXTERNAL TABLE schema_spectrum_uddh.soccer_league ( league_rank smallint, prev_rank smallint, club_name varchar(15), league_name varchar(20), league_off decimal(6,2), league_def decimal(6,2), league_spi decimal(6,2), league_nspi integer ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n\l' stored as textfile LOCATION 's3://spectrum-uddh/league/' table properties ('skip.header.line.count'='1');

soccer_league 테이블에 포함된 행의 수를 확인합니다.

select count(*) from schema_spectrum_uddh.soccer_league;

행 수가 표시됩니다.

count 645

다음 쿼리는 상위 10개 클럽을 표시합니다. 클럽 Barcelona의 문자열에 잘못된 문자가 있으므로 이름에 NULL이 표시됩니다.

select league_rank,club_name,league_name,league_nspi from schema_spectrum_uddh.soccer_league where league_rank between 1 and 10;
league_rank club_name league_name league_nspi 1 Manchester City Barclays Premier Lea 34595 2 Bayern Munich German Bundesliga 34151 3 Liverpool Barclays Premier Lea 33223 4 Chelsea Barclays Premier Lea 32808 5 Ajax Dutch Eredivisie 32790 6 Atletico Madrid Spanish Primera Divi 31517 7 Real Madrid Spanish Primera Divi 31469 8 NULL Spanish Primera Divi 31321 9 RB Leipzig German Bundesliga 31014 10 Paris Saint-Ger French Ligue 1 30929

다음 예제에서는 invalid_char_handling, replacement_char, data_cleansing_enabled 외부 테이블 속성을 지정하여 예기치 않은 문자의 대체 문자로 물음표(?)를 삽입하도록 soccer_league 테이블을 변경합니다.

alter table schema_spectrum_uddh.soccer_league set table properties ('invalid_char_handling'='REPLACE','replacement_char'='?','data_cleansing_enabled'='true');

다음 예제에서는 soccer_league 테이블에서 순위가 1~10위인 팀을 쿼리합니다.

select league_rank,club_name,league_name,league_nspi from schema_spectrum_uddh.soccer_league where league_rank between 1 and 10;

테이블 속성이 변경되어 결과에 상위 10개 클럽과 Barcelona 클럽의 8번째 행을 대체하는 물음표(?) 문자가 표시됩니다.

league_rank club_name league_name league_nspi 1 Manchester City Barclays Premier Lea 34595 2 Bayern Munich German Bundesliga 34151 3 Liverpool Barclays Premier Lea 33223 4 Chelsea Barclays Premier Lea 32808 5 Ajax Dutch Eredivisie 32790 6 Atletico Madrid Spanish Primera Divi 31517 7 Real Madrid Spanish Primera Divi 31469 8 Barcel?na Spanish Primera Divi 31321 9 RB Leipzig German Bundesliga 31014 10 Paris Saint-Ger French Ligue 1 30929

다음 예제에서는 invalid_char_handling 외부 테이블 속성을 지정하여 예기치 않은 문자가 있는 행을 삭제하도록 soccer_league 테이블을 변경합니다.

alter table schema_spectrum_uddh.soccer_league set table properties ('invalid_char_handling'='DROP_ROW','data_cleansing_enabled'='true');

다음 예제에서는 soccer_league 테이블에서 순위가 1~10위인 팀을 쿼리합니다.

select league_rank,club_name,league_name,league_nspi from schema_spectrum_uddh.soccer_league where league_rank between 1 and 10;

결과에는 상위 클럽이 Barcelona 클럽의 8번째 행은 포함하지 않고 표시됩니다.

league_rank club_name league_name league_nspi 1 Manchester City Barclays Premier Lea 34595 2 Bayern Munich German Bundesliga 34151 3 Liverpool Barclays Premier Lea 33223 4 Chelsea Barclays Premier Lea 32808 5 Ajax Dutch Eredivisie 32790 6 Atletico Madrid Spanish Primera Divi 31517 7 Real Madrid Spanish Primera Divi 31469 9 RB Leipzig German Bundesliga 31014 10 Paris Saint-Ger French Ligue 1 30929