Using S3 Select with Hive to improve performance
Important
Amazon S3 Select is no longer available to new customers. Existing customers of Amazon S3 Select can continue to use
the feature as usual. Learn more
With Amazon EMR release version 5.18.0 and later, you can use S3 Select
S3 Select is supported with Hive tables based on CSV and JSON files and by setting
the s3select.filter
configuration variable to true
during
your Hive session. For more information and examples, see Specifying S3 Select in your
code.
Is S3 Select right for my application?
We recommend that you benchmark your applications with and without S3 Select to see if using it may be suitable for your application.
Use the following guidelines to determine if your application is a candidate for using S3 Select:
-
Your query filters out more than half of the original dataset.
-
Your query filter predicates use columns that have a data type supported by Amazon S3 Select. For more information, see Data types in the Amazon Simple Storage Service User Guide.
-
Your network connection between Amazon S3 and the Amazon EMR cluster has good transfer speed and available bandwidth. Amazon S3 does not compress HTTP responses, so the response size is likely to increase for compressed input files.
Considerations and limitations
-
Amazon S3 server-side encryption with customer-provided encryption keys (SSE-C) and client-side encryption are not supported.
-
The
AllowQuotedRecordDelimiters
property is not supported. If this property is specified, the query fails. -
Only CSV and JSON files in UTF-8 format are supported. Multi-line CSVs and JSON are not supported.
-
Only uncompressed or gzip or bzip2 files are supported.
-
Comment characters in the last line are not supported.
-
Empty lines at the end of a file are not processed.
-
Hive on Amazon EMR supports the primitive data types that S3 Select supports. For more information, see Data types in the Amazon Simple Storage Service User Guide.
Specifying S3 Select in your code
To use S3 Select in your Hive table, create the table by specifying
com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat
as
the INPUTFORMAT
class name, and specify a value for the
s3select.format
property using the TBLPROPERTIES
clause.
By default, S3 Select is disabled when you run queries. Enable S3 Select by
setting s3select.filter
to true
in your Hive session
as shown below. The examples below demonstrate how to specify S3 Select when
creating a table from underlying CSV and JSON files and then querying the table
using a simple select statement.
Example CREATE TABLE statement for CSV-based table
CREATE TABLE mys3selecttable ( col1 string, col2 int, col3 boolean ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS INPUTFORMAT 'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://path/to/mycsvfile/' TBLPROPERTIES ( "s3select.format" = "csv", "s3select.headerInfo" = "ignore" );
Example CREATE TABLE statement for JSON-based table
CREATE TABLE mys3selecttable ( col1 string, col2 int, col3 boolean ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS INPUTFORMAT 'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://path/to/json/' TBLPROPERTIES ( "s3select.format" = "json" );
Example SELECT TABLE statement
SET s3select.filter=true; SELECT * FROM mys3selecttable WHERE col2 > 10;