Work with CSV data in AWS Glue

Focus mode

Work with CSV data in AWS Glue - Amazon Athena

Handling CSV data enclosed in quotes Handling CSV files with headers

This page describes how to use AWS Glue to create schema from CSV files that have quotes around the data values for each column or from CSV files that include header values.

Handling CSV data enclosed in quotes

Suppose a CSV file has data fields enclosed in double quotes, as in the following example.


"John","Doe","123-555-1231","John said \"hello\""
"Jane","Doe","123-555-9876","Jane said \"hello\""

To run a query in Athena on a table created from a CSV file that has quoted values, you must modify the table properties in AWS Glue to use the OpenCSVSerDe. For more information about the OpenCSV SerDe, see Open CSV SerDe for processing CSV.

To edit table properties in the AWS Glue console

In the AWS Glue console navigation pane, choose Tables.
Choose the link for the table that you want to edit, and then choose Actions, Edit table.
On the Edit table page, make the following changes:
- For Serialization lib, enter org.apache.hadoop.hive.serde2.OpenCSVSerde.
- For Serde parameters, enter the following values for the keys escapeChar, quoteChar, and separatorChar:
  - For escapeChar, enter a backslash (\).
  - For quoteChar, enter a double quote (").
  - For separatorChar, enter a comma (,).
Choose Save.

For more information, see Viewing and editing table details in the AWS Glue Developer Guide.

You can also update AWS Glue table properties programmatically. Use the AWS Glue UpdateTable API operation or the update-table AWS CLI command to modify the SerDeInfo block in the table definition, as in the following JSON example.


"SerDeInfo": {
   "name": "",
   "serializationLib": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
   "parameters": {
      "separatorChar": ","
      "quoteChar": "\""
      "escapeChar": "\\"
      }
},

Handling CSV files with headers

When you define a table in Athena with a CREATE TABLE statement, you can use the skip.header.line.count table property to ignore headers in your CSV data, as in the following example.


...
STORED AS TEXTFILE
LOCATION 's3://amzn-s3-demo-bucket/csvdata_folder/';
TBLPROPERTIES ("skip.header.line.count"="1")

Alternatively, you can remove the CSV headers beforehand so that the header information is not included in Athena query results. One way to achieve this is to use AWS Glue jobs, which perform extract, transform, and load (ETL) work. You can write scripts in AWS Glue using a language that is an extension of the PySpark Python dialect. For more information, see Authoring Jobs in AWS Glue in the AWS Glue Developer Guide.

The following example shows a function in an AWS Glue script that writes out a dynamic frame using from_options, and sets the writeHeader format option to false, which removes the header information:


glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://amzn-s3-demo-bucket/MYTABLEDATA/"}, format = "csv", format_options = {"writeHeader": False}, transformation_ctx = "datasink2")

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Create tables for ETL jobs

Work with geospatial data

Next topic:

Work with geospatial data

Previous topic:

Create tables for ETL jobs

Select your cookie preferences

Customize cookie preferences

Essential

Performance

Functional

Advertising

Unable to save cookie preferences

Work with CSV data in AWS Glue

Handling CSV data enclosed in quotes

To edit table properties in the AWS Glue console

Handling CSV files with headers

Next topic:

Previous topic:

Need help?

On this page

Did this page help you?