Work with CSV data in AWS Glue
This page describes how to use AWS Glue to create schema from CSV files that have quotes around the data values for each column or from CSV files that include header values.
Handling CSV data enclosed in quotes
Suppose a CSV file has data fields enclosed in double quotes, as in the following example.
"John","Doe","123-555-1231","John said \"hello\""
"Jane","Doe","123-555-9876","Jane said \"hello\""
To run a query in Athena on a table created from a CSV file that has quoted values, you must modify the table properties in AWS Glue to use the OpenCSVSerDe. For more information about the OpenCSV SerDe, see Open CSV SerDe for processing CSV.
To edit table properties in the AWS Glue console
-
In the AWS Glue console navigation pane, choose Tables.
-
Choose the link for the table that you want to edit, and then choose Actions, Edit table.
-
On the Edit table page, make the following changes:
-
For Serialization lib, enter
org.apache.hadoop.hive.serde2.OpenCSVSerde
. -
For Serde parameters, enter the following values for the keys
escapeChar
,quoteChar
, andseparatorChar
:-
For
escapeChar
, enter a backslash (\
). -
For
quoteChar
, enter a double quote ("
). -
For
separatorChar
, enter a comma (,
).
-
-
-
Choose Save.
For more information, see Viewing and editing table details in the AWS Glue Developer Guide.
You can also update AWS Glue table properties programmatically. Use the AWS Glue UpdateTable API operation or the update-table AWS CLI command
to modify the SerDeInfo
block in the table definition, as in the following
JSON example.
"SerDeInfo": {
"name": "",
"serializationLib": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
"parameters": {
"separatorChar": ","
"quoteChar": "\""
"escapeChar": "\\"
}
},
Handling CSV files with headers
When you define a table in Athena with a CREATE TABLE
statement, you can
use the skip.header.line.count
table property to ignore headers in your CSV
data, as in the following example.
... STORED AS TEXTFILE LOCATION 's3://amzn-s3-demo-bucket/
csvdata_folder
/'; TBLPROPERTIES ("skip.header.line.count"="1")
Alternatively, you can remove the CSV headers beforehand so that the header information is not included in Athena query results. One way to achieve this is to use AWS Glue jobs, which perform extract, transform, and load (ETL) work. You can write scripts in AWS Glue using a language that is an extension of the PySpark Python dialect. For more information, see Authoring Jobs in AWS Glue in the AWS Glue Developer Guide.
The following example shows a function in an AWS Glue script that writes out a dynamic
frame using from_options
, and sets the writeHeader
format
option to false, which removes the header information:
glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://amzn-s3-demo-bucket/MYTABLEDATA/"}, format = "csv", format_options = {"writeHeader": False}, transformation_ctx = "datasink2")