AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the CSV data format, this document introduces you available features for using your data in AWS Glue.
AWS Glue supports using the comma-separated value (CSV) format. This format is a minimal, row-based data format.
CSVs often don't strictly conform to a standard, but you can refer to RFC 4180
You can use AWS Glue to read CSVs from Amazon S3 and from streaming sources as well as write CSVs to Amazon S3. You can
read and write bzip
and gzip
archives containing CSV files from S3. You configure
compression behavior on the S3 connection parameters instead of in the configuration discussed on this
page.
The following table shows which common AWS Glue features support the CSV format option.
Read | Write | Streaming read | Group small files | Job bookmarks |
---|---|---|---|---|
Supported | Supported | Supported | Supported | Supported |
Example: Read CSV files or folders from
S3
Prerequisites: You will need the S3 paths (s3path
) to the
CSV files or folders that you want to read.
Configuration: In your function options, specify
format="csv"
. In your connection_options
, use the paths
key to
specify s3path
. You can configure how the reader interacts with S3 in
connection_options
. For details, see Connection types and options for ETL in AWS Glue: S3 connection parameters. You can configure how the reader interprets CSV
files in your format_options
. For details, see CSV Configuration Reference.
The following AWS Glue ETL script shows the process of reading CSV files or folders from S3.
We provide a custom CSV reader with performance optimizations for common workflows through the
optimizePerformance
configuration key. To determine if this reader is right for your workload,
see Optimize read performance with
vectorized SIMD CSV reader.
For this example, use the create_dynamic_frame.from_options method.
# Example: Read CSV from S3
# For show, we handle a CSV with a header row. Set the withHeader option.
# Consider whether optimizePerformance is right for your workflow.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
dynamicFrame = glueContext.create_dynamic_frame.from_options(
connection_type="s3",
connection_options={"paths": ["s3://s3path
"]},
format="csv",
format_options={
"withHeader": True,
# "optimizePerformance": True,
},
)
You can also use DataFrames in a script (pyspark.sql.DataFrame
).
dataFrame = spark.read\
.format("csv")\
.option("header", "true")\
.load("s3://s3path
")
Example: Write CSV files and folders to
S3
Prerequisites: You will need an initialized DataFrame
(dataFrame
) or a DynamicFrame (dynamicFrame
). You will also need your expected S3
output path, s3path
.
Configuration: In your function options, specify
format="csv"
. In your connection_options
, use the paths
key to
specify s3path
. You can configure how the writer interacts with S3 in
connection_options
. For details, see Connection types and options for ETL in AWS Glue: S3 connection parameters. You can configure how your
operation writes the contents of your files in format_options
. For details, see CSV Configuration Reference. The following
AWS Glue ETL script shows the process of writing CSV files and folders to S3.
For this example, use the write_dynamic_frame.from_options method.
# Example: Write CSV to S3
# For show, customize how we write string type values. Set quoteChar to -1 so our values are not quoted.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
glueContext.write_dynamic_frame.from_options(
frame=dynamicFrame,
connection_type="s3",
connection_options={"path": "s3://s3path
"},
format="csv",
format_options={
"quoteChar": -1,
},
)
You can also use DataFrames in a script (pyspark.sql.DataFrame
).
dataFrame.write\
.format("csv")\
.option("quote", None)\
.mode("append")\
.save("s3://s3path
")
CSV configuration reference
You can use the following format_options
wherever AWS Glue libraries specify
format="csv"
:
-
separator
–Specifies the delimiter character. The default is a comma, but any other character can be specified.-
Type: Text, Default:
","
-
-
escaper
– Specifies a character to use for escaping. This option is used only when reading CSV files, not writing. If enabled, the character that immediately follows is used as-is, except for a small set of well-known escapes (\n
,\r
,\t
, and\0
).-
Type: Text, Default: none
-
-
quoteChar
– Specifies the character to use for quoting. The default is a double quote. Set this to-1
to turn off quoting entirely.-
Type: Text, Default:
'"'
-
-
multiLine
– Specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option toTrue
if any record spans multiple lines. EnablingmultiLine
might decrease performance because it requires more cautious file-splitting while parsing.-
Type: Boolean, Default:
false
-
-
withHeader
– Specifies whether to treat the first line as a header. This option can be used in theDynamicFrameReader
class.-
Type: Boolean, Default:
false
-
-
writeHeader
– Specifies whether to write the header to output. This option can be used in theDynamicFrameWriter
class.-
Type: Boolean, Default:
true
-
-
skipFirst
– Specifies whether to skip the first data line.-
Type: Boolean, Default:
false
-
-
optimizePerformance
– Specifies whether to use the advanced SIMD CSV reader along with Apache Arrow–based columnar memory formats. Only available in AWS Glue 3.0+.-
Type: Boolean, Default:
false
-
-
strictCheckForQuoting
– When writing CSVs, Glue may add quotes to values it interprets as strings. This is done to prevent ambiguity in what is written out. To save time when deciding what to write, Glue may quote in certain situations where quotes are not necessary. Enabling a strict check will perform a more intensive computation and will only quote when strictly necessary. Only available in AWS Glue 3.0+.-
Type: Boolean, Default:
false
-
Optimize read performance with
vectorized SIMD CSV reader
AWS Glue version 3.0 adds an optimized CSV reader that can significantly speed up overall job performance compared to row-based CSV readers.
The optimized reader:
-
Uses CPU SIMD instructions to read from disk
-
Immediately writes records to memory in a columnar format (Apache Arrow)
-
Divides the records into batches
This saves processing time when records would be batched or converted to a columnar format later on. Some examples are when changing schemas or retrieving data by column.
To use the optimized reader, set "optimizePerformance"
to true
in the
format_options
or table property.
glueContext.create_dynamic_frame.from_options(
frame = datasource1,
connection_type = "s3",
connection_options = {"paths": ["s3://s3path"]},
format = "csv",
format_options={
"optimizePerformance": True,
"separator": ","
},
transformation_ctx = "datasink2")
Limitations for the vectorized CSV reader
Note the following limitations of the vectorized CSV reader:
-
It doesn't support the
multiLine
andescaper
format options. It uses the defaultescaper
of double quote char'"'
. When these options are set, AWS Glue automatically falls back to using the row-based CSV reader. -
It doesn't support creating a DynamicFrame with ChoiceType.
-
It doesn't support creating a DynamicFrame with error records.
-
It doesn't support reading CSV files with multibyte characters such as Japanese or Chinese characters.