

# Processing DynamoDB data with Apache Hive on Amazon EMR
<a name="EMRforDynamoDB"></a>

Amazon DynamoDB is integrated with Apache Hive, a data warehousing application that runs on Amazon EMR. Hive can read and write data in DynamoDB tables, allowing you to:
+ Query live DynamoDB data using a SQL-like language (HiveQL).
+ Copy data from a DynamoDB table to an Amazon S3 bucket, and vice versa.
+ Copy data from a DynamoDB table into the Hadoop Distributed File System (HDFS), and vice versa.
+ Perform join operations on DynamoDB tables.

**Topics**
+ [Overview](#EMRforDynamoDB.Overview)
+ [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md)
+ [Creating an external table in Hive](EMRforDynamoDB.ExternalTableForDDB.md)
+ [Processing HiveQL statements](EMRforDynamoDB.ProcessingHiveQL.md)
+ [Querying data in DynamoDB](EMRforDynamoDB.Querying.md)
+ [Copying data to and from Amazon DynamoDB](EMRforDynamoDB.CopyingData.md)
+ [Performance tuning](EMRforDynamoDB.PerformanceTuning.md)

## Overview
<a name="EMRforDynamoDB.Overview"></a>

Amazon EMR is a service that makes it easy to quickly and cost-effectively process vast amounts of data. To use Amazon EMR, you launch a managed cluster of Amazon EC2 instances running the Hadoop open source framework. *Hadoop* is a distributed application that implements the MapReduce algorithm, where a task is mapped to multiple nodes in the cluster. Each node processes its designated work, in parallel with the other nodes. Finally, the outputs are reduced on a single node, yielding the final result.
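The map, shuffle, and reduce phases can be sketched in a few lines of Python. This is a single-process toy illustration of the concept only, not Hadoop or Amazon EMR code, and the sample records are hypothetical:

```python
from collections import defaultdict

def map_phase(records):
    # Each mapper emits (key, value) pairs for its share of the input.
    for state, elev_in_ft in records:
        yield (state, elev_in_ft)

def shuffle(pairs):
    # The framework groups the emitted pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducers combine each key's values into a final result.
    return {key: max(values) for key, values in groups.items()}

records = [("WV", 794), ("NJ", 7), ("WV", 1558)]
result = reduce_phase(shuffle(map_phase(records)))
# result == {"WV": 1558, "NJ": 7}
```

In a real cluster, the map and reduce functions run on many nodes in parallel; the shuffle is handled by the framework between the two phases.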

You can choose to launch your Amazon EMR cluster so that it is persistent or transient:
+ A *persistent* cluster runs until you shut it down. Persistent clusters are ideal for data analysis, data warehousing, or any other interactive use.
+ A *transient* cluster runs long enough to process a job flow, and then shuts down automatically. Transient clusters are ideal for periodic processing tasks, such as running scripts.

For information about Amazon EMR architecture and administration, see the [Amazon EMR Management Guide](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide).

When you launch an Amazon EMR cluster, you specify the initial number and type of Amazon EC2 instances. You also specify other distributed applications (in addition to Hadoop itself) that you want to run on the cluster. These applications include Hue, Mahout, Pig, Spark, and more.

For information about applications for Amazon EMR, see the [Amazon EMR Release Guide](https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide).

Depending on the cluster configuration, you might have one or more of the following node types:
+ Leader node — Manages the cluster, coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups. It also tracks the status of each task performed and monitors the health of the instance groups. There is only one leader node in a cluster.
+ Core nodes — Run MapReduce tasks and store data using the Hadoop Distributed File System (HDFS).
+ Task nodes (optional) — Run MapReduce tasks.

# Tutorial: Working with Amazon DynamoDB and Apache Hive
<a name="EMRforDynamoDB.Tutorial"></a>

In this tutorial, you will launch an Amazon EMR cluster, and then use Apache Hive to process data stored in a DynamoDB table.

*Hive* is a data warehouse application for Hadoop that allows you to process and analyze data from multiple sources. Hive provides a SQL-like language, *HiveQL*, that lets you work with data stored locally in the Amazon EMR cluster or in an external data source (such as Amazon DynamoDB).

For more information, see the [Hive Tutorial](https://cwiki.apache.org/confluence/display/Hive/Tutorial).

**Topics**
+ [Before you begin](#EMRforDynamoDB.Tutorial.BeforeYouBegin)
+ [Step 1: Create an Amazon EC2 key pair](EMRforDynamoDB.Tutorial.EC2KeyPair.md)
+ [Step 2: Launch an Amazon EMR cluster](EMRforDynamoDB.Tutorial.LaunchEMRCluster.md)
+ [Step 3: Connect to the Leader node](EMRforDynamoDB.Tutorial.ConnectToLeaderNode.md)
+ [Step 4: Load data into HDFS](EMRforDynamoDB.Tutorial.LoadDataIntoHDFS.md)
+ [Step 5: Copy data to DynamoDB](EMRforDynamoDB.Tutorial.CopyDataToDDB.md)
+ [Step 6: Query the data in the DynamoDB table](EMRforDynamoDB.Tutorial.QueryDataInDynamoDB.md)
+ [Step 7: (Optional) clean up](EMRforDynamoDB.Tutorial.CleanUp.md)

## Before you begin
<a name="EMRforDynamoDB.Tutorial.BeforeYouBegin"></a>

For this tutorial, you will need the following:
+ An AWS account. If you do not have one, see [Signing up for AWS](SettingUp.DynamoWebService.md#SettingUp.DynamoWebService.SignUpForAWS).
+ An SSH client (Secure Shell). You use the SSH client to connect to the leader node of the Amazon EMR cluster and run interactive commands. SSH clients are available by default on most Linux, Unix, and Mac OS X installations. Windows users can download and install the [PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/) client, which has SSH support.

**Next step**  
[Step 1: Create an Amazon EC2 key pair](EMRforDynamoDB.Tutorial.EC2KeyPair.md)

# Step 1: Create an Amazon EC2 key pair
<a name="EMRforDynamoDB.Tutorial.EC2KeyPair"></a>

In this step, you will create the Amazon EC2 key pair you need to connect to an Amazon EMR leader node and run Hive commands.

1. Sign in to the AWS Management Console and open the Amazon EC2 console at [https://console.aws.amazon.com/ec2/](https://console.aws.amazon.com/ec2/).

1. Choose a region (for example, `US West (Oregon)`). This should be the same region in which your DynamoDB table is located.

1. In the navigation pane, choose **Key Pairs**.

1. Choose **Create Key Pair**. 

1. In **Key pair name**, type a name for your key pair (for example, `mykeypair`), and then choose **Create**. 

1. Download the private key file. The file name will end with `.pem` (such as `mykeypair.pem`). Keep this private key file in a safe place. You will need it to access any Amazon EMR cluster that you launch with this key pair. 
**Important**  
If you lose the key pair, you cannot connect to the leader node of your Amazon EMR cluster.

   For more information about key pairs, see [Amazon EC2 Key Pairs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) in the *Amazon EC2 User Guide*. 

**Next step**  
[Step 2: Launch an Amazon EMR cluster](EMRforDynamoDB.Tutorial.LaunchEMRCluster.md)

# Step 2: Launch an Amazon EMR cluster
<a name="EMRforDynamoDB.Tutorial.LaunchEMRCluster"></a>

In this step, you will configure and launch an Amazon EMR cluster. Hive and a storage handler for DynamoDB will already be installed on the cluster.

1. Open the Amazon EMR console at [https://console.aws.amazon.com/emr/](https://console.aws.amazon.com/emr/).

1. Choose **Create Cluster**.

1. On the **Create Cluster - Quick Options** page, do the following:

   1. In **Cluster name**, type a name for your cluster (for example: `My EMR cluster`).

   1. In **EC2 key pair**, choose the key pair you created earlier.

   Leave the other settings at their defaults.

1. Choose **Create cluster**.

It will take several minutes to launch your cluster. You can use the **Cluster Details** page in the Amazon EMR console to monitor its progress.

When the status changes to `Waiting`, the cluster is ready for use.

## Cluster log files and Amazon S3
<a name="EMRforDynamoDB.Tutorial.LaunchEMRCluster.LogFilesAndS3"></a>

An Amazon EMR cluster generates log files that contain information about the cluster status and debugging information. The default settings for **Create Cluster - Quick Options** include setting up Amazon EMR logging.

If one does not already exist, the AWS Management Console creates an Amazon S3 bucket. The bucket name is `aws-logs-account-id-region`, where `account-id` is your AWS account number and `region` is the region in which you launched the cluster (for example, `aws-logs-123456789012-us-west-2`).

**Note**  
You can use the Amazon S3 console to view the log files. For more information, see [View Log Files](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-manage-view-web-log-files.html) in the *Amazon EMR Management Guide*.

You can use this bucket for purposes in addition to logging. For example, you can use the bucket as a location for storing a Hive script or as a destination when exporting data from Amazon DynamoDB to Amazon S3.

**Next step**  
[Step 3: Connect to the Leader node](EMRforDynamoDB.Tutorial.ConnectToLeaderNode.md)

# Step 3: Connect to the Leader node
<a name="EMRforDynamoDB.Tutorial.ConnectToLeaderNode"></a>

When the status of your Amazon EMR cluster changes to `Waiting`, you will be able to connect to the leader node using SSH and perform command line operations.

1. In the Amazon EMR console, choose your cluster's name to view its status.

1. On the **Cluster Details** page, find the **Leader public DNS** field. This is the public DNS name for the leader node of your Amazon EMR cluster.

1. To the right of the DNS name, choose the **SSH** link.

1. Follow the instructions in **Connect to the Leader Node Using SSH**.

   Depending on your operating system, choose the **Windows** tab or the **Mac/Linux** tab, and follow the instructions for connecting to the leader node.

After you connect to the leader node using either SSH or PuTTY, you should see a command prompt similar to the following:

```
[hadoop@ip-192-0-2-0 ~]$ 
```

**Next step**  
[Step 4: Load data into HDFS](EMRforDynamoDB.Tutorial.LoadDataIntoHDFS.md)

# Step 4: Load data into HDFS
<a name="EMRforDynamoDB.Tutorial.LoadDataIntoHDFS"></a>

In this step, you will copy a data file into the Hadoop Distributed File System (HDFS), and then create a native Hive table and load it with data from the file.

**Download the sample data**

1. Download the sample data archive (`features.zip`):

   ```
   wget https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/samples/features.zip
   ```

1. Extract the `features.txt` file from the archive:

   ```
   unzip features.zip
   ```

1. View the first few lines of the `features.txt` file:

   ```
   head features.txt
   ```

   The result should look similar to this:

   ```
   1535908|Big Run|Stream|WV|38.6370428|-80.8595469|794
   875609|Constable Hook|Cape|NJ|40.657881|-74.0990309|7
   1217998|Gooseberry Island|Island|RI|41.4534361|-71.3253284|10
   26603|Boone Moore Spring|Spring|AZ|34.0895692|-111.410065|3681
   1506738|Missouri Flat|Flat|WA|46.7634987|-117.0346113|2605
   1181348|Minnow Run|Stream|PA|40.0820178|-79.3800349|1558
   1288759|Hunting Creek|Stream|TN|36.343969|-83.8029682|1024
   533060|Big Charles Bayou|Bay|LA|29.6046517|-91.9828654|0
   829689|Greenwood Creek|Stream|NE|41.596086|-103.0499296|3671
   541692|Button Willow Island|Island|LA|31.9579389|-93.0648847|98
   ```

   The `features.txt` file contains a subset of data from the United States Board on Geographic Names ([http://geonames.usgs.gov/domestic/download\_data.htm](http://geonames.usgs.gov/domestic/download_data.htm)). The fields in each line represent the following:
   + Feature ID (unique identifier)
   + Name
   + Class (lake; forest; stream; and so on)
   + State
   + Latitude (degrees)
   + Longitude (degrees)
   + Height (in feet)

1. At the command prompt, enter the following command:

   ```
   hive
   ```

   The command prompt changes to this: `hive>` 

1. Enter the following HiveQL statement to create a native Hive table:

   ```
   CREATE TABLE hive_features
       (feature_id             BIGINT,
       feature_name            STRING,
       feature_class           STRING,
       state_alpha             STRING,
       prim_lat_dec            DOUBLE,
       prim_long_dec           DOUBLE,
       elev_in_ft              BIGINT)
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '\n';
   ```

1. Enter the following HiveQL statement to load the table with data:

   ```
   LOAD DATA
   LOCAL
   INPATH './features.txt'
   OVERWRITE
   INTO TABLE hive_features;
   ```

1. You now have a native Hive table populated with data from the `features.txt` file. To verify, enter the following HiveQL statement:

   ```
   SELECT state_alpha, COUNT(*)
   FROM hive_features
   GROUP BY state_alpha;
   ```

   The output should show a list of states and the number of geographic features in each.
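As a cross-check, the same parse-and-count logic can be sketched in plain Python using a few of the sample lines shown earlier (illustrative only; the dictionary keys below simply mirror the Hive column names):

```python
from collections import Counter

# Three sample lines from features.txt, in the format described above.
sample = """\
1535908|Big Run|Stream|WV|38.6370428|-80.8595469|794
875609|Constable Hook|Cape|NJ|40.657881|-74.0990309|7
1506738|Missouri Flat|Flat|WA|46.7634987|-117.0346113|2605"""

features = []
for line in sample.splitlines():
    f = line.split("|")   # FIELDS TERMINATED BY '|'
    features.append({
        "feature_id": int(f[0]),
        "feature_name": f[1],
        "feature_class": f[2],
        "state_alpha": f[3],
        "prim_lat_dec": float(f[4]),
        "prim_long_dec": float(f[5]),
        "elev_in_ft": int(f[6]),
    })

# Equivalent of: SELECT state_alpha, COUNT(*) FROM hive_features GROUP BY state_alpha
counts = Counter(feat["state_alpha"] for feat in features)
# counts == {"WV": 1, "NJ": 1, "WA": 1}
```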

**Next step**  
[Step 5: Copy data to DynamoDB](EMRforDynamoDB.Tutorial.CopyDataToDDB.md)

# Step 5: Copy data to DynamoDB
<a name="EMRforDynamoDB.Tutorial.CopyDataToDDB"></a>

In this step, you will copy data from the Hive table (`hive_features`) to a new table in DynamoDB.

1. Open the DynamoDB console at [https://console.aws.amazon.com/dynamodb/](https://console.aws.amazon.com/dynamodb/).

1. Choose **Create Table**.

1. On the **Create DynamoDB table** page, do the following:

   1. In **Table name**, type **Features**.

   1. For **Primary key**, in the **Partition key** field, type **Id**. Set the data type to **Number**.

      Clear **Use Default Settings**. For **Provisioned Capacity**, type the following:
      + **Read Capacity Units**—`10`
      + **Write Capacity Units**—`10`

   Choose **Create**.

1. At the Hive prompt, enter the following HiveQL statement: 

   ```
   CREATE EXTERNAL TABLE ddb_features
       (feature_id   BIGINT,
       feature_name  STRING,
       feature_class STRING,
       state_alpha   STRING,
       prim_lat_dec  DOUBLE,
       prim_long_dec DOUBLE,
       elev_in_ft    BIGINT)
   STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
   TBLPROPERTIES(
       "dynamodb.table.name" = "Features",
       "dynamodb.column.mapping"="feature_id:Id,feature_name:Name,feature_class:Class,state_alpha:State,prim_lat_dec:Latitude,prim_long_dec:Longitude,elev_in_ft:Elevation"
   );
   ```

   You have now established a mapping between Hive and the Features table in DynamoDB.

1. Enter the following HiveQL statement to import data to DynamoDB:

   ```
   INSERT OVERWRITE TABLE ddb_features
   SELECT
       feature_id,
       feature_name,
       feature_class,
       state_alpha,
       prim_lat_dec,
       prim_long_dec,
       elev_in_ft
   FROM hive_features;
   ```

   Hive will submit a MapReduce job, which will be processed by your Amazon EMR cluster. It will take several minutes to complete the job.

1. Verify that the data has been loaded into DynamoDB:

   1. In the DynamoDB console navigation pane, choose **Tables**.

   1. Choose the Features table, and then choose the **Items** tab to view the data.

**Next step**  
[Step 6: Query the data in the DynamoDB table](EMRforDynamoDB.Tutorial.QueryDataInDynamoDB.md)

# Step 6: Query the data in the DynamoDB table
<a name="EMRforDynamoDB.Tutorial.QueryDataInDynamoDB"></a>

In this step, you will use HiveQL to query the Features table in DynamoDB. Try the following Hive queries:

1. All of the feature types (`feature_class`) in alphabetical order:

   ```
   SELECT DISTINCT feature_class
   FROM ddb_features
   ORDER BY feature_class;
   ```

1. All of the lakes that begin with the letter "M":

   ```
   SELECT feature_name, state_alpha
   FROM ddb_features
   WHERE feature_class = 'Lake'
   AND feature_name LIKE 'M%'
   ORDER BY feature_name;
   ```

1. States with at least three features higher than a mile (5,280 feet):

   ```
   SELECT state_alpha, feature_class, COUNT(*)
   FROM ddb_features
   WHERE elev_in_ft > 5280
   GROUP BY state_alpha, feature_class
   HAVING COUNT(*) >= 3
   ORDER BY state_alpha, feature_class;
   ```

**Next step**  
[Step 7: (Optional) clean up](EMRforDynamoDB.Tutorial.CleanUp.md)

# Step 7: (Optional) clean up
<a name="EMRforDynamoDB.Tutorial.CleanUp"></a>

Now that you have completed the tutorial, you can continue reading this section to learn more about working with DynamoDB data in Amazon EMR. You might decide to keep your Amazon EMR cluster up and running while you do this.

If you don't need the cluster anymore, you should terminate it and remove any associated resources. This will help you avoid being charged for resources you don't need.

1. Terminate the Amazon EMR cluster:

   1. Open the Amazon EMR console at [https://console.aws.amazon.com/emr/](https://console.aws.amazon.com/emr/).

   1. Choose the Amazon EMR cluster, choose **Terminate**, and then confirm.

1. Delete the Features table in DynamoDB:

   1. Open the DynamoDB console at [https://console.aws.amazon.com/dynamodb/](https://console.aws.amazon.com/dynamodb/).

   1. In the navigation pane, choose **Tables**.

   1. Choose the Features table. From the **Actions** menu, choose **Delete Table**.

1. Delete the Amazon S3 bucket containing the Amazon EMR log files:

   1. Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

   1. From the list of buckets, choose `aws-logs-accountID-region`, where *accountID* is your AWS account number and *region* is the region in which you launched the cluster.

   1. From the **Action** menu, choose **Delete**.

# Creating an external table in Hive
<a name="EMRforDynamoDB.ExternalTableForDDB"></a>

In [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md), you created an external Hive table that mapped to a DynamoDB table. When you issued HiveQL statements against the external table, the read and write operations were passed through to the DynamoDB table.

You can think of an external table as a pointer to a data source that is managed and stored elsewhere. In this case, the underlying data source is a DynamoDB table. (The table must already exist. You cannot create, update, or delete a DynamoDB table from within Hive.) You use the `CREATE EXTERNAL TABLE` statement to create the external table. After that, you can use HiveQL to work with data in DynamoDB, as if that data were stored locally within Hive.

**Note**  
You can use `INSERT` statements to insert data into an external table and `SELECT` statements to select data from it. However, you cannot use `UPDATE` or `DELETE` statements to manipulate data in the table.

If you no longer need the external table, you can remove it using the `DROP TABLE` statement. In this case, `DROP TABLE` only removes the external table in Hive. It does not affect the underlying DynamoDB table or any of its data.

**Topics**
+ [CREATE EXTERNAL TABLE syntax](#EMRforDynamoDB.ExternalTableForDDB.Syntax)
+ [Data type mappings](#EMRforDynamoDB.ExternalTableForDDB.DataTypes)

## CREATE EXTERNAL TABLE syntax
<a name="EMRforDynamoDB.ExternalTableForDDB.Syntax"></a>

The following shows the HiveQL syntax for creating an external Hive table that maps to a DynamoDB table:

```
CREATE EXTERNAL TABLE hive_table
    (hive_column1_name hive_column1_datatype, hive_column2_name hive_column2_datatype...)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name" = "dynamodb_table",
    "dynamodb.column.mapping" = "hive_column1_name:dynamodb_attribute1_name,hive_column2_name:dynamodb_attribute2_name..."
);
```

Line 1 is the start of the `CREATE EXTERNAL TABLE` statement, where you provide the name of the Hive table (*hive\_table*) you want to create.

Line 2 specifies the columns and data types for *hive\_table*. You need to define columns and data types that correspond to the attributes in the DynamoDB table. 

Line 3 is the `STORED BY` clause, where you specify a class that handles data management between Hive and the DynamoDB table. For DynamoDB, `STORED BY` should be set to `'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'`. 

Line 4 is the start of the `TBLPROPERTIES` clause, where you define the following parameters for `DynamoDBStorageHandler`: 
+ `dynamodb.table.name`—the name of the DynamoDB table.
+ `dynamodb.column.mapping`—pairs of column names in the Hive table and their corresponding attributes in the DynamoDB table. Each pair is of the form *hive\_column\_name:dynamodb\_attribute\_name*, and the pairs are separated by commas.

Note the following:
+ The Hive table name does not have to be the same as the DynamoDB table name.
+ The Hive table column names do not have to be the same as those in the DynamoDB table.
+ The table specified by `dynamodb.table.name` must exist in DynamoDB.
+ For `dynamodb.column.mapping`:
  + You must map the key schema attributes for the DynamoDB table. This includes the partition key and the sort key (if present).
  + You do not have to map the non-key attributes of the DynamoDB table. However, you will not see any data from those attributes when you query the Hive table.
  + If the data types of a Hive table column and a DynamoDB attribute are incompatible, you will see `NULL` in these columns when you query the Hive table.

**Note**  
The `CREATE EXTERNAL TABLE` statement does not perform any validation on the `TBLPROPERTIES` clause. The values you provide for `dynamodb.table.name` and `dynamodb.column.mapping` are only evaluated by the `DynamoDBStorageHandler` class when you attempt to access the table.
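The mapping value itself is just a comma-separated list of `hive_column:DynamoDBAttribute` pairs. The following Python sketch (the helper names are illustrative, not part of any API) shows how such a string can be assembled and parsed:

```python
def build_column_mapping(pairs):
    """Build a dynamodb.column.mapping value from
    (hive_column, dynamodb_attribute) pairs."""
    return ",".join(f"{hive}:{ddb}" for hive, ddb in pairs)

def parse_column_mapping(mapping):
    """Parse a mapping string back into a dict of
    hive_column -> dynamodb_attribute."""
    return dict(entry.split(":", 1) for entry in mapping.split(","))

mapping = build_column_mapping([
    ("feature_id", "Id"),
    ("feature_name", "Name"),
])
# mapping == "feature_id:Id,feature_name:Name"
```

Because `CREATE EXTERNAL TABLE` does not validate these properties, a check like `parse_column_mapping` on your own mapping string can catch typos before you run a query.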

## Data type mappings
<a name="EMRforDynamoDB.ExternalTableForDDB.DataTypes"></a>

The following table shows DynamoDB data types and compatible Hive data types:



| DynamoDB Data Type | Hive Data Type | 
| --- | --- | 
|  String  |  `STRING`  | 
|  Number  |  `BIGINT` or `DOUBLE`  | 
|  Binary  |  `BINARY`  | 
|  String Set  |  `ARRAY<STRING>`  | 
|  Number Set  |  `ARRAY<BIGINT>` or `ARRAY<DOUBLE>`  | 
|  Binary Set  |  `ARRAY<BINARY>`  | 

**Note**  
The following DynamoDB data types are not supported by the `DynamoDBStorageHandler` class, so they cannot be used with `dynamodb.column.mapping`:  
Map  
List  
Boolean  
Null  
However, if you need to work with these data types, you can create a single entity called `item` that represents the entire DynamoDB item as a map of strings for both keys and values in the map. For more information, see [Copying data without a column mapping](EMRforDynamoDB.CopyingData.S3.md#EMRforDynamoDB.CopyingData.S3.NoColumnMapping).

If you want to map a DynamoDB attribute of type Number, you must choose an appropriate Hive type:
+ The Hive `BIGINT` type is for 8-byte signed integers. It is the same as the `long` data type in Java.
+ The Hive `DOUBLE` type is for 8-byte double precision floating point numbers. It is the same as the `double` type in Java.

If you have numeric data stored in DynamoDB that has a higher precision than the Hive data type you choose, then accessing the DynamoDB data could cause a loss of precision. 
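A quick way to see this effect uses Python's `float`, which is an 8-byte IEEE-754 double like Hive's `DOUBLE`; the high-precision value below is hypothetical:

```python
from decimal import Decimal

# A DynamoDB Number can hold up to 38 significant digits; an 8-byte
# IEEE-754 double holds only about 15-17.
stored = Decimal("12345678901234567890.123456789")  # hypothetical attribute value
as_double = float(stored)        # what a Hive DOUBLE column would see
lost = Decimal(as_double) != stored
# lost is True: the low-order digits cannot be represented
```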

If you export data of type Binary from DynamoDB to Amazon S3 or HDFS, the data is stored as a Base64-encoded string. If you import data from Amazon S3 or HDFS into the DynamoDB Binary type, you must ensure the data is encoded as a Base64 string.
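A round trip of this encoding can be sketched with Python's standard `base64` module (the payload bytes are arbitrary):

```python
import base64

raw = b"\x00\x01\xfeexample payload"   # arbitrary binary attribute value

# Export: Binary data lands in S3/HDFS as a Base64-encoded string.
encoded = base64.b64encode(raw).decode("ascii")

# Import: data destined for a Binary attribute must decode cleanly.
decoded = base64.b64decode(encoded)
assert decoded == raw
```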

# Processing HiveQL statements
<a name="EMRforDynamoDB.ProcessingHiveQL"></a>

Hive is an application that runs on Hadoop, which is a batch-oriented framework for running MapReduce jobs. When you issue a HiveQL statement, Hive determines whether it can return the results immediately or whether it must submit a MapReduce job.

For example, consider the *ddb\_features* table (from [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md)). The following Hive query returns state abbreviations and the number of summits in each:

```
SELECT state_alpha, count(*)
FROM ddb_features
WHERE feature_class = 'Summit'
GROUP BY state_alpha;
```

Hive does not return the results immediately. Instead, it submits a MapReduce job, which is processed by the Hadoop framework. Hive will wait until the job is complete before it shows the results from the query:

```
AK  2
AL  2
AR  2
AZ  3
CA  7
CO  2
CT  2
ID  1
KS  1
ME  2
MI  1
MT  3
NC  1
NE  1
NM  1
NY  2
OR  5
PA  1
TN  1
TX  1
UT  4
VA  1
VT  2
WA  2
WY  3
Time taken: 8.753 seconds, Fetched: 25 row(s)
```

## Monitoring and canceling jobs
<a name="EMRforDynamoDB.MonitorAndCancelJob"></a>

When Hive launches a Hadoop job, it prints output from that job. The job completion status is updated as the job progresses. In some cases, the status might not be updated for a long time. (This can happen when you are querying a large DynamoDB table that has a low provisioned read capacity setting.)

If you need to cancel the job before it is complete, you can type **Ctrl+C** at any time.

# Querying data in DynamoDB
<a name="EMRforDynamoDB.Querying"></a>

The following examples show some ways that you can use HiveQL to query data stored in DynamoDB.

These examples refer to the *ddb\_features* table in the tutorial ([Step 5: Copy data to DynamoDB](EMRforDynamoDB.Tutorial.CopyDataToDDB.md)).

**Topics**
+ [Using aggregate functions](#EMRforDynamoDB.Querying.AggregateFunctions)
+ [Using the GROUP BY and HAVING clauses](#EMRforDynamoDB.Querying.GroupByAndHaving)
+ [Joining two DynamoDB tables](#EMRforDynamoDB.Querying.JoiningTwoTables)
+ [Joining tables from different sources](#EMRforDynamoDB.Querying.JoiningTablesFromDifferentSources)

## Using aggregate functions
<a name="EMRforDynamoDB.Querying.AggregateFunctions"></a>

HiveQL provides built-in functions for summarizing data values. For example, you can use the `MAX` function to find the largest value for a selected column. The following example returns the elevation of the highest feature in the state of Colorado.

```
SELECT MAX(elev_in_ft)
FROM ddb_features
WHERE state_alpha = 'CO';
```

## Using the GROUP BY and HAVING clauses
<a name="EMRforDynamoDB.Querying.GroupByAndHaving"></a>

You can use the `GROUP BY` clause to collect data across multiple records. This is often used with an aggregate function such as `SUM`, `COUNT`, `MIN`, or `MAX`. You can also use the `HAVING` clause to discard any results that do not meet certain criteria.

The following example returns the highest elevation in each state that has five or more features in the *ddb\_features* table.

```
SELECT state_alpha, max(elev_in_ft)
FROM ddb_features
GROUP BY state_alpha
HAVING count(*) >= 5;
```

## Joining two DynamoDB tables
<a name="EMRforDynamoDB.Querying.JoiningTwoTables"></a>

The following example maps another Hive table (*ddb\_east\_coast\_states*) to a table in DynamoDB. The `SELECT` statement is a join across these two tables. The join is computed on the cluster and returned; it does not take place in DynamoDB. 

Consider a DynamoDB table named EastCoastStates that contains the following data:

```
StateName       StateAbbrev

Maine           ME
New Hampshire   NH
Massachusetts   MA
Rhode Island    RI
Connecticut     CT
New York        NY
New Jersey      NJ
Delaware        DE
Maryland        MD
Virginia        VA
North Carolina  NC
South Carolina  SC
Georgia         GA
Florida         FL
```

Let's assume the table is available as a Hive external table named *ddb\_east\_coast\_states*:

```
CREATE EXTERNAL TABLE ddb_east_coast_states (state_name STRING, state_alpha STRING)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "EastCoastStates",
"dynamodb.column.mapping" = "state_name:StateName,state_alpha:StateAbbrev");
```

The following join returns the states on the East Coast of the United States that have at least three features:

```
SELECT ecs.state_name, f.feature_class, COUNT(*)
FROM ddb_east_coast_states ecs
JOIN ddb_features f on ecs.state_alpha = f.state_alpha
GROUP BY ecs.state_name, f.feature_class
HAVING COUNT(*) >= 3;
```

## Joining tables from different sources
<a name="EMRforDynamoDB.Querying.JoiningTablesFromDifferentSources"></a>

In the following example, *s3\_east\_coast\_states* is a Hive table associated with a CSV file stored in Amazon S3, and the *ddb\_features* table is associated with data in DynamoDB. The query below joins these two tables, returning the geographic features from states whose names begin with "New."

```
CREATE EXTERNAL TABLE s3_east_coast_states (state_name STRING, state_alpha STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/path/subpath/';
```

```
SELECT ecs.state_name, f.feature_name, f.feature_class
FROM s3_east_coast_states ecs
JOIN ddb_features f
ON ecs.state_alpha = f.state_alpha
WHERE ecs.state_name LIKE 'New%';
```

# Copying data to and from Amazon DynamoDB
<a name="EMRforDynamoDB.CopyingData"></a>

In the [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md), you copied data from a native Hive table into an external DynamoDB table, and then queried the external DynamoDB table. The table is external because it exists outside of Hive. Even if you drop the Hive table that maps to it, the table in DynamoDB is not affected.

Hive is an excellent solution for copying data among DynamoDB tables, Amazon S3 buckets, native Hive tables, and Hadoop Distributed File System (HDFS). This section provides examples of these operations.

**Topics**
+ [Copying data between DynamoDB and a native Hive table](EMRforDynamoDB.CopyingData.NativeHive.md)
+ [Copying data between DynamoDB and Amazon S3](EMRforDynamoDB.CopyingData.S3.md)
+ [Copying data between DynamoDB and HDFS](EMRforDynamoDB.CopyingData.HDFS.md)
+ [Using data compression](EMRforDynamoDB.CopyingData.Compression.md)
+ [Reading non-printable UTF-8 character data](EMRforDynamoDB.CopyingData.NonPrintableData.md)

# Copying data between DynamoDB and a native Hive table
<a name="EMRforDynamoDB.CopyingData.NativeHive"></a>

If you have data in a DynamoDB table, you can copy the data to a native Hive table. This will give you a snapshot of the data, as of the time you copied it. 

You might decide to do this if you need to perform many HiveQL queries, but do not want to consume provisioned throughput capacity from DynamoDB. Because the data in the native Hive table is a copy of the data from DynamoDB, and not "live" data, your queries should not expect that the data is up-to-date.

**Note**  
The examples in this section are written with the assumption that you followed the steps in [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md) and have an external table in DynamoDB named *ddb\_features*. 

**Example From DynamoDB to native Hive table**  
You can create a native Hive table and populate it with data from *ddb\_features*, like this:  

```
CREATE TABLE features_snapshot AS
SELECT * FROM ddb_features;
```
You can then refresh the data at any time:  

```
INSERT OVERWRITE TABLE features_snapshot
SELECT * FROM ddb_features;
```
In these examples, the subquery `SELECT * FROM ddb_features` retrieves all of the data from *ddb\_features*. If you only want to copy a subset of the data, you can use a `WHERE` clause in the subquery.  
The following example creates a native Hive table, containing only some of the attributes for lakes and summits:  

```
CREATE TABLE lakes_and_summits AS
SELECT feature_name, feature_class, state_alpha
FROM ddb_features
WHERE feature_class IN ('Lake','Summit');
```

**Example From native Hive table to DynamoDB**  
Use the following HiveQL statement to copy the data from the native Hive table to *ddb\_features*:  

```
INSERT OVERWRITE TABLE ddb_features
SELECT * FROM features_snapshot;
```

# Copying data between DynamoDB and Amazon S3
<a name="EMRforDynamoDB.CopyingData.S3"></a>

If you have data in a DynamoDB table, you can use Hive to copy the data to an Amazon S3 bucket.

You might do this if you want to create an archive of data in your DynamoDB table. For example, suppose you have a test environment where you need to work with a baseline set of test data in DynamoDB. You can copy the baseline data to an Amazon S3 bucket, and then run your tests. Afterward, you can reset the test environment by restoring the baseline data from the Amazon S3 bucket to DynamoDB.

If you worked through [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md), then you already have an Amazon S3 bucket that contains your Amazon EMR logs. You can use this bucket for the examples in this section, if you know the root path for the bucket:

1. Open the Amazon EMR console at [https://console.aws.amazon.com/emr](https://console.aws.amazon.com/emr/).

1. For **Name**, choose your cluster.

1. The URI is listed in **Log URI** under **Configuration Details**.

1. Make a note of the root path of the bucket. The naming convention is:

   `s3://aws-logs-accountID-region`

   where *accountID* is your AWS account ID and *region* is the AWS Region for the bucket.

**Note**  
For these examples, we will use a subpath within the bucket, as in this example:  
 `s3://aws-logs-123456789012-us-west-2/hive-test`

The following procedures are written with the assumption that you followed the steps in the tutorial and have an external table in DynamoDB named *ddb_features*.

**Topics**
+ [Copying data using the Hive default format](#EMRforDynamoDB.CopyingData.S3.DefaultFormat)
+ [Copying data with a user-specified format](#EMRforDynamoDB.CopyingData.S3.UserSpecifiedFormat)
+ [Copying data without a column mapping](#EMRforDynamoDB.CopyingData.S3.NoColumnMapping)
+ [Viewing the data in Amazon S3](#EMRforDynamoDB.CopyingData.S3.ViewingData)

## Copying data using the Hive default format
<a name="EMRforDynamoDB.CopyingData.S3.DefaultFormat"></a>

**Example From DynamoDB to Amazon S3**  
Use an `INSERT OVERWRITE` statement to write directly to Amazon S3.  

```
INSERT OVERWRITE DIRECTORY 's3://aws-logs-123456789012-us-west-2/hive-test'
SELECT * FROM ddb_features;
```
The data file in Amazon S3 looks like this:  

```
920709^ASoldiers Farewell Hill^ASummit^ANM^A32.3564729^A-108.3300461^A6135
1178153^AJones Run^AStream^APA^A41.2120086^A-79.2592078^A1260
253838^ASentinel Dome^ASummit^ACA^A37.7229821^A-119.58433^A8133
264054^ANeversweet Gulch^AValley^ACA^A41.6565269^A-122.8361432^A2900
115905^AChacaloochee Bay^ABay^AAL^A30.6979676^A-87.9738853^A0
```
Each field is separated by an SOH character (start of heading, 0x01). In the file, SOH appears as **^A**.
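
If you need to inspect this file outside of Hive, splitting each line on the 0x01 delimiter recovers the columns. The following sketch (plain Python, using one row from the sample above) assumes the seven-column *ddb_features* layout from the tutorial:

```python
# Split one line of Hive's default output on the SOH (0x01) delimiter.
# Column names follow the ddb_features table from the tutorial.
columns = ["feature_id", "feature_name", "feature_class", "state_alpha",
           "prim_lat_dec", "prim_long_dec", "elev_in_ft"]

line = ("920709\x01Soldiers Farewell Hill\x01Summit\x01NM"
        "\x0132.3564729\x01-108.3300461\x016135")

record = dict(zip(columns, line.split("\x01")))
print(record["feature_name"])   # Soldiers Farewell Hill
print(record["elev_in_ft"])     # 6135
```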

**Example From Amazon S3 to DynamoDB**  

1. Create an external table pointing to the unformatted data in Amazon S3.

   ```
   CREATE EXTERNAL TABLE s3_features_unformatted
       (feature_id       BIGINT,
       feature_name      STRING ,
       feature_class     STRING ,
       state_alpha       STRING,
       prim_lat_dec      DOUBLE ,
       prim_long_dec     DOUBLE ,
       elev_in_ft        BIGINT)
   LOCATION 's3://aws-logs-123456789012-us-west-2/hive-test';
   ```

1. Copy the data to DynamoDB.

   ```
   INSERT OVERWRITE TABLE ddb_features
   SELECT * FROM s3_features_unformatted;
   ```

## Copying data with a user-specified format
<a name="EMRforDynamoDB.CopyingData.S3.UserSpecifiedFormat"></a>

If you want to specify your own field separator character, you can create an external table that maps to the Amazon S3 bucket. You might use this technique for creating data files with comma-separated values (CSV).

**Example From DynamoDB to Amazon S3**  

1. Create a Hive external table that maps to Amazon S3. When you do this, ensure that the data types are consistent with those of the DynamoDB external table.

   ```
   CREATE EXTERNAL TABLE s3_features_csv
       (feature_id       BIGINT,
       feature_name      STRING,
       feature_class     STRING,
       state_alpha       STRING,
       prim_lat_dec      DOUBLE,
       prim_long_dec     DOUBLE,
       elev_in_ft        BIGINT)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
   LOCATION 's3://aws-logs-123456789012-us-west-2/hive-test';
   ```

1. Copy the data from DynamoDB.

   ```
   INSERT OVERWRITE TABLE s3_features_csv
   SELECT * FROM ddb_features;
   ```
The data file in Amazon S3 looks like this:  

```
920709,Soldiers Farewell Hill,Summit,NM,32.3564729,-108.3300461,6135
1178153,Jones Run,Stream,PA,41.2120086,-79.2592078,1260
253838,Sentinel Dome,Summit,CA,37.7229821,-119.58433,8133
264054,Neversweet Gulch,Valley,CA,41.6565269,-122.8361432,2900
115905,Chacaloochee Bay,Bay,AL,30.6979676,-87.9738853,0
```
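
As a quick sanity check of the exported format, the file parses cleanly with any standard CSV reader. A minimal sketch in Python, using two rows from the sample above:

```python
import csv
import io

# Two rows from the exported file, parsed with the standard csv module.
data = ("920709,Soldiers Farewell Hill,Summit,NM,32.3564729,-108.3300461,6135\n"
        "115905,Chacaloochee Bay,Bay,AL,30.6979676,-87.9738853,0\n")

rows = list(csv.reader(io.StringIO(data)))
print(rows[0][1])   # Soldiers Farewell Hill
print(rows[1][6])   # 0  (elevation in feet)
```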

**Example From Amazon S3 to DynamoDB**  
With a single HiveQL statement, you can populate the DynamoDB table using the data from Amazon S3:  

```
INSERT OVERWRITE TABLE ddb_features
SELECT * FROM s3_features_csv;
```

## Copying data without a column mapping
<a name="EMRforDynamoDB.CopyingData.S3.NoColumnMapping"></a>

You can copy data from DynamoDB in a raw format and write it to Amazon S3 without specifying any data types or column mapping. You can use this method to create an archive of DynamoDB data and store it in Amazon S3.



**Example From DynamoDB to Amazon S3**  

1. Create an external table associated with your DynamoDB table. (There is no `dynamodb.column.mapping` in this HiveQL statement.)

   ```
   CREATE EXTERNAL TABLE ddb_features_no_mapping
       (item MAP<STRING, STRING>)
   STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
   TBLPROPERTIES ("dynamodb.table.name" = "Features");
   ```

   

1. Create another external table associated with your Amazon S3 bucket.

   ```
   CREATE EXTERNAL TABLE s3_features_no_mapping
       (item MAP<STRING, STRING>)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t'
   LINES TERMINATED BY '\n'
   LOCATION 's3://aws-logs-123456789012-us-west-2/hive-test';
   ```

1. Copy the data from DynamoDB to Amazon S3.

   ```
   INSERT OVERWRITE TABLE s3_features_no_mapping
   SELECT * FROM ddb_features_no_mapping;
   ```
The data file in Amazon S3 looks like this:  

```
Name^C{"s":"Soldiers Farewell Hill"}^BState^C{"s":"NM"}^BClass^C{"s":"Summit"}^BElevation^C{"n":"6135"}^BLatitude^C{"n":"32.3564729"}^BId^C{"n":"920709"}^BLongitude^C{"n":"-108.3300461"}
Name^C{"s":"Jones Run"}^BState^C{"s":"PA"}^BClass^C{"s":"Stream"}^BElevation^C{"n":"1260"}^BLatitude^C{"n":"41.2120086"}^BId^C{"n":"1178153"}^BLongitude^C{"n":"-79.2592078"}
Name^C{"s":"Sentinel Dome"}^BState^C{"s":"CA"}^BClass^C{"s":"Summit"}^BElevation^C{"n":"8133"}^BLatitude^C{"n":"37.7229821"}^BId^C{"n":"253838"}^BLongitude^C{"n":"-119.58433"}
Name^C{"s":"Neversweet Gulch"}^BState^C{"s":"CA"}^BClass^C{"s":"Valley"}^BElevation^C{"n":"2900"}^BLatitude^C{"n":"41.6565269"}^BId^C{"n":"264054"}^BLongitude^C{"n":"-122.8361432"}
Name^C{"s":"Chacaloochee Bay"}^BState^C{"s":"AL"}^BClass^C{"s":"Bay"}^BElevation^C{"n":"0"}^BLatitude^C{"n":"30.6979676"}^BId^C{"n":"115905"}^BLongitude^C{"n":"-87.9738853"}
```
Within each line, fields are separated by an STX character (start of text, 0x02), and each attribute name is separated from its value by an ETX character (end of text, 0x03). In the file, STX appears as **^B** and ETX appears as **^C**.
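
Outside of Hive, this file can be decoded by splitting each line on the 0x02 and 0x03 control characters; each attribute value is a small JSON document in DynamoDB's attribute-value format. A minimal Python sketch, using a shortened row from the sample above:

```python
import json

# One shortened record from the export: fields are separated by 0x02,
# and each attribute name is separated from its JSON value by 0x03.
line = ('Name\x03{"s":"Soldiers Farewell Hill"}\x02'
        'Class\x03{"s":"Summit"}\x02'
        'Elevation\x03{"n":"6135"}')

item = {}
for field in line.split("\x02"):
    name, value = field.split("\x03", 1)
    item[name] = json.loads(value)   # e.g. {"s": "..."} or {"n": "..."}

print(item["Name"]["s"])            # Soldiers Farewell Hill
print(int(item["Elevation"]["n"]))  # 6135
```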

**Example From Amazon S3 to DynamoDB**  
With a single HiveQL statement, you can populate the DynamoDB table using the data from Amazon S3:  

```
INSERT OVERWRITE TABLE ddb_features_no_mapping
SELECT * FROM s3_features_no_mapping;
```

## Viewing the data in Amazon S3
<a name="EMRforDynamoDB.CopyingData.S3.ViewingData"></a>

If you use SSH to connect to the leader node, you can use the AWS Command Line Interface (AWS CLI) to access the data that Hive wrote to Amazon S3.

The following steps are written with the assumption you have copied data from DynamoDB to Amazon S3 using one of the procedures in this section.

1. If you are currently at the Hive command prompt, exit to the Linux command prompt.

   ```
   hive> exit;
   ```

1. List the contents of the hive-test directory in your Amazon S3 bucket. (This is where Hive copied the data from DynamoDB.)

   ```
   aws s3 ls s3://aws-logs-123456789012-us-west-2/hive-test/
   ```

   The response should look similar to this:

   `2016-11-01 23:19:54 81983 000000_0` 

   The file name (*000000_0*) is system-generated.

1. (Optional) You can copy the data file from Amazon S3 to the local file system on the leader node. After you do this, you can use standard Linux command line utilities to work with the data in the file.

   ```
   aws s3 cp s3://aws-logs-123456789012-us-west-2/hive-test/000000_0 .
   ```

   The response should look similar to this:

   `download: s3://aws-logs-123456789012-us-west-2/hive-test/000000_0 to ./000000_0`
**Note**  
The local file system on the leader node has limited capacity. Do not use this command with files that are larger than the available space in the local file system.

# Copying data between DynamoDB and HDFS
<a name="EMRforDynamoDB.CopyingData.HDFS"></a>

If you have data in a DynamoDB table, you can use Hive to copy the data to the Hadoop Distributed File System (HDFS).

You might do this if you are running a MapReduce job that requires data from DynamoDB. If you copy the data from DynamoDB into HDFS, Hadoop can process it using all of the available nodes in the Amazon EMR cluster in parallel. When the MapReduce job is complete, you can then write the results from HDFS to DynamoDB.

In the following examples, Hive will read from and write to the following HDFS directory: `/user/hadoop/hive-test`

**Note**  
The examples in this section are written with the assumption that you followed the steps in [Tutorial: Working with Amazon DynamoDB and Apache Hive](EMRforDynamoDB.Tutorial.md) and have an external table in DynamoDB named *ddb_features*. 

**Topics**
+ [Copying data using the Hive default format](#EMRforDynamoDB.CopyingData.HDFS.DefaultFormat)
+ [Copying data with a user-specified format](#EMRforDynamoDB.CopyingData.HDFS.UserSpecifiedFormat)
+ [Copying data without a column mapping](#EMRforDynamoDB.CopyingData.HDFS.NoColumnMapping)
+ [Accessing the data in HDFS](#EMRforDynamoDB.CopyingData.HDFS.ViewingData)

## Copying data using the Hive default format
<a name="EMRforDynamoDB.CopyingData.HDFS.DefaultFormat"></a>

**Example From DynamoDB to HDFS**  
Use an `INSERT OVERWRITE` statement to write directly to HDFS.  

```
INSERT OVERWRITE DIRECTORY 'hdfs:///user/hadoop/hive-test'
SELECT * FROM ddb_features;
```
The data file in HDFS looks like this:  

```
920709^ASoldiers Farewell Hill^ASummit^ANM^A32.3564729^A-108.3300461^A6135
1178153^AJones Run^AStream^APA^A41.2120086^A-79.2592078^A1260
253838^ASentinel Dome^ASummit^ACA^A37.7229821^A-119.58433^A8133
264054^ANeversweet Gulch^AValley^ACA^A41.6565269^A-122.8361432^A2900
115905^AChacaloochee Bay^ABay^AAL^A30.6979676^A-87.9738853^A0
```
Each field is separated by an SOH character (start of heading, 0x01). In the file, SOH appears as **^A**.

**Example From HDFS to DynamoDB**  

1. Create an external table that maps to the unformatted data in HDFS.

   ```
   CREATE EXTERNAL TABLE hdfs_features_unformatted
       (feature_id       BIGINT,
       feature_name      STRING ,
       feature_class     STRING ,
       state_alpha       STRING,
       prim_lat_dec      DOUBLE ,
       prim_long_dec     DOUBLE ,
       elev_in_ft        BIGINT)
   LOCATION 'hdfs:///user/hadoop/hive-test';
   ```

1. Copy the data to DynamoDB.

   ```
   INSERT OVERWRITE TABLE ddb_features
   SELECT * FROM hdfs_features_unformatted;
   ```

## Copying data with a user-specified format
<a name="EMRforDynamoDB.CopyingData.HDFS.UserSpecifiedFormat"></a>

If you want to use a different field separator character, you can create an external table that maps to the HDFS directory. You might use this technique for creating data files with comma-separated values (CSV).

**Example From DynamoDB to HDFS**  

1. Create a Hive external table that maps to HDFS. When you do this, ensure that the data types are consistent with those of the DynamoDB external table.

   ```
   CREATE EXTERNAL TABLE hdfs_features_csv
       (feature_id       BIGINT,
       feature_name      STRING ,
       feature_class     STRING ,
       state_alpha       STRING,
       prim_lat_dec      DOUBLE ,
       prim_long_dec     DOUBLE ,
       elev_in_ft        BIGINT)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
   LOCATION 'hdfs:///user/hadoop/hive-test';
   ```

1. Copy the data from DynamoDB.

   ```
   INSERT OVERWRITE TABLE hdfs_features_csv
   SELECT * FROM ddb_features;
   ```
The data file in HDFS looks like this:  

```
920709,Soldiers Farewell Hill,Summit,NM,32.3564729,-108.3300461,6135
1178153,Jones Run,Stream,PA,41.2120086,-79.2592078,1260
253838,Sentinel Dome,Summit,CA,37.7229821,-119.58433,8133
264054,Neversweet Gulch,Valley,CA,41.6565269,-122.8361432,2900
115905,Chacaloochee Bay,Bay,AL,30.6979676,-87.9738853,0
```

**Example From HDFS to DynamoDB**  
With a single HiveQL statement, you can populate the DynamoDB table using the data from HDFS:  

```
INSERT OVERWRITE TABLE ddb_features
SELECT * FROM hdfs_features_csv;
```

## Copying data without a column mapping
<a name="EMRforDynamoDB.CopyingData.HDFS.NoColumnMapping"></a>

You can copy data from DynamoDB in a raw format and write it to HDFS without specifying any data types or column mapping. You can use this method to create an archive of DynamoDB data and store it in HDFS.



**Note**  
If your DynamoDB table contains attributes of type Map, List, Boolean, or Null, then this is the only way you can use Hive to copy data from DynamoDB to HDFS.

**Example From DynamoDB to HDFS**  

1. Create an external table associated with your DynamoDB table. (There is no `dynamodb.column.mapping` in this HiveQL statement.)

   ```
   CREATE EXTERNAL TABLE ddb_features_no_mapping
       (item MAP<STRING, STRING>)
   STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
   TBLPROPERTIES ("dynamodb.table.name" = "Features");
   ```

   

1. Create another external table associated with your HDFS directory.

   ```
   CREATE EXTERNAL TABLE hdfs_features_no_mapping
       (item MAP<STRING, STRING>)
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\t'
   LINES TERMINATED BY '\n'
   LOCATION 'hdfs:///user/hadoop/hive-test';
   ```

1. Copy the data from DynamoDB to HDFS.

   ```
   INSERT OVERWRITE TABLE hdfs_features_no_mapping
   SELECT * FROM ddb_features_no_mapping;
   ```
The data file in HDFS looks like this:  

```
Name^C{"s":"Soldiers Farewell Hill"}^BState^C{"s":"NM"}^BClass^C{"s":"Summit"}^BElevation^C{"n":"6135"}^BLatitude^C{"n":"32.3564729"}^BId^C{"n":"920709"}^BLongitude^C{"n":"-108.3300461"}
Name^C{"s":"Jones Run"}^BState^C{"s":"PA"}^BClass^C{"s":"Stream"}^BElevation^C{"n":"1260"}^BLatitude^C{"n":"41.2120086"}^BId^C{"n":"1178153"}^BLongitude^C{"n":"-79.2592078"}
Name^C{"s":"Sentinel Dome"}^BState^C{"s":"CA"}^BClass^C{"s":"Summit"}^BElevation^C{"n":"8133"}^BLatitude^C{"n":"37.7229821"}^BId^C{"n":"253838"}^BLongitude^C{"n":"-119.58433"}
Name^C{"s":"Neversweet Gulch"}^BState^C{"s":"CA"}^BClass^C{"s":"Valley"}^BElevation^C{"n":"2900"}^BLatitude^C{"n":"41.6565269"}^BId^C{"n":"264054"}^BLongitude^C{"n":"-122.8361432"}
Name^C{"s":"Chacaloochee Bay"}^BState^C{"s":"AL"}^BClass^C{"s":"Bay"}^BElevation^C{"n":"0"}^BLatitude^C{"n":"30.6979676"}^BId^C{"n":"115905"}^BLongitude^C{"n":"-87.9738853"}
```
Within each line, fields are separated by an STX character (start of text, 0x02), and each attribute name is separated from its value by an ETX character (end of text, 0x03). In the file, STX appears as **^B** and ETX appears as **^C**.

**Example From HDFS to DynamoDB**  
With a single HiveQL statement, you can populate the DynamoDB table using the data from HDFS:  

```
INSERT OVERWRITE TABLE ddb_features_no_mapping
SELECT * FROM hdfs_features_no_mapping;
```

## Accessing the data in HDFS
<a name="EMRforDynamoDB.CopyingData.HDFS.ViewingData"></a>

HDFS is a distributed file system, accessible to all of the nodes in the Amazon EMR cluster. If you use SSH to connect to the leader node, you can use command line tools to access the data that Hive wrote to HDFS.

HDFS is not the same thing as the local file system on the leader node. You cannot work with files and directories in HDFS using standard Linux commands (such as `cat`, `cp`, `mv`, or `rm`). Instead, you perform these tasks using the `hadoop fs` command.

The following steps are written with the assumption you have copied data from DynamoDB to HDFS using one of the procedures in this section.

1. If you are currently at the Hive command prompt, exit to the Linux command prompt.

   ```
   hive> exit;
   ```

1. List the contents of the /user/hadoop/hive-test directory in HDFS. (This is where Hive copied the data from DynamoDB.)

   ```
   hadoop fs -ls /user/hadoop/hive-test
   ```

   The response should look similar to this:

   ```
   Found 1 items
   -rw-r--r-- 1 hadoop hadoop 29504 2016-06-08 23:40 /user/hadoop/hive-test/000000_0
   ```

   The file name (*000000_0*) is system-generated.

1. View the contents of the file:

   ```
   hadoop fs -cat /user/hadoop/hive-test/000000_0
   ```
**Note**  
In this example, the file is relatively small (approximately 29 KB). Be careful when you use this command with files that are very large or contain non-printable characters.

1. (Optional) You can copy the data file from HDFS to the local file system on the leader node. After you do this, you can use standard Linux command line utilities to work with the data in the file.

   ```
   hadoop fs -get /user/hadoop/hive-test/000000_0
   ```

   This command will not overwrite the file.
**Note**  
The local file system on the leader node has limited capacity. Do not use this command with files that are larger than the available space in the local file system.

# Using data compression
<a name="EMRforDynamoDB.CopyingData.Compression"></a>

When you use Hive to copy data among different data sources, you can request on-the-fly data compression. Hive provides several compression codecs; you can choose one during your Hive session, and the data that you copy is then compressed in the specified format. 

The following example compresses data using the Lempel-Ziv-Oberhumer (LZO) algorithm. 

```
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec = com.hadoop.compression.lzo.LzopCodec;

CREATE EXTERNAL TABLE lzo_compression_table (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE lzo_compression_table SELECT *
FROM hiveTableName;
```

The resulting file in Amazon S3 will have a system-generated name with `.lzo` at the end (for example, `8d436957-57ba-4af7-840c-96c2fc7bb6f5-000000.lzo`).

The available compression codecs are:
+ `org.apache.hadoop.io.compress.GzipCodec`
+ `org.apache.hadoop.io.compress.DefaultCodec`
+ `com.hadoop.compression.lzo.LzoCodec`
+ `com.hadoop.compression.lzo.LzopCodec`
+ `org.apache.hadoop.io.compress.BZip2Codec`
+ `org.apache.hadoop.io.compress.SnappyCodec`

# Reading non-printable UTF-8 character data
<a name="EMRforDynamoDB.CopyingData.NonPrintableData"></a>

To read and write non-printable UTF-8 character data, you can use the `STORED AS SEQUENCEFILE` clause when you create a Hive table. A SequenceFile is a Hadoop binary file format. You need to use Hadoop to read this file. The following example shows how to export data from DynamoDB into Amazon S3. You can use this functionality to handle non-printable UTF-8 encoded characters. 

```
CREATE EXTERNAL TABLE s3_export(a_col string, b_col bigint, c_col array<string>)
STORED AS SEQUENCEFILE
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE s3_export SELECT *
FROM hiveTableName;
```

# Performance tuning
<a name="EMRforDynamoDB.PerformanceTuning"></a>

When you create a Hive external table that maps to a DynamoDB table, you do not consume any read or write capacity from DynamoDB. However, read and write activity on the Hive table (such as `INSERT` or `SELECT`) translates directly into read and write operations on the underlying DynamoDB table.

Apache Hive on Amazon EMR implements its own logic for balancing the I/O load on the DynamoDB table and seeks to minimize the possibility of exceeding the table's provisioned throughput. At the end of each Hive query, Amazon EMR returns runtime metrics, including the number of times your provisioned throughput was exceeded. You can use this information, together with CloudWatch metrics on your DynamoDB table, to improve performance in subsequent requests.

The Amazon EMR console provides basic monitoring tools for your cluster. For more information, see [View and Monitor a Cluster](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-manage-view.html) in the *Amazon EMR Management Guide*.

You can also monitor your cluster and Hadoop jobs using web-based tools, such as Hue, Ganglia, and the Hadoop web interface. For more information, see [View Web Interfaces Hosted on Amazon EMR Clusters](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-web-interfaces.html) in the *Amazon EMR Management Guide*.

This section describes steps you can take to performance-tune Hive operations on external DynamoDB tables. 

**Topics**
+ [DynamoDB provisioned throughput](EMRforDynamoDB.PerformanceTuning.Throughput.md)
+ [Adjusting the mappers](EMRforDynamoDB.PerformanceTuning.Mappers.md)
+ [Additional topics](EMRforDynamoDB.PerformanceTuning.Misc.md)

# DynamoDB provisioned throughput
<a name="EMRforDynamoDB.PerformanceTuning.Throughput"></a>

When you issue HiveQL statements against the external DynamoDB table, the `DynamoDBStorageHandler` class makes the appropriate low-level DynamoDB API requests, which consume provisioned throughput. If there is not enough read or write capacity on the DynamoDB table, the request will be throttled, resulting in slow HiveQL performance. For this reason, you should ensure that the table has enough throughput capacity.

For example, suppose that you have provisioned 100 read capacity units for your DynamoDB table. This will let you read 409,600 bytes per second (100 × 4 KB read capacity unit size). Now suppose that the table contains 20 GB of data (21,474,836,480 bytes) and you want to use the `SELECT` statement to select all of the data using HiveQL. You can estimate how long the query will take to run like this:

*21,474,836,480 / 409,600 = 52,429 seconds = 14.56 hours*

In this scenario, the DynamoDB table is a bottleneck. It won't help to add more Amazon EMR nodes, because the Hive throughput is constrained to only 409,600 bytes per second. The only way to decrease the time required for the `SELECT` statement is to increase the provisioned read capacity of the DynamoDB table. 

You can perform a similar calculation to estimate how long it would take to bulk-load data into a Hive external table mapped to a DynamoDB table. Determine the number of write capacity units required per item (items of up to 1 KB require 1, up to 2 KB require 2, and so on), and multiply by the number of items to load. This gives the total number of write capacity units required. Divide that total by the number of write capacity units provisioned per second to get the number of seconds the load will take.
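
These estimates reduce to simple arithmetic. The following sketch reproduces the 20 GB read example and applies the same method to a hypothetical bulk load (the item size, item count, and provisioned write capacity are invented for illustration):

```python
import math

# Read estimate: 20 GB table, 100 provisioned read capacity units.
table_bytes = 21_474_836_480
read_bytes_per_sec = 100 * 4 * 1024          # 100 RCUs x 4 KB each = 409,600
read_secs = table_bytes / read_bytes_per_sec
print(round(read_secs), round(read_secs / 3600, 2))   # 52429 14.56

# Write estimate (hypothetical): 1,000,000 items of 1.5 KB each,
# with 1,000 write capacity units provisioned.
wcu_per_item = math.ceil(1.5)                # items of 1-2 KB cost 2 WCUs
total_wcus = wcu_per_item * 1_000_000
print(total_wcus / 1_000)                    # 2000.0 seconds to load
```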

You should regularly monitor the CloudWatch metrics for your table. For a quick overview in the DynamoDB console, choose your table and then choose the **Metrics** tab. From here, you can view read and write capacity units consumed and read and write requests that have been throttled.

## Read capacity
<a name="EMRforDynamoDB.PerformanceTuning.Throughput.ReadCapacity"></a>

Amazon EMR manages the request load against your DynamoDB table, according to the table's provisioned throughput settings. However, if you notice a large number of `ProvisionedThroughputExceeded` messages in the job output, you can adjust the default read rate. To do this, you can modify the `dynamodb.throughput.read.percent` configuration variable. You can use the `SET` command to set this variable at the Hive command prompt:

```
SET dynamodb.throughput.read.percent=1.0;
```

This variable persists for the current Hive session only. If you exit Hive and return to it later, `dynamodb.throughput.read.percent` will return to its default value.

The value of `dynamodb.throughput.read.percent` can be between `0.1` and `1.5`, inclusive. `0.5` represents the default read rate, meaning that Hive will attempt to consume half of the read capacity of the table. If you increase the value above `0.5`, Hive will increase the request rate; decreasing the value below `0.5` decreases the read request rate. (The actual read rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)

If you notice that Hive is frequently depleting the provisioned read capacity of the table, or if your read requests are being throttled too much, try reducing `dynamodb.throughput.read.percent` below `0.5`. If you have sufficient read capacity in the table and want more responsive HiveQL operations, you can set the value above `0.5`.
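
To see what a given setting means in practice, you can translate it into the approximate read rate that Hive targets. A hypothetical illustration for a table with 100 provisioned read capacity units:

```python
# Approximate read rate Hive targets at each setting of
# dynamodb.throughput.read.percent, for a table with 100 RCUs.
rcus = 100
rcu_bytes = 4 * 1024                          # each RCU covers a 4 KB read

for pct in (0.1, 0.5, 1.0, 1.5):
    print(pct, int(pct * rcus * rcu_bytes))   # bytes per second
```

At the default of `0.5`, this works out to 204,800 bytes per second, which is half of the table's 409,600 bytes-per-second read capacity.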

## Write capacity
<a name="EMRforDynamoDB.PerformanceTuning.Throughput.WriteCapacity"></a>

Amazon EMR manages the request load against your DynamoDB table, according to the table's provisioned throughput settings. However, if you notice a large number of `ProvisionedThroughputExceeded` messages in the job output, you can adjust the default write rate. To do this, you can modify the `dynamodb.throughput.write.percent` configuration variable. You can use the `SET` command to set this variable at the Hive command prompt:

```
SET dynamodb.throughput.write.percent=1.0;
```

This variable persists for the current Hive session only. If you exit Hive and return to it later, `dynamodb.throughput.write.percent` will return to its default value.

The value of `dynamodb.throughput.write.percent` can be between `0.1` and `1.5`, inclusive. `0.5` represents the default write rate, meaning that Hive will attempt to consume half of the write capacity of the table. If you increase the value above `0.5`, Hive will increase the request rate; decreasing the value below `0.5` decreases the write request rate. (The actual write rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)

If you notice that Hive is frequently depleting the provisioned write capacity of the table, or if your write requests are being throttled too much, try reducing `dynamodb.throughput.write.percent` below `0.5`. If you have sufficient capacity in the table and want more responsive HiveQL operations, you can set the value above `0.5`.

When you write data to DynamoDB using Hive, ensure that the number of write capacity units is greater than the number of mappers in the cluster. For example, consider an Amazon EMR cluster consisting of 10 *m1.xlarge* nodes. The *m1.xlarge* node type provides 8 mapper tasks, so the cluster would have a total of 80 mappers (10 × 8). If your DynamoDB table has fewer than 80 write capacity units, then a Hive write operation could consume all of the write throughput for that table.
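
The comparison in the example above is a straightforward count of total mappers versus provisioned write capacity units. A sketch (the table's write capacity here is hypothetical):

```python
# 10 m1.xlarge nodes x 8 mapper tasks per node, as in the example above.
nodes, mappers_per_node = 10, 8
total_mappers = nodes * mappers_per_node
print(total_mappers)                  # 80

# A table with fewer write capacity units than mappers is at risk of
# having a single Hive write consume all of its write throughput.
table_wcus = 50                       # hypothetical provisioning
print(table_wcus < total_mappers)     # True -> undersized for this cluster
```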

To determine the number of mappers for Amazon EMR node types, see [Task Configuration](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html) in the *Amazon EMR Developer Guide*.

For more information on mappers, see [Adjusting the mappers](EMRforDynamoDB.PerformanceTuning.Mappers.md).

# Adjusting the mappers
<a name="EMRforDynamoDB.PerformanceTuning.Mappers"></a>

When Hive launches a Hadoop job, the job is processed by one or more mapper tasks. Assuming that your DynamoDB table has sufficient throughput capacity, you can modify the number of mappers in the cluster, potentially improving performance.

**Note**  
The number of mapper tasks used in a Hadoop job is influenced by *input splits*, where Hadoop subdivides the data into logical blocks. If Hadoop does not perform enough input splits, your write operations might not be able to consume all of the write throughput available in the DynamoDB table. 

## Increasing the number of mappers
<a name="EMRforDynamoDB.PerformanceTuning.Mappers.Increasing"></a>

Each mapper in an Amazon EMR cluster has a maximum read rate of 1 MiB per second. The number of mappers in a cluster depends on the size of the nodes in your cluster. (For information about node sizes and the number of mappers per node, see [Task Configuration](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html) in the *Amazon EMR Developer Guide*.)

If your DynamoDB table has ample throughput capacity for reads, you can try increasing the number of mappers by doing one of the following:
+ Increase the size of the nodes in your cluster. For example, if your cluster is using *m1.large* nodes (three mappers per node), you can try upgrading to *m1.xlarge* nodes (eight mappers per node).
+ Increase the number of nodes in your cluster. For example, if you have a three-node cluster of *m1.xlarge* nodes, you have a total of 24 mappers available. If you were to double the size of the cluster, with the same node type, you would have 48 mappers.

You can use the AWS Management Console to manage the size or the number of nodes in your cluster. (You might need to restart the cluster for these changes to take effect.)

Another way to increase the number of mappers is to modify the `mapred.tasktracker.map.tasks.maximum` Hadoop configuration parameter. (This is a Hadoop parameter, not a Hive parameter; you cannot modify it interactively from the command prompt.) If you increase the value of `mapred.tasktracker.map.tasks.maximum`, you can increase the number of mappers without increasing the size or number of nodes. However, the cluster nodes can run out of memory if you set the value too high.

You set the value for `mapred.tasktracker.map.tasks.maximum` as a bootstrap action when you first launch your Amazon EMR cluster. For more information, see [(Optional) Create Bootstrap Actions to Install Additional Software](https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-bootstrap.html) in the *Amazon EMR Management Guide*.

## Decreasing the number of mappers
<a name="EMRforDynamoDB.PerformanceTuning.Mappers.Decreasing"></a>

If you use the `SELECT` statement to select data from an external Hive table that maps to DynamoDB, the Hadoop job can use as many tasks as necessary, up to the maximum number of mappers in the cluster. In this scenario, it is possible that a long-running Hive query can consume all of the provisioned read capacity of the DynamoDB table, negatively impacting other users.

You can use the `dynamodb.max.map.tasks` parameter to set an upper limit for map tasks:

```
SET dynamodb.max.map.tasks=1;
```

This value must be equal to or greater than 1. When Hive processes your query, the resulting Hadoop job will use no more map tasks than the value of `dynamodb.max.map.tasks` when reading from the DynamoDB table.

# Additional topics
<a name="EMRforDynamoDB.PerformanceTuning.Misc"></a>

The following are some more ways to tune applications that use Hive to access DynamoDB.

## Retry duration
<a name="EMRforDynamoDB.PerformanceTuning.Misc.RetryDuration"></a>

By default, Hive will rerun a Hadoop job if it has not returned any results from DynamoDB within two minutes. You can adjust this interval by modifying the `dynamodb.retry.duration` parameter:

```
SET dynamodb.retry.duration=2;
```

The value must be a nonzero integer, representing the number of minutes in the retry interval. The default for `dynamodb.retry.duration` is 2 (minutes).

## Parallel data requests
<a name="EMRforDynamoDB.PerformanceTuning.Misc.ParallelDataRequests"></a>

Multiple simultaneous data requests against a single table, whether from more than one user or more than one application, can drain provisioned read throughput and slow performance. 

## Process duration
<a name="EMRforDynamoDB.PerformanceTuning.Misc.ProcessDuration"></a>

Data consistency in DynamoDB depends on the order of read and write operations on each node. While a Hive query is in progress, another application might load new data into the DynamoDB table or modify or delete existing data. In this case, the results of the Hive query might not reflect changes made to the data while the query was running. 

## Request time
<a name="EMRforDynamoDB.PerformanceTuning.Misc.RequestTime"></a>

Scheduling Hive queries that access a DynamoDB table when there is lower demand on the DynamoDB table improves performance. For example, if most of your application's users live in San Francisco, you might choose to export daily data at 4:00 A.M. PST when the majority of users are asleep and not updating records in your DynamoDB database. 

