

 Amazon Redshift will no longer support the creation of new Python UDFs starting Patch 198. Existing Python UDFs will continue to function until June 30, 2026. For more information, see the [ blog post ](https://aws.amazon.com/blogs/big-data/amazon-redshift-python-user-defined-functions-will-reach-end-of-support-after-june-30-2026/). 

# Tutorials for Amazon Redshift ML
<a name="tutorials_for_amazon_redshift_ml"></a>

You can use Amazon Redshift ML to train machine learning models using SQL statements, and then invoke the models in SQL queries for prediction. Machine learning in Amazon Redshift trains a model with one SQL command. Amazon Redshift automatically launches a training job in Amazon SageMaker AI and generates a model. Once a model is created, you can perform predictions in Amazon Redshift using the model’s prediction function.

Follow the steps in these tutorials to learn about Amazon Redshift ML features:
+ [Tutorial: Building customer churn models](tutorial_customer_churn.md) – In this tutorial, you use Amazon Redshift ML to create a customer churn model with the CREATE MODEL command, and run prediction queries for user scenarios. Then, you implement queries using the SQL function that the CREATE MODEL command generates.
+ [Tutorial: Building K-means clustering models](tutorial_k-means_clustering.md) – In this tutorial, you use Amazon Redshift ML to create, train, and deploy a machine learning model based on the [K-means algorithm](url-sm-dev;k-means.html).
+ [Tutorial: Building multi-class classification models](tutorial_multi-class_classification.md) – In this tutorial, you use Amazon Redshift ML to create a machine learning model that solves multi-class classification problems. The multi-class classification algorithm classifies data points into one of three or more classes. Then, you implement queries using the SQL function that the CREATE MODEL command generates. 
+ [Tutorial: Building XGBoost models](tutorial_xgboost.md) – In this tutorial, you create a model with data from Amazon S3 and run prediction queries with the model using Amazon Redshift ML. The XGBoost algorithm is an optimized implementation of the gradient boosted trees algorithm. 
+ [Tutorial: Building regression models](tutorial_regression.md) – In this tutorial, you use Amazon Redshift ML to create a machine learning regression model and run prediction queries on the model. Regression models allow you to predict numerical outcomes, such as the price of a house, or how many people will use a city’s bike rental service. 
+ [Tutorial: Building regression models with linear learner](tutorial_linear_learner_regression.md) – In this tutorial, you create a linear learner model with data from Amazon S3 and run prediction queries with the model using Amazon Redshift ML. The SageMaker AI linear learner algorithm solves either regression or multi-class classification problems. 
+ [Tutorial: Building multi-class classification models with linear learner](tutorial_linear_learner_multi-class_classification.md) – In this tutorial, you create a linear learner model with data from Amazon S3, and then run prediction queries with the model using Amazon Redshift ML. The SageMaker AI linear learner algorithm solves either regression or classification problems. 

# Tutorial: Building customer churn models
<a name="tutorial_customer_churn"></a>

In this tutorial, you use Amazon Redshift ML to create a customer churn model with the CREATE MODEL command, and run prediction queries for user scenarios. Then, you implement queries using the SQL function that the CREATE MODEL command generates.

You can use a simple CREATE MODEL command to export training data, train a model, import the model, and prepare an Amazon Redshift prediction function. Use the CREATE MODEL statement to specify training data either as a table or SELECT statement.

This example uses historical information to construct a machine learning model of a mobile operator’s customer churn. First, SageMaker AI trains your machine learning model and then tests your model using the profile information of an arbitrary customer. After the model is validated, Amazon SageMaker AI deploys the model and the prediction function to Amazon Redshift. You can use the prediction function to predict whether a customer is going to churn or not.

## Use case examples
<a name="tutorial_customer_churn_tasks"></a>

You can solve other binary classification problems using Amazon Redshift ML, such as predicting if a sales lead will close or not. You could also predict whether a financial transaction is fraudulent or not.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Perform predictions with the model

## Prerequisites
<a name="tutorial_customer_churn_prereqs"></a>

To complete this tutorial, you must have the following prerequisites:
+ You must set up an Amazon Redshift cluster for Amazon Redshift ML. To do so, use the documentation for [Cluster and configure setup for Amazon Redshift ML administration.](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html)
+ The Amazon Redshift cluster that you use to create the model, and the Amazon S3 bucket that you use to stage the training data and store the model artifacts must be in the same AWS Region.
+ To download the SQL commands and the sample dataset used in this documentation, do one of the following:
  + Download the [SQL commands](https://s3.amazonaws.com/redshift-downloads/redshift-ml/tutorial-scripts/redshift-ml-tutorial.sql), [Customer activity file](https://s3.amazonaws.com/redshift-downloads/redshift-ml/customer_activity/customer_activity.csv), and [Abalone file](https://s3.amazonaws.com/redshift-downloads/redshift-ml/abalone_xg/abalone.csv).
  + Using the AWS CLI for Amazon S3, run the following command. You can use your own target path.

    ```
    aws s3 cp s3://redshift-downloads/redshift-ml/tutorial-scripts/redshift-ml-tutorial.sql </target/path>
    aws s3 cp s3://redshift-downloads/redshift-ml/customer_activity/customer_activity.csv </target/path>
    aws s3 cp s3://redshift-downloads/redshift-ml/abalone_xgb/abalone_xgb.csv </target/path>
    ```

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_customer_churn_step_load"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to edit and run queries and visualize results.

Running the following queries creates a table named `customer_activity` and ingests the sample dataset from Amazon S3.

```
DROP TABLE IF EXISTS customer_activity;

CREATE TABLE customer_activity (
state varchar(2),
account_length int,
area_code int,
phone varchar(8),
intl_plan varchar(3),
vMail_plan varchar(3),
vMail_message int,
day_mins float,
day_calls int,
day_charge float,
total_charge float,
eve_mins float,
eve_calls int,
eve_charge float,
night_mins float,
night_calls int,
night_charge float,
intl_mins float,
intl_calls int,
intl_charge float,
cust_serv_calls int,
churn varchar(6),
record_date date
);

COPY customer_activity
FROM 's3://redshift-downloads/redshift-ml/customer_activity/'
REGION 'us-east-1' IAM_ROLE default
FORMAT AS CSV IGNOREHEADER 1;
```

## Step 2: Create the machine learning model
<a name="tutorial_customer_churn_step_create_model"></a>

Churn is our target input in this model. All other inputs for the model are attributes that help to create a function to predict churn.

The following example uses the CREATE MODEL operation to deliver a model that predicts whether a customer will be active, using inputs such as the customer’s age, postal code, spending, and cases. In the following example, replace amzn-s3-demo-bucket with your own Amazon S3 bucket.

```
CREATE MODEL customer_churn_auto_model
FROM
    (
      SELECT state,
             account_length,
             area_code,
             total_charge/account_length AS average_daily_spend,
             cust_serv_calls/account_length AS average_daily_cases,
             churn
      FROM customer_activity
      WHERE  record_date < '2020-01-01'
     )
TARGET churn FUNCTION ml_fn_customer_churn_auto
IAM_ROLE default SETTINGS (
  S3_BUCKET 'amzn-s3-demo-bucket'
);
```

The SELECT query in the preceding example creates the training data. The TARGET clause specifies which column is the machine learning label that the CREATE MODEL operation uses to learn how to predict. The target column “churn” indicates whether the customer still has an active membership or has suspended the membership. The S3\$1BUCKET field is the name of the Amazon S3 bucket that you previously created. The Amazon S3 bucket is used to share training data and artifacts between Amazon Redshift and Amazon SageMaker AI. The remaining columns are the features that are used for the prediction.

For a summary of the syntax and features of a basic use case of the CREATE MODEL command, see [Simple CREATE MODEL](https://docs.aws.amazon.com/redshift/latest/dg/r_create_model_use_cases.html#r_simple_create_model).

### Add permissions for server-side encryption (optional)
<a name="tutorial_customer_churn_encryption"></a>

Amazon Redshift by default uses Amazon SageMaker AI Autopilot for training. In particular, Amazon Redshift securely exports the training data to the customer-specified Amazon S3 bucket. If you don’t specify a `KMS_KEY_ID`, then the data is encrypted using server-side encryption SSE-S3 by default.

When you encrypt your input using server-side encryption with a AWS KMS managed key (SSE-MMS), then add the following permissions:

```
{
    "Effect": "Allow",
    "Action": [
    "kms:Encrypt"
    "kms:Decrypt"
    ]
}
```

For more information about Amazon SageMaker AI roles, see [Amazon SageMaker AI roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) in the *Amazon SageMaker AI Developer Guide*.

### Check the status of model training (optional)
<a name="tutorial_customer_churn_check_status"></a>

You can use the SHOW MODEL command to know when your model is ready.

Use the following operation to check the status of the model.

```
SHOW MODEL customer_churn_auto_model;
```

The following is an example of the output of the previous operation.

```
+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|           Key            |                                                                             Value                                                                             |
+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        Model Name        |                                                                   customer_churn_auto_model                                                                   |
|       Schema Name        |                                                                            public                                                                             |
|          Owner           |                                                                            awsuser                                                                            |
|      Creation Time       |                                                                   Tue, 14.06.2022 17:15:52                                                                    |
|       Model State        |                                                                           TRAINING                                                                            |
|                          |                                                                                                                                                               |
|      TRAINING DATA:      |                                                                                                                                                               |
|          Query           | SELECT STATE, ACCOUNT_LENGTH, AREA_CODE, TOTAL_CHARGE / ACCOUNT_LENGTH AS AVERAGE_DAILY_SPEND, CUST_SERV_CALLS / ACCOUNT_LENGTH AS AVERAGE_DAILY_CASES, CHURN |
|                          |                                                                    FROM CUSTOMER_ACTIVITY                                                                     |
|                          |                                                               WHERE RECORD_DATE < '2020-01-01'                                                                |
|      Target Column       |                                                                             CHURN                                                                             |
|                          |                                                                                                                                                               |
|       PARAMETERS:        |                                                                                                                                                               |
|        Model Type        |                                                                             auto                                                                              |
|       Problem Type       |                                                                                                                                                               |
|        Objective         |                                                                                                                                                               |
|     AutoML Job Name      |                                                                redshiftml-20220614171552640901                                                                |
|      Function Name       |                                                                   ml_fn_customer_churn_auto                                                                   |
|   Function Parameters    |                                            state account_length area_code average_daily_spend average_daily_cases                                             |
| Function Parameter Types |                                                                 varchar int4 int4 float8 int4                                                                 |
|         IAM Role         |                                                                     default-aws-iam-role                                                                      |
|        S3 Bucket         |                                                                        amzn-s3-demo-bucket                                                                     |
|       Max Runtime        |                                                                             5400                                                                              |
+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

When the model training is complete, the `model_state` variable becomes `Model is Ready`, and the prediction function becomes available.

## Step 3: Perform predictions with the model
<a name="tutorial_customer_churn_step_perform_predictions"></a>

You can use SQL statements to view the predictions made by the prediction model. In this example, the prediction function created by the CREATE MODEL operation is named `ml_fn_customer_churn_auto`. The input arguments for the prediction function correspond to the types of the features, such as varchar for the `state` and integer for `account_length`. The output of the prediction function is the same type as the TARGET column of the CREATE MODEL statement.

1. You trained the model on data from before 2020-01-01, so now you use the prediction function on the testing set. The following query displays the predictions of whether customers who signed up after 2020-01-01 will go through churn or not.

   ```
   SELECT
       phone,
       ml_fn_customer_churn_auto(
           state,
           account_length,
           area_code,
           total_charge / account_length,
           cust_serv_calls / account_length
       ) AS active
   FROM
       customer_activity
   WHERE
       record_date > '2020-01-01';
   ```

1. The following example uses the same prediction function for a different use case. In this case, Amazon Redshift predicts the proportion of churners and non-churners among customers from different states where the record date is greater than 2020-01-01.

   ```
   WITH predicted AS (
       SELECT
           state,
           ml_fn_customer_churn_auto(
               state,
               account_length,
               area_code,
               total_charge / account_length,
               cust_serv_calls / account_length
           ) :: varchar(6) AS active
       FROM
           customer_activity
       WHERE
           record_date > '2020-01-01'
   )
   SELECT
       state,
       SUM(
           CASE
               WHEN active = 'True.' THEN 1
               ELSE 0
           END
       ) AS churners,
       SUM(
           CASE
               WHEN active = 'False.' THEN 1
               ELSE 0
           END
       ) AS nonchurners,
       COUNT(*) AS total_per_state
   FROM
       predicted
   GROUP BY
       state
   ORDER BY
       state;
   ```

1. The following example uses the prediction function for the use case of predicting the percentage of customers who churn in a state. In this case, Amazon Redshift predicts the churn percentage where the record date is greater than 2020-01-01.

   ```
   WITH predicted AS (
       SELECT
           state,
           ml_fn_customer_churn_auto(
               state,
               account_length,
               area_code,
               total_charge / account_length,
               cust_serv_calls / account_length
           ) :: varchar(6) AS active
       FROM
           customer_activity
       WHERE
           record_date > '2020-01-01'
   )
   SELECT
       state,
       CAST((CAST((SUM(
           CASE
               WHEN active = 'True.' THEN 1
               ELSE 0
           END
       )) AS FLOAT) / CAST(COUNT(*) AS FLOAT)) AS DECIMAL (3, 2)) AS pct_churn,
       COUNT(*) AS total_customers_per_state
   FROM
       predicted
   GROUP BY
       state
   ORDER BY
       3 DESC;
   ```

## Related topics
<a name="tutorial_customer_churn_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon RedshiftML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL command](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN\$1MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building K-means clustering models
<a name="tutorial_k-means_clustering"></a>

In this tutorial, you use Amazon Redshift ML to create, train, and deploy a machine learning model based on the [K-means algorithm](url-sm-dev;k-means.html). This algorithm solves clustering problems where you want to discover groupings in the data. K-means helps in grouping data that has not been labeled yet. To learn more about K-means clustering, see [How K-means Clustering Works](https://docs.aws.amazon.com/sagemaker/latest/dg/algo-kmeans-tech-notes.html)in the Amazon SageMaker AI Developer Guide.

You will use a CREATE MODEL operation to create a K-means model from a Amazon Redshift cluster. You can use a CREATE MODEL command to export training data, train a model, import the model, and prepare an Amazon Redshift prediction function. Use the CREATE MODEL operation to specify training data either as a table or a SELECT statement.

In this tutorial, you use K-means on the [Global Database of Events, Language, and Tone (GDELT)](https://aws.amazon.com/public-datasets/gdelt/) dataset, which monitors world news across the world, and the data is stored for every second of every day. K-means will group events that have similar tone, actors, or locations. The data is stored as multiple files on Amazon Simple Storage Service, in two different folders. The folders are historical, which cover the years 1979–2013, and daily updates, which cover the years 2013 and later. For this example, we use the historical format and bring in 1979 data.

## Use case examples
<a name="tutorial_k-means_clustering_tasks"></a>

You can solve other clustering problems with Amazon Redshift ML, such as grouping customers who have similar viewing habits on a streaming service. You could also use Redshift ML to predict the optimal number of shipping centers for a delivery service.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Perform predictions with the model

## Prerequisites
<a name="tutorial_k-means_clustering_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_k-means_clustering_step_load"></a>

1. Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following query. The query drops the `gdelt_data` table in the public schema if it exists and creates a table of the same name in the public schema.

   ```
   DROP TABLE IF EXISTS gdelt_data CASCADE;
   
   CREATE TABLE gdelt_data (
       GlobalEventId bigint,
       SqlDate bigint,
       MonthYear bigint,
       Year bigint,
       FractionDate double precision,
       Actor1Code varchar(256),
       Actor1Name varchar(256),
       Actor1CountryCode varchar(256),
       Actor1KnownGroupCode varchar(256),
       Actor1EthnicCode varchar(256),
       Actor1Religion1Code varchar(256),
       Actor1Religion2Code varchar(256),
       Actor1Type1Code varchar(256),
       Actor1Type2Code varchar(256),
       Actor1Type3Code varchar(256),
       Actor2Code varchar(256),
       Actor2Name varchar(256),
       Actor2CountryCode varchar(256),
       Actor2KnownGroupCode varchar(256),
       Actor2EthnicCode varchar(256),
       Actor2Religion1Code varchar(256),
       Actor2Religion2Code varchar(256),
       Actor2Type1Code varchar(256),
       Actor2Type2Code varchar(256),
       Actor2Type3Code varchar(256),
       IsRootEvent bigint,
       EventCode bigint,
       EventBaseCode bigint,
       EventRootCode bigint,
       QuadClass bigint,
       GoldsteinScale double precision,
       NumMentions bigint,
       NumSources bigint,
       NumArticles bigint,
       AvgTone double precision,
       Actor1Geo_Type bigint,
       Actor1Geo_FullName varchar(256),
       Actor1Geo_CountryCode varchar(256),
       Actor1Geo_ADM1Code varchar(256),
       Actor1Geo_Lat double precision,
       Actor1Geo_Long double precision,
       Actor1Geo_FeatureID bigint,
       Actor2Geo_Type bigint,
       Actor2Geo_FullName varchar(256),
       Actor2Geo_CountryCode varchar(256),
       Actor2Geo_ADM1Code varchar(256),
       Actor2Geo_Lat double precision,
       Actor2Geo_Long double precision,
       Actor2Geo_FeatureID bigint,
       ActionGeo_Type bigint,
       ActionGeo_FullName varchar(256),
       ActionGeo_CountryCode varchar(256),
       ActionGeo_ADM1Code varchar(256),
       ActionGeo_Lat double precision,
       ActionGeo_Long double precision,
       ActionGeo_FeatureID bigint,
       DATEADDED bigint
   );
   ```

1. The following query loads the sample data into the `gdelt_data` table.

   ```
   COPY gdelt_data
   FROM 's3://gdelt-open-data/events/1979.csv' 
   REGION 'us-east-1'
   IAM_ROLE default 
   CSV
   DELIMITER '\t';
   ```

### Examine the training data (optional)
<a name="tutorial_k-means_clustering_examine"></a>

To see what data your model will be trained on, use the following query.

```
SELECT
    AvgTone,
    EventCode,
    NumArticles,
    Actor1Geo_Lat,
    Actor1Geo_Long,
    Actor2Geo_Lat,
    Actor2Geo_Long
FROM
    gdelt_data LIMIT 100;
```

## Step 2: Create the machine learning model
<a name="tutorial_k-means_clustering_create_model"></a>

The following example uses the CREATE MODEL command to create a model that groups the data into seven clusters. The K value is the number of clusters that your data points are divided into. The model classifies your data points into clusters where data points are more similar to each other. By clustering the data points into groups, the K-Means algorithm iteratively determines the best cluster center. The algorithm then assigns each data point to the closest cluster center. Members nearest the same cluster center belong to the same group. Members of a group are as similar as possible to other members in the same group, and as different as possible from members of other groups. The K value is subjective and depends on methods that measure the similarities among data points. You can change the K value to smooth out cluster sizes if the clusters are unevenly distributed.

In the following example, replace amzn-s3-demo-bucket with your own Amazon S3 bucket.

```
CREATE MODEL news_data_clusters
FROM
    (
        SELECT
            AvgTone,
            EventCode,
            NumArticles,
            Actor1Geo_Lat,
            Actor1Geo_Long,
            Actor2Geo_Lat,
            Actor2Geo_Long
        FROM
            gdelt_data
    ) FUNCTION news_monitoring_cluster 
    IAM_ROLE default 
    AUTO OFF 
    MODEL_TYPE KMEANS 
    PREPROCESSORS 'none' 
    HYPERPARAMETERS DEFAULT
    EXCEPT
    (K '7') 
    SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');
```

### Check the status of model training (optional)
<a name="tutorial_k-means_clustering_check_status"></a>

You can use the SHOW MODEL command to know when your model is ready.

To check the model status, use the following SHOW MODEL operation and find if the `Model State` is `Ready`.

```
SHOW MODEL NEWS_DATA_CLUSTERS;
```

When the model is ready, the output of the previous operation should show that the `Model State` is `Ready`. The following is an example of the output of the SHOW MODEL operation.

```
+--------------------------+------------------------------------------------------------------------------------------------------+
|        Model Name        |                                          news_data_clusters                                          |
+--------------------------+------------------------------------------------------------------------------------------------------+
|       Schema Name        |                                                public                                                |
|          Owner           |                                               awsuser                                                |
|      Creation Time       |                                       Fri, 17.06.2022 16:32:19                                       |
|       Model State        |                                                READY                                                 |
|        train:msd         |                                             2973.822754                                              |
|      train:progress      |                                              100.000000                                              |
|     train:throughput     |                                            237114.875000                                             |
|      Estimated Cost      |                                               0.004983                                               |
|                          |                                                                                                      |
|      TRAINING DATA:      |                                                                                                      |
|          Query           | SELECT AVGTONE, EVENTCODE, NUMARTICLES, ACTOR1GEO_LAT, ACTOR1GEO_LONG, ACTOR2GEO_LAT, ACTOR2GEO_LONG |
|                          |                                           FROM GDELT_DATA                                            |
|                          |                                                                                                      |
|       PARAMETERS:        |                                                                                                      |
|        Model Type        |                                                kmeans                                                |
|    Training Job Name     |                                redshiftml-20220617163219978978-kmeans                                |
|      Function Name       |                                       news_monitoring_cluster                                        |
|   Function Parameters    |       avgtone eventcode numarticles actor1geo_lat actor1geo_long actor2geo_lat actor2geo_long        |
| Function Parameter Types |                             float8 int8 int8 float8 float8 float8 float8                             |
|         IAM Role         |                                         default-aws-iam-role                                         |
|        S3 Bucket         |                                            amzn-s3-demo-bucket                                       |
|       Max Runtime        |                                                 5400                                                 |
|                          |                                                                                                      |
|     HYPERPARAMETERS:     |                                                                                                      |
|       feature_dim        |                                                  7                                                   |
|            k             |                                                  7                                                   |
+--------------------------+------------------------------------------------------------------------------------------------------+
```

## Step 3: Perform predictions with the model
<a name="tutorial_k-means_clustering_step_perform_predictions"></a>

### Identify the clusters
<a name="tutorial_k-means_clustering_identify_clusters"></a>

You can find discrete groupings identified in the data by your model, otherwise known as clusters. A cluster is the set of data points that is closer to its cluster center than any other cluster center. Since the K value represents the number of clusters in the model, it also represents the number of cluster centers. The following query identifies the clusters by showing the cluster associated with each `globaleventid`.

```
SELECT
    globaleventid,
    news_monitoring_cluster (
        AvgTone,
        EventCode,
        NumArticles,
        Actor1Geo_Lat,
        Actor1Geo_Long,
        Actor2Geo_Lat,
        Actor2Geo_Long
    ) AS cluster
FROM
    gdelt_data;
```

### Check the distribution of data
<a name="tutorial_k-means_clustering_check_distribution"></a>

You can check the distribution of data across clusters to see if the K value that you chose caused the data to be somewhat evenly distributed. Use the following query to determine if the data is evenly distributed across your clusters.

```
SELECT
    events_cluster,
    COUNT(*) AS nbr_events
FROM
    (
        SELECT
            globaleventid,
            news_monitoring_cluster(
                AvgTone,
                EventCode,
                NumArticles,
                Actor1Geo_Lat,
                Actor1Geo_Long,
                Actor2Geo_Lat,
                Actor2Geo_Long
            ) AS events_cluster
        FROM
            gdelt_data
    )
GROUP BY
    1;
```

Note that you can change the K value to smooth out cluster sizes if the clusters are unevenly distributed.

### Determine the cluster centers
<a name="tutorial_k-means_clustering_determine_centers"></a>

A data point is closer to its cluster center than it is to any other cluster center. Thus, finding the cluster centers helps you define the clusters.

Run the following query to determine the centers of the clusters based on the number of articles by event code.

```
SELECT
    news_monitoring_cluster (
        AvgTone,
        EventCode,
        NumArticles,
        Actor1Geo_Lat,
        Actor1Geo_Long,
        Actor2Geo_Lat,
        Actor2Geo_Long
    ) AS events_cluster,
    eventcode,
    SUM(numArticles) AS numArticles
FROM
    gdelt_data
GROUP BY
    1,
    2;
```

### Show information about data points in a cluster
<a name="tutorial_k-means_clustering_data_points_info"></a>

Use the following query to return the data for the points assigned to the fifth cluster. The selected articles must have two actors.

```
SELECT
    news_monitoring_cluster (
        AvgTone,
        EventCode,
        NumArticles,
        Actor1Geo_Lat,
        Actor1Geo_Long,
        Actor2Geo_Lat,
        Actor2Geo_Long
    ) AS events_cluster,
    eventcode,
    actor1name,
    actor2name,
    SUM(numarticles) AS totalarticles
FROM
    gdelt_data
WHERE
    events_cluster = 5
    AND actor1name <> ' '
    AND actor2name <> ' '
GROUP BY
    1,
    2,
    3,
    4
ORDER BY
    5 desc;
```

### Show data about events with actors of the same ethnic code
<a name="tutorial_k-means_clustering_show_events_data"></a>

The following query counts the number of articles written about events with a positive tone. The query also requires that the two actors have the same ethnic code and it returns which cluster each event is assigned to.

```
SELECT
    news_monitoring_cluster (
        AvgTone,
        EventCode,
        NumArticles,
        Actor1Geo_Lat,
        Actor1Geo_Long,
        Actor2Geo_Lat,
        Actor2Geo_Long
    ) AS events_cluster,
    SUM(numarticles) AS total_articles,
    eventcode AS event_code,
    Actor1EthnicCode AS ethnic_code
FROM
    gdelt_data
WHERE
    Actor1EthnicCode = Actor2EthnicCode
    AND Actor1EthnicCode <> ' '
    AND Actor2EthnicCode <> ' '
    AND AvgTone > 0
GROUP BY
    1,
    3,
    4
HAVING
    (total_articles) > 4
ORDER BY
    1,
    2 ASC;
```

## Related topics
<a name="tutorial_k-means_clustering_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [ Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [ EXPLAIN\$1MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building multi-class classification models
<a name="tutorial_multi-class_classification"></a>

In this tutorial, you use Amazon Redshift ML to create a machine learning model that solves multi-class classification problems. The multi-class classification algorithm classifies data points into one of three or more classes. Then, you implement queries using the SQL function that the CREATE MODEL command generates.

You can use a CREATE MODEL command to export training data, train a model, import the model, and prepare an Amazon Redshift prediction function. Use the CREATE MODEL operation to specify training data either as a table or a SELECT statement.

To follow along with the tutorial, you use the public dataset [E-Commerce Sales Forecast](https://www.kaggle.com/allunia/e-commerce-sales-forecast), which includes sales data of an online UK retailer. The model you generate will target the most active customers for a special customer loyalty program. With multi-class classification, you can use the model to predict how many months a customer will be active over a 13-month period. The prediction function designates customers who are predicted to be active for 7 or more months for admission to the program.

## Use case examples
<a name="tutorial_multi-class_classification_tasks"></a>

You can solve other multi-class classification problems with Amazon Redshift ML, such as predicting the best-selling product from a product line. You could also predict which fruit an image contains, such as selecting apples or pears or oranges.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Perform predictions with the model

## Prerequisites
<a name="tutorial_multi-class_classification_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_multi-class_classification_step_load"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following queries. These queries load the sample data into Amazon Redshift.

1. The following query creates a table named `ecommerce_sales`.

   ```
   CREATE TABLE IF NOT EXISTS ecommerce_sales (
       invoiceno VARCHAR(30),
       stockcode VARCHAR(30),
       description VARCHAR(60),
       quantity DOUBLE PRECISION,
       invoicedate VARCHAR(30),
       unitprice DOUBLE PRECISION,
       customerid BIGINT,
       country VARCHAR(25)
   );
   ```

1. The following query copies the sample data from the [E-Commerce Sales Forecast dataset](https://www.kaggle.com/allunia/e-commerce-sales-forecast) into the `ecommerce_sales`table.

   ```
   COPY ecommerce_sales
   FROM
       's3://redshift-ml-multiclass/ecommerce_data.txt' 
   IAM_ROLE default 
   DELIMITER '\t' 
   IGNOREHEADER 1 
   REGION 'us-east-1' 
   MAXERROR 100;
   ```

### Split the data
<a name="tutorial_multi-class_classification_split_data"></a>

When you create a model in Amazon Redshift ML, SageMaker AI automatically splits your data into training and test sets, so that SageMaker AI can determine the model accuracy. By manually splitting the data at this step, you will be able to verify the accuracy of the model by allocating an additional prediction set. 

Use the following SQL statement to split the data into three sets for training, validation, and prediction.

```
--creates table with all data
CREATE TABLE ecommerce_sales_data AS (
    SELECT
        t1.stockcode,
        t1.description,
        t1.invoicedate,
        t1.customerid,
        t1.country,
        t1.sales_amt,
        CAST(RANDOM() * 100 AS INT) AS data_group_id
    FROM
        (
            SELECT
                stockcode,
                description,
                invoicedate,
                customerid,
                country,
                SUM(quantity * unitprice) AS sales_amt
            FROM
                ecommerce_sales
            GROUP BY
                1,
                2,
                3,
                4,
                5
        ) t1
);

--creates training set
CREATE TABLE ecommerce_sales_training AS (
    SELECT
        a.customerid,
        a.country,
        a.stockcode,
        a.description,
        a.invoicedate,
        a.sales_amt,
        (b.nbr_months_active) AS nbr_months_active
    FROM
        ecommerce_sales_data a
        INNER JOIN (
            SELECT
                customerid,
                COUNT(
                    DISTINCT(
                        DATE_PART(y, CAST(invoicedate AS DATE)) || '-' || LPAD(
                            DATE_PART(mon, CAST(invoicedate AS DATE)),
                            2,
                            '00'
                        )
                    )
                ) AS nbr_months_active
            FROM
                ecommerce_sales_data
            GROUP BY
                1
        ) b ON a.customerid = b.customerid
    WHERE
        a.data_group_id < 80
);

--creates validation set
CREATE TABLE ecommerce_sales_validation AS (
    SELECT
        a.customerid,
        a.country,
        a.stockcode,
        a.description,
        a.invoicedate,
        a.sales_amt,
        (b.nbr_months_active) AS nbr_months_active
    FROM
        ecommerce_sales_data a
        INNER JOIN (
            SELECT
                customerid,
                COUNT(
                    DISTINCT(
                        DATE_PART(y, CAST(invoicedate AS DATE)) || '-' || LPAD(
                            DATE_PART(mon, CAST(invoicedate AS DATE)),
                            2,
                            '00'
                        )
                    )
                ) AS nbr_months_active
            FROM
                ecommerce_sales_data
            GROUP BY
                1
        ) b ON a.customerid = b.customerid
    WHERE
        a.data_group_id BETWEEN 80
        AND 90
);

--creates prediction set
CREATE TABLE ecommerce_sales_prediction AS (
    SELECT
        customerid,
        country,
        stockcode,
        description,
        invoicedate,
        sales_amt
    FROM
        ecommerce_sales_data
    WHERE
        data_group_id > 90);
```

## Step 2: Create the machine learning model
<a name="tutorial_multi-class_classification_step_create_model"></a>

In this step, you use the CREATE MODEL statement to create your machine learning model using multi-class classification. 

The following query creates the multi-class classification model with the training set using the CREATE MODEL operation. Replace amzn-s3-demo-bucket with your own Amazon S3 bucket.

```
CREATE MODEL ecommerce_customer_activity
FROM
    (
        SELECT
            customerid,
            country,
            stockcode,
            description,
            invoicedate,
            sales_amt,
            nbr_months_active
        FROM
            ecommerce_sales_training
    ) TARGET nbr_months_active FUNCTION predict_customer_activity IAM_ROLE default PROBLEM_TYPE MULTICLASS_CLASSIFICATION SETTINGS (
        S3_BUCKET 'amzn-s3-demo-bucket',
        S3_GARBAGE_COLLECT OFF
    );
```

In this query, you specify the problem type as `Multiclass_Classification`. The target that you predict for the model is `nbr_months_active`. When SageMaker AI finishes training the model, it creates the function `predict_customer_activity`, which you will use to make predictions in Amazon Redshift.

### Show the status of model training (optional)
<a name="tutorial_multi-class_classification_show_status"></a>

You can use the SHOW MODEL command to know when your model is ready.

Use the following query to return various metrics of the model, including model state and accuracy.

```
SHOW MODEL ecommerce_customer_activity;
```

When the model is ready, the output of the previous operation should show that the `Model State` is `Ready`. The following is an example of the output of the SHOW MODEL operation.

```
+--------------------------+-----------------------------------------------------------------------------------------------+
|        Model Name        |                                  ecommerce_customer_activity                                  |
+--------------------------+-----------------------------------------------------------------------------------------------+
|       Schema Name        |                                            public                                             |
|          Owner           |                                            awsuser                                            |
|      Creation Time       |                                   Fri, 17.06.2022 19:02:15                                    |
|       Model State        |                                             READY                                             |
|   Training Job Status    |                                  MaxAutoMLJobRuntimeReached                                   |
|   validation:accuracy    |                                           0.991280                                            |
|      Estimated Cost      |                                           7.897689                                            |
|                          |                                                                                               |
|      TRAINING DATA:      |                                                                                               |
|          Query           | SELECT CUSTOMERID, COUNTRY, STOCKCODE, DESCRIPTION, INVOICEDATE, SALES_AMT, NBR_MONTHS_ACTIVE |
|                          |                                 FROM ECOMMERCE_SALES_TRAINING                                 |
|      Target Column       |                                       NBR_MONTHS_ACTIVE                                       |
|                          |                                                                                               |
|       PARAMETERS:        |                                                                                               |
|        Model Type        |                                            xgboost                                            |
|       Problem Type       |                                   MulticlassClassification                                    |
|        Objective         |                                           Accuracy                                            |
|     AutoML Job Name      |                                redshiftml-20220617190215268770                                |
|      Function Name       |                                   predict_customer_activity                                   |
|   Function Parameters    |                customerid country stockcode description invoicedate sales_amt                 |
| Function Parameter Types |                          int8 varchar varchar varchar varchar float8                          |
|         IAM Role         |                                     default-aws-iam-role                                      |
|        S3 Bucket         |                                         amzn-s3-demo-bucket                                    |
|       Max Runtime        |                                             5400                                              |
+--------------------------+-----------------------------------------------------------------------------------------------+
```

## Step 3: Perform predictions with the model
<a name="tutorial_multi-class_classification_step_perform_predictions"></a>

The following query shows which customers qualify for your customer loyalty program. If the model predicts that the customer will be active for at least seven months, then the model selects the customer for the loyalty program.

```
SELECT
    customerid,
    predict_customer_activity(
        customerid,
        country,
        stockcode,
        description,
        invoicedate,
        sales_amt
    ) AS predicted_months_active
FROM
    ecommerce_sales_prediction
WHERE
    predicted_months_active >= 7
GROUP BY
    1,
    2
LIMIT
    10;
```

### Run prediction queries against the validation data (optional)
<a name="tutorial_multi-class_classification_run_validation_prediction"></a>

Run the following prediction queries against the validation data to see the model’s level of accuracy.

```
SELECT
    CAST(SUM(t1.match) AS decimal(7, 2)) AS predicted_matches,
    CAST(SUM(t1.nonmatch) AS decimal(7, 2)) AS predicted_non_matches,
    CAST(SUM(t1.match + t1.nonmatch) AS decimal(7, 2)) AS total_predictions,
    predicted_matches / total_predictions AS pct_accuracy
FROM
    (
        SELECT
            customerid,
            country,
            stockcode,
            description,
            invoicedate,
            sales_amt,
            nbr_months_active,
            predict_customer_activity(
                customerid,
                country,
                stockcode,
                description,
                invoicedate,
                sales_amt
            ) AS predicted_months_active,
            CASE
                WHEN nbr_months_active = predicted_months_active THEN 1
                ELSE 0
            END AS match,
            CASE
                WHEN nbr_months_active <> predicted_months_active THEN 1
                ELSE 0
            END AS nonmatch
        FROM
            ecommerce_sales_validation
    )t1;
```

### Predict how many customers miss entry (optional)
<a name="tutorial_multi-class_classification_run_missed_entries"></a>

The following query compares the number of customers that are predicted to be active for only 5 or 6 months. The model predicts that these customers will miss out on the loyalty program. The query then compares the amount that barely miss the program to the number that are predicted to be eligible for the loyalty program. This query could be used to inform a decision on whether to lower the threshold for the loyalty program. You can also determine if there is a significant amount of customers that are predicted to barely miss out on the program. You could then encourage those customers to increase their activity to get a loyalty program membership.

```
SELECT
    predict_customer_activity(
        customerid,
        country,
        stockcode,
        description,
        invoicedate,
        sales_amt
    ) AS predicted_months_active,
    COUNT(customerid)
FROM
    ecommerce_sales_prediction
WHERE
    predicted_months_active BETWEEN 5 AND 6
GROUP BY
    1
ORDER BY
    1 ASC
LIMIT
    10)
UNION
(SELECT
      NULL AS predicted_months_active,
    COUNT (customerid)
FROM 
    ecommerce_sales_prediction
WHERE
    predict_customer_activity(
        customerid,
        country,
        stockcode,
        description,
        invoicedate,
        sales_amt
    ) >=7);
```

## Related topics
<a name="tutorial_multi-class_classification_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN\$1MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building XGBoost models
<a name="tutorial_xgboost"></a>

In this tutorial, you create a model with data from Amazon S3 and run prediction queries with the model using Amazon Redshift ML. The XGBoost algorithm is an optimized implementation of the gradient boosted trees algorithm. XGBoost handles more data types, relationships, and distributions than other gradient boosted trees algorithms. You can use XGBoost for regression, binary classification, multi-class classification, and ranking problems. For more information about the XGBoost algorithm, see [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) in the Amazon SageMaker AI Developer Guide.

The Amazon Redshift ML `CREATE MODEL` operation with the `AUTO OFF` option currently supports XGBoost as the `MODEL_TYPE`. You can provide relevant information such as the objective and hyperparameters as part of the `CREATE MODEL` command, based on your use case.

In this tutorial, you use the [banknote authentication dataset](https://archive.ics.uci.edu/ml/datasets/banknote+authentication), which is a binary classification problem to predict whether a given banknote is genuine or forged. 

## Use case examples
<a name="tutorial_xgboost_tasks"></a>

You can solve other binary classification problems using Amazon Redshift ML, such as predicting whether a patient is healthy or has a disease. You could also predict whether an email is spam or not spam.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Perform predictions with the model

## Prerequisites
<a name="tutorial_xgboost_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_xgboost_step_load"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following queries.

The following query creates two tables, loads the data from Amazon S3, and splits the data into a training set and a testing set. You will use the training set to train your model and create the prediction function. Then, you will test the prediction function on the testing set.

```
--create training set table
CREATE TABLE banknoteauthentication_train(
    variance FLOAT,
    skewness FLOAT,
    curtosis FLOAT,
    entropy FLOAT,
    class INT
);

--Load into training table
COPY banknoteauthentication_train
FROM
    's3://redshiftbucket-ml-sagemaker/banknote_authentication/train_data/' IAM_ROLE default REGION 'us-west-2' IGNOREHEADER 1 CSV;

--create testing set table
CREATE TABLE banknoteauthentication_test(
    variance FLOAT,
    skewness FLOAT,
    curtosis FLOAT,
    entropy FLOAT,
    class INT
);

--Load data into testing table
COPY banknoteauthentication_test
FROM
    's3://redshiftbucket-ml-sagemaker/banknote_authentication/test_data/' 
    IAM_ROLE default 
    REGION 'us-west-2' 
    IGNOREHEADER 1 
    CSV;
```

## Step 2: Create the machine learning model
<a name="tutorial_xgboost_step_create_model"></a>

The following query creates the XGBoost model in Amazon Redshift ML from the training set you created in the previous step. Replace `amzn-s3-demo-bucket` with your own `S3_BUCKET`, which will store your input datasets and other Redshift ML artifacts.

```
CREATE MODEL model_banknoteauthentication_xgboost_binary
FROM
    banknoteauthentication_train 
    TARGET class 
    FUNCTION func_model_banknoteauthentication_xgboost_binary 
    IAM_ROLE default 
    AUTO OFF 
    MODEL_TYPE xgboost 
    OBJECTIVE 'binary:logistic' 
    PREPROCESSORS 'none' 
    HYPERPARAMETERS DEFAULT
EXCEPT(NUM_ROUND '100') 
SETTINGS(S3_BUCKET 'amzn-s3-demo-bucket');
```

### Show the status of model training (optional)
<a name="tutorial_xgboost_show_status"></a>

You can use the SHOW MODEL command to know when your model is ready.

Use the following query to monitor the progress of the model training.

```
SHOW MODEL model_banknoteauthentication_xgboost_binary;
```

If the model is `READY`, the SHOW MODEL operation also provides the `train:error` metric, as shown in the following example of the output. The `train:error` metric is a measure of accuracy of your model that measures to six decimal places. A value of 0 is most accurate and a value of 1 is least accurate.

```
+--------------------------+--------------------------------------------------+
|        Model Name        |   model_banknoteauthentication_xgboost_binary    |
+--------------------------+--------------------------------------------------+
| Schema Name              | public                                           |
| Owner                    | awsuser                                          |
| Creation Time            | Tue, 21.06.2022 19:07:35                         |
| Model State              | READY                                            |
| train:error              |                                         0.000000 |
| Estimated Cost           |                                         0.006197 |
|                          |                                                  |
| TRAINING DATA:           |                                                  |
| Query                    | SELECT *                                         |
|                          | FROM "BANKNOTEAUTHENTICATION_TRAIN"              |
| Target Column            | CLASS                                            |
|                          |                                                  |
| PARAMETERS:              |                                                  |
| Model Type               | xgboost                                          |
| Training Job Name        | redshiftml-20220621190735686935-xgboost          |
| Function Name            | func_model_banknoteauthentication_xgboost_binary |
| Function Parameters      | variance skewness curtosis entropy               |
| Function Parameter Types | float8 float8 float8 float8                      |
| IAM Role                 | default-aws-iam-role                             |
| S3 Bucket                | amzn-s3-demo-bucket                              |
| Max Runtime              |                                             5400 |
|                          |                                                  |
| HYPERPARAMETERS:         |                                                  |
| num_round                |                                              100 |
| objective                | binary:logistic                                  |
+--------------------------+--------------------------------------------------+
```

## Step 3: Perform predictions with the model
<a name="tutorial_xgboost_step_perform_predictions"></a>

### Check the accuracy of the model
<a name="tutorial_xgboost_check_accuracy"></a>

The following prediction query uses the prediction function created in the previous step to check the accuracy of your model. Run this query on the testing set to make sure the model does not correspond too closely to the training set. This close correspondence is also known as overfitting, and overfitting could cause the model to make unreliable predictions.

```
WITH predict_data AS (
    SELECT
        class AS label,
        func_model_banknoteauthentication_xgboost_binary (variance, skewness, curtosis, entropy) AS predicted,
        CASE
            WHEN label IS NULL THEN 0
            ELSE label
        END AS actual,
        CASE
            WHEN actual = predicted THEN 1 :: INT
            ELSE 0 :: INT
        END AS correct
    FROM
        banknoteauthentication_test
),
aggr_data AS (
    SELECT
        SUM(correct) AS num_correct,
        COUNT(*) AS total
    FROM
        predict_data
)
SELECT
    (num_correct :: FLOAT / total :: FLOAT) AS accuracy
FROM
    aggr_data;
```

### Predict the amount of original and counterfeit banknotes
<a name="tutorial_xgboost_predict_amount"></a>

The following prediction query returns the predicted amount of original and counterfeit banknotes in the testing set.

```
WITH predict_data AS (
    SELECT
        func_model_banknoteauthentication_xgboost_binary(variance, skewness, curtosis, entropy) AS predicted
    FROM
        banknoteauthentication_test
)
SELECT
    CASE
        WHEN predicted = '0' THEN 'Original banknote'
        WHEN predicted = '1' THEN 'Counterfeit banknote'
        ELSE 'NA'
    END AS banknote_authentication,
    COUNT(1) AS count
FROM
    predict_data
GROUP BY
    1;
```

### Find the average observation for an original and a counterfeit banknote
<a name="tutorial_xgboost_find_average_observation"></a>

The following prediction query returns the average value of each feature for banknotes that are predicted to be original and counterfeit in the testing set.

```
WITH predict_data AS (
    SELECT
        func_model_banknoteauthentication_xgboost_binary(variance, skewness, curtosis, entropy) AS predicted,
          variance,
          skewness,
          curtosis,
          entropy
    FROM
        banknoteauthentication_test
)
SELECT
    CASE
        WHEN predicted = '0' THEN 'Original banknote'
        WHEN predicted = '1' THEN 'Counterfeit banknote'
        ELSE 'NA'
    END AS banknote_authentication,
    TRUNC(AVG(variance), 2) AS avg_variance,
    TRUNC(AVG(skewness), 2) AS avg_skewness,
    TRUNC(AVG(curtosis), 2) AS avg_curtosis,
    TRUNC(AVG(entropy), 2) AS avg_entropy
FROM
    predict_data
GROUP BY
    1
ORDER BY
    2;
```

## Related topics
<a name="tutorial_xgboost_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN\$1MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building regression models
<a name="tutorial_regression"></a>

In this tutorial, you use Amazon Redshift ML to create a machine learning regression model and run prediction queries on the model. Regression models allow you to predict numerical outcomes, such as the price of a house, or how many people will use a city’s bike rental service. You use the CREATE MODEL command in Amazon Redshift with your training data. Then, Amazon Redshift ML compiles the model, imports the trained model to Redshift, and prepares a SQL prediction function. You can use the prediction function in SQL queries in Amazon Redshift.

In this tutorial, you will use Amazon Redshift ML to build a regression model that predicts the number of people that use the city of Toronto’s bike sharing service at any given hour of a day. The inputs for the model include holidays and weather conditions. You will use a regression model, because you want a numerical outcome for this problem.

You can use the CREATE MODEL command to export training data, train a model, and make the model available in Amazon Redshift as a SQL function. Use the CREATE MODEL operation to specify training data either as a table or a SELECT statement.

## Use case examples
<a name="tutorial_regression_tasks"></a>

You can solve other regression problems with Amazon Redshift ML, such as predicting a customer’s lifetime value. You could also use Redshift ML to predict the most profitable price and the resulting revenue of a product.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Validate the model

## Prerequisites
<a name="tutorial_regression_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_regression_step_load"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following queries.

1. You must create three tables to load the three public datasets into Amazon Redshift. The datasets are [Toronto Bike Ridership Data](https://open.toronto.ca/dataset/bike-share-toronto-ridership-data/), [historical weather data](https://climate.weather.gc.ca/historical_data/search_historic_data_e.html), and [historical holidays data](https://github.com/uWaterloo/Datasets/blob/master/Holidays/holidays.csv). Run the following query in the Amazon Redshift query editor to create tables named `ridership`, `weather`, and `holiday`.

   ```
   CREATE TABLE IF NOT EXISTS ridership (
       trip_id INT,
       trip_duration_seconds INT,
       trip_start_time timestamp,
       trip_stop_time timestamp,
       from_station_name VARCHAR(50),
       to_station_name VARCHAR(50),
       from_station_id SMALLINT,
       to_station_id SMALLINT,
       user_type VARCHAR(20)
   );
   
   CREATE TABLE IF NOT EXISTS weather (
       longitude_x DECIMAL(5, 2),
       latitude_y DECIMAL(5, 2),
       station_name VARCHAR(20),
       climate_id BIGINT,
       datetime_utc TIMESTAMP,
       weather_year SMALLINT,
       weather_month SMALLINT,
       weather_day SMALLINT,
       time_utc VARCHAR(5),
       temp_c DECIMAL(5, 2),
       temp_flag VARCHAR(1),
       dew_point_temp_c DECIMAL(5, 2),
       dew_point_temp_flag VARCHAR(1),
       rel_hum SMALLINT,
       rel_hum_flag VARCHAR(1),
       precip_amount_mm DECIMAL(5, 2),
       precip_amount_flag VARCHAR(1),
       wind_dir_10s_deg VARCHAR(10),
       wind_dir_flag VARCHAR(1),
       wind_spd_kmh VARCHAR(10),
       wind_spd_flag VARCHAR(1),
       visibility_km VARCHAR(10),
       visibility_flag VARCHAR(1),
       stn_press_kpa DECIMAL(5, 2),
       stn_press_flag VARCHAR(1),
       hmdx SMALLINT,
       hmdx_flag VARCHAR(1),
       wind_chill VARCHAR(10),
       wind_chill_flag VARCHAR(1),
       weather VARCHAR(10)
   );
   
   CREATE TABLE IF NOT EXISTS holiday (holiday_date DATE, description VARCHAR(100));
   ```

1. The following query loads the sample data into the tables that you created in the previous step.

   ```
   COPY ridership
   FROM
       's3://redshift-ml-bikesharing-data/bike-sharing-data/ridership/' 
       IAM_ROLE default 
       FORMAT CSV 
       IGNOREHEADER 1 
       DATEFORMAT 'auto' 
       TIMEFORMAT 'auto' 
       REGION 'us-west-2' 
       gzip;
   
   COPY weather
   FROM
       's3://redshift-ml-bikesharing-data/bike-sharing-data/weather/' 
       IAM_ROLE default 
       FORMAT csv 
       IGNOREHEADER 1 
       DATEFORMAT 'auto' 
       TIMEFORMAT 'auto' 
       REGION 'us-west-2' 
       gzip;
   
   COPY holiday
   FROM
       's3://redshift-ml-bikesharing-data/bike-sharing-data/holiday/' 
       IAM_ROLE default 
       FORMAT csv 
       IGNOREHEADER 1 
       DATEFORMAT 'auto' 
       TIMEFORMAT 'auto' 
       REGION 'us-west-2' 
       gzip;
   ```

1. The following query performs transformations on the `ridership` and `weather` datasets to remove bias or anomalies. Removing bias and anomalies results in improved model accuracy. The query simplifies the tables by creating two new views called `ridership_view` and `weather_view`.

   ```
   CREATE
   OR REPLACE VIEW ridership_view AS
   SELECT
       trip_time,
       trip_count,
       TO_CHAR(trip_time, 'hh24') :: INT trip_hour,
       TO_CHAR(trip_time, 'dd') :: INT trip_day,
       TO_CHAR(trip_time, 'mm') :: INT trip_month,
       TO_CHAR(trip_time, 'yy') :: INT trip_year,
       TO_CHAR(trip_time, 'q') :: INT trip_quarter,
       TO_CHAR(trip_time, 'w') :: INT trip_month_week,
       TO_CHAR(trip_time, 'd') :: INT trip_week_day
   FROM
       (
           SELECT
               CASE
                   WHEN TRUNC(r.trip_start_time) < '2017-07-01' :: DATE THEN CONVERT_TIMEZONE(
                       'US/Eastern',
                       DATE_TRUNC('hour', r.trip_start_time)
                   )
                   ELSE DATE_TRUNC('hour', r.trip_start_time)
               END trip_time,
               COUNT(1) trip_count
           FROM
               ridership r
           WHERE
               r.trip_duration_seconds BETWEEN 60
               AND 60 * 60 * 24
           GROUP BY
               1
       );
   
   CREATE
   OR REPLACE VIEW weather_view AS
   SELECT
       CONVERT_TIMEZONE(
           'US/Eastern',
           DATE_TRUNC('hour', datetime_utc)
       ) daytime,
       ROUND(AVG(temp_c)) temp_c,
       ROUND(AVG(precip_amount_mm)) precip_amount_mm
   FROM
       weather
   GROUP BY
       1;
   ```

1. The following query creates a table that combines all the relevant input attributes from `ridership_view` and `weather_view` into the `trip_data` table.

   ```
   CREATE TABLE trip_data AS
   SELECT
       r.trip_time,
       r.trip_count,
       r.trip_hour,
       r.trip_day,
       r.trip_month,
       r.trip_year,
       r.trip_quarter,
       r.trip_month_week,
       r.trip_week_day,
       w.temp_c,
       w.precip_amount_mm,CASE
           WHEN h.holiday_date IS NOT NULL THEN 1
           WHEN TO_CHAR(r.trip_time, 'D') :: INT IN (1, 7) THEN 1
           ELSE 0
       END is_holiday,
       ROW_NUMBER() OVER (
           ORDER BY
               RANDOM()
       ) serial_number
   FROM
       ridership_view r
       JOIN weather_view w ON (r.trip_time = w.daytime)
       LEFT OUTER JOIN holiday h ON (TRUNC(r.trip_time) = h.holiday_date);
   ```

### View the sample data (optional)
<a name="tutorial_regression_view_data"></a>

The following query shows entries from the table. You can run this operation to make sure the table was made correctly.

```
SELECT * 
FROM trip_data 
LIMIT 5;
```

The following is an example of the output of the previous operation.

```
+---------------------+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
|      trip_time      | trip_count | trip_hour | trip_day | trip_month | trip_year | trip_quarter | trip_month_week | trip_week_day | temp_c | precip_amount_mm | is_holiday | serial_number |
+---------------------+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
| 2017-03-21 22:00:00 |         47 |        22 |       21 |          3 |        17 |            1 |               3 |             3 |      1 |                0 |          0 |             1 |
| 2018-05-04 01:00:00 |         19 |         1 |        4 |          5 |        18 |            2 |               1 |             6 |     12 |                0 |          0 |             3 |
| 2018-01-11 10:00:00 |         93 |        10 |       11 |          1 |        18 |            1 |               2 |             5 |      9 |                0 |          0 |             5 |
| 2017-10-28 04:00:00 |         20 |         4 |       28 |         10 |        17 |            4 |               4 |             7 |     11 |                0 |          1 |             7 |
| 2017-12-31 21:00:00 |         11 |        21 |       31 |         12 |        17 |            4 |               5 |             1 |    -15 |                0 |          1 |             9 |
+---------------------+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
```

### Show the correlation between attributes (optional)
<a name="tutorial_regression_show_correlation"></a>

Determining correlation helps you measure the strength of association between attributes. The level of association can help you determine what affects your target output. In this tutorial, the target output is `trip_count`.

The following query creates or replaces the `sp_correlation` procedure. You use the stored procedure called `sp_correlation` to show the correlation between an attribute and other attributes in a table in Amazon Redshift.

```
CREATE OR REPLACE PROCEDURE sp_correlation(source_schema_name in varchar(255), source_table_name in varchar(255), target_column_name in varchar(255), output_temp_table_name inout varchar(255)) AS $$
DECLARE
  v_sql varchar(max);
  v_generated_sql varchar(max);
  v_source_schema_name varchar(255)=lower(source_schema_name);
  v_source_table_name varchar(255)=lower(source_table_name);
  v_target_column_name varchar(255)=lower(target_column_name);
BEGIN
  EXECUTE 'DROP TABLE IF EXISTS ' || output_temp_table_name;
  v_sql = '
SELECT
  ''CREATE temp table '|| output_temp_table_name||' AS SELECT ''|| outer_calculation||
  '' FROM (SELECT COUNT(1) number_of_items, SUM('||v_target_column_name||') sum_target, SUM(POW('||v_target_column_name||',2)) sum_square_target, POW(SUM('||v_target_column_name||'),2) square_sum_target,''||
  inner_calculation||
  '' FROM (SELECT ''||
  column_name||
  '' FROM '||v_source_table_name||'))''
FROM
  (
  SELECT
    DISTINCT
    LISTAGG(outer_calculation,'','') OVER () outer_calculation
    ,LISTAGG(inner_calculation,'','') OVER () inner_calculation
    ,LISTAGG(column_name,'','') OVER () column_name
  FROM
    (
    SELECT
      CASE WHEN atttypid=16 THEN ''DECODE(''||column_name||'',true,1,0)'' ELSE column_name END column_name
      ,atttypid
      ,''CAST(DECODE(number_of_items * sum_square_''||rn||'' - square_sum_''||rn||'',0,null,(number_of_items*sum_target_''||rn||'' - sum_target * sum_''||rn||
        '')/SQRT((number_of_items * sum_square_target - square_sum_target) * (number_of_items * sum_square_''||rn||
        '' - square_sum_''||rn||''))) AS numeric(5,2)) ''||column_name outer_calculation
      ,''sum(''||column_name||'') sum_''||rn||'',''||
            ''SUM(trip_count*''||column_name||'') sum_target_''||rn||'',''||
            ''SUM(POW(''||column_name||'',2)) sum_square_''||rn||'',''||
            ''POW(SUM(''||column_name||''),2) square_sum_''||rn inner_calculation
    FROM
      (
      SELECT
        row_number() OVER (order by a.attnum) rn
        ,a.attname::VARCHAR column_name
        ,a.atttypid
      FROM pg_namespace AS n
        INNER JOIN pg_class AS c ON n.oid = c.relnamespace
        INNER JOIN pg_attribute AS a ON c.oid = a.attrelid
      WHERE a.attnum > 0
        AND n.nspname = '''||v_source_schema_name||'''
        AND c.relname = '''||v_source_table_name||'''
        AND a.atttypid IN (16,20,21,23,700,701,1700)
      )
    )
)';
  EXECUTE v_sql INTO v_generated_sql;
  EXECUTE  v_generated_sql;
END;
$$ LANGUAGE plpgsql;
```

The following query shows the correlation between the target column, `trip_count`, and other numeric attributes in our dataset.

```
call sp_correlation(
    'public',
    'trip_data',
    'trip_count',
    'tmp_corr_table'
);

SELECT
    *
FROM
    tmp_corr_table;
```

The following example shows the output of the previous `sp_correlation` operation.

```
+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
| trip_count | trip_hour | trip_day | trip_month | trip_year | trip_quarter | trip_month_week | trip_week_day | temp_c | precip_amount_mm | is_holiday | serial_number |
+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
|          1 |      0.32 |     0.01 |       0.18 |      0.12 |         0.18 |               0 |          0.02 |   0.53 |            -0.07 |      -0.13 |             0 |
+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
```

## Step 2: Create the machine learning model
<a name="tutorial_regression_create_model"></a>

1. The following query splits your data into a training set and a validation set by designating 80% of the dataset for training and 20% for validation. The training set is the input for the ML model to identify the best possible algorithm for the model. After the model is created, you use the validation set to validate the model accuracy.

   ```
   CREATE TABLE training_data AS
   SELECT
       trip_count,
       trip_hour,
       trip_day,
       trip_month,
       trip_year,
       trip_quarter,
       trip_month_week,
       trip_week_day,
       temp_c,
       precip_amount_mm,
       is_holiday
   FROM
       trip_data
   WHERE
       serial_number > (
           SELECT
               COUNT(1) * 0.2
           FROM
               trip_data
       );
   
   CREATE TABLE validation_data AS
   SELECT
       trip_count,
       trip_hour,
       trip_day,
       trip_month,
       trip_year,
       trip_quarter,
       trip_month_week,
       trip_week_day,
       temp_c,
       precip_amount_mm,
       is_holiday,
       trip_time
   FROM
       trip_data
   WHERE
       serial_number <= (
           SELECT
               COUNT(1) * 0.2
           FROM
               trip_data
       );
   ```

1. The following query creates a regression model to predict the `trip_count` value for any input date and time. In the following example, replace amzn-s3-demo-bucket with your own S3 bucket.

   ```
   CREATE MODEL predict_rental_count
   FROM
       training_data TARGET trip_count FUNCTION predict_rental_count 
       IAM_ROLE default 
       PROBLEM_TYPE regression 
       OBJECTIVE 'mse' 
       SETTINGS (
           s3_bucket 'amzn-s3-demo-bucket',
           s3_garbage_collect off,
           max_runtime 5000
       );
   ```

## Step 3: Validate the model
<a name="tutorial_regression_step_validate"></a>

1. Use the following query to output aspects of the model, and find the mean square error metric in the output. Mean square error is a typical accuracy metric for regression problems.

   ```
   show model predict_rental_count;
   ```

1. Run the following prediction queries against the validation data to compare the predicted trip count to the actual trip count.

   ```
   SELECT
       trip_time,
       actual_count,
       predicted_count,
       (actual_count - predicted_count) difference
   FROM
       (
           SELECT
               trip_time,
               trip_count AS actual_count,
               PREDICT_RENTAL_COUNT (
                   trip_hour,
                   trip_day,
                   trip_month,
                   trip_year,
                   trip_quarter,
                   trip_month_week,
                   trip_week_day,
                   temp_c,
                   precip_amount_mm,
                   is_holiday
               ) predicted_count
           FROM
               validation_data
       )
   LIMIT
       5;
   ```

1. The following query calculates the mean square error and root mean square error based on your validation data. You use mean square error and root mean square error to measure the distance between the predicted numeric target and the actual numeric answer. A good model has a low score in both metrics. The following query returns the value of both metrics.

   ```
   SELECT
       ROUND(
           AVG(POWER((actual_count - predicted_count), 2)),
           2
       ) mse,
       ROUND(
           SQRT(AVG(POWER((actual_count - predicted_count), 2))),
           2
       ) rmse
   FROM
       (
           SELECT
               trip_time,
               trip_count AS actual_count,
               PREDICT_RENTAL_COUNT (
                   trip_hour,
                   trip_day,
                   trip_month,
                   trip_year,
                   trip_quarter,
                   trip_month_week,
                   trip_week_day,
                   temp_c,
                   precip_amount_mm,
                   is_holiday
               ) predicted_count
           FROM
               validation_data
       );
   ```

1. The following query calculates the percent error in trip count for each trip time on 2017-01-01. The query orders the trip times from the time with the lowest percent error to the time with the highest percent error.

   ```
   SELECT
       trip_time,
       CAST(ABS(((actual_count - predicted_count) / actual_count)) * 100 AS DECIMAL (7,2)) AS pct_error
   FROM
       (
           SELECT
               trip_time,
               trip_count AS actual_count,
               PREDICT_RENTAL_COUNT (
                   trip_hour,
                   trip_day,
                   trip_month,
                   trip_year,
                   trip_quarter,
                   trip_month_week,
                   trip_week_day,
                   temp_c,
                   precip_amount_mm,
                   is_holiday
               ) predicted_count
           FROM
               validation_data
       )
   WHERE
      trip_time LIKE '2017-01-01 %%:%%:%%'
   ORDER BY
      2 ASC;
   ```

## Related topics
<a name="tutorial_regression_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN\$1MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building regression models with linear learner
<a name="tutorial_linear_learner_regression"></a>

In this tutorial, you create a linear learner model with data from Amazon S3 and run prediction queries with the model using Amazon Redshift ML. The SageMaker AI linear learner algorithm solves either regression or multi-class classification problems. To learn more about regression and multi-class classification problems, see [Problem types for the machine learning paradigms](https://docs.aws.amazon.com/sagemaker/latest/dg/algorithms-choose.html#basic-machine-learning-paradigms) in the Amazon SageMaker AI Developer Guide. In this tutorial, you solve a regression problem. The linear learner algorithm trains many models in parallel, and automatically determines the most optimized model. You use the CREATE MODEL operation in Amazon Redshift, which creates your linear learner model using SageMaker AI and sends a prediction function to Amazon Redshift. For more information about the linear learner algorithm, see [Linear Learner Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) in the Amazon SageMaker AI Developer Guide.

You can use a CREATE MODEL command to export training data, train a model, import the model, and prepare an Amazon Redshift prediction function. Use the CREATE MODEL operation to specify training data either as a table or a SELECT statement.

Linear learner models optimize either continuous objectives or discrete objectives. Continuous objectives are used for regression, while discrete variables are used for classification. Some methods provide a solution for only continuous objectives, such as the regression method. The linear learner algorithm provides an increase in speed over naive hyperparameter optimization techniques, such as the Naive Bayes technique. A naive optimization technique assumes that each input variable is independent. To use the linear learner algorithm, you must provide columns representing the dimensions of the inputs, and rows representing the observations. For more information about the linear learner algorithm, see the [Linear Learner Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) in the Amazon SageMaker AI Developer Guide. 

In this tutorial, you build a linear learner model that predicts the age of abalone. You use the CREATE MODEL command on the [Abalone dataset](http://archive.ics.uci.edu/ml/datasets/Abalone) to determine the relationship between the physical measurements of abalone. Then, you use the model to determine the age of abalone.

## Use case examples
<a name="tutorial_linear_learner_regression_tasks"></a>

You can solve other regression problems with linear learner and Amazon Redshift ML, such as predicting the price of a house. You could also use Redshift ML to predict the number of people who will use a city’s bike rental service.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Validate the model

## Prerequisites
<a name="tutorial_linear_learner_regression_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_linear_learner_regression_step_load_data"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following queries. These queries load the sample data into Redshift and divide the data into a training set and a validation set.

1. The following query creates the `abalone_dataset` table.

   ```
   CREATE TABLE abalone_dataset (
       id INT IDENTITY(1, 1),
       Sex CHAR(1),
       Length float,
       Diameter float,
       Height float,
       Whole float,
       Shucked float,
       Viscera float,
       Shell float,
       Rings integer
   );
   ```

1. The following query copies the sample data from the [Abalone dataset](http://archive.ics.uci.edu/ml/datasets/Abalone) in Amazon S3 to the `abalone_dataset` table you created previously in Amazon Redshift.

   ```
   COPY abalone_dataset
   FROM
       's3://redshift-ml-multiclass/abalone.csv' REGION 'us-east-1' IAM_ROLE default CSV IGNOREHEADER 1 NULL AS 'NULL';
   ```

1. By manually splitting the data, you will be able to verify the accuracy of the model by allocating an additional prediction set. The following query splits the data into two sets. The `abalone_training` table is for training and the `abalone_validation` table is for validation.

   ```
   CREATE TABLE abalone_training as 
   SELECT 
       *
   FROM
       abalone_dataset
   WHERE
       mod(id, 10) < 8;
   
   CREATE TABLE abalone_validation as 
   SELECT 
       *
   FROM
       abalone_dataset
   WHERE
       mod(id, 10) >= 8;
   ```

## Step 2: Create the machine learning model
<a name="tutorial_linear_learner_regression_step_create_model"></a>

In this step, you use the CREATE MODEL statement to create your machine learning model with the linear learner algorithm. 

The following query creates the linear learner model with the CREATE MODEL operation using your S3 bucket. Replace amzn-s3-demo-bucket with your own S3 bucket.

```
CREATE MODEL model_abalone_ring_prediction
FROM
    (
        SELECT
            Sex,
            Length,
            Diameter,
            Height,
            Whole,
            Shucked,
            Viscera,
            Shell,
            Rings AS target_label
        FROM
            abalone_training
    ) TARGET target_label FUNCTION f_abalone_ring_prediction IAM_ROLE default MODEL_TYPE LINEAR_LEARNER PROBLEM_TYPE REGRESSION OBJECTIVE 'MSE' SETTINGS (
        S3_BUCKET 'amzn-s3-demo-bucket',
        MAX_RUNTIME 15000
    );
```

### Show the status of model training (optional)
<a name="tutorial_linear_learner_regression_show_status"></a>

You can use the SHOW MODEL command to know when your model is ready.

Use the following query to monitor the progress of the model training.

```
SHOW MODEL model_abalone_ring_prediction;
```

When the model is ready, the output of the previous operation should look similar to the following example. Note that the output provides the `validation:mse` metric, which is the mean square error. You will use the mean square error to validate the accuracy of the model in the next step.

```
+--------------------------+----------------------------------------------------------------------------------------------------+
|        Model Name        |                                   model_abalone_ring_prediction                                    |
+--------------------------+----------------------------------------------------------------------------------------------------+
| Schema Name              | public                                                                                             |
| Owner                    | awsuser                                                                                            |
| Creation Time            | Thu, 30.06.2022 18:00:10                                                                           |
| Model State              | READY                                                                                              |
| validation:mse           |                                                                                           4.168633 |
| Estimated Cost           |                                                                                           4.291608 |
|                          |                                                                                                    |
| TRAINING DATA:           |                                                                                                    |
| Query                    | SELECT SEX , LENGTH , DIAMETER , HEIGHT , WHOLE , SHUCKED , VISCERA , SHELL, RINGS AS TARGET_LABEL |
|                          | FROM ABALONE_TRAINING                                                                              |
| Target Column            | TARGET_LABEL                                                                                       |
|                          |                                                                                                    |
| PARAMETERS:              |                                                                                                    |
| Model Type               | linear_learner                                                                                     |
| Problem Type             | Regression                                                                                         |
| Objective                | MSE                                                                                                |
| AutoML Job Name          | redshiftml-20220630180010947843                                                                    |
| Function Name            | f_abalone_ring_prediction                                                                          |
| Function Parameters      | sex length diameter height whole shucked viscera shell                                             |
| Function Parameter Types | bpchar float8 float8 float8 float8 float8 float8 float8                                            |
| IAM Role                 | default-aws-iam-role                                                                               |
| S3 Bucket                | amzn-s3-demo-bucket                                                                                |
| Max Runtime              |                                                                                              15000 |
+--------------------------+----------------------------------------------------------------------------------------------------+
```

## Step 3: Validate the model
<a name="tutorial_linear_learner_regression_step_validate"></a>

1. The following prediction query validates the accuracy of the model on the `abalone_validation` dataset by calculating mean square error and root mean square error.

   ```
   SELECT
       ROUND(AVG(POWER((tgt_label - predicted), 2)), 2) mse,
       ROUND(SQRT(AVG(POWER((tgt_label - predicted), 2))), 2) rmse
   FROM
       (
           SELECT
               Sex,
               Length,
               Diameter,
               Height,
               Whole,
               Shucked,
               Viscera,
               Shell,
               Rings AS tgt_label,
               f_abalone_ring_prediction(
                   Sex,
                   Length,
                   Diameter,
                   Height,
                   Whole,
                   Shucked,
                   Viscera,
                   Shell
               ) AS predicted,
               CASE
                   WHEN tgt_label = predicted then 1
                   ELSE 0
               END AS match,
               CASE
                   WHEN tgt_label <> predicted then 1
                   ELSE 0
               END AS nonmatch
           FROM
               abalone_validation
       ) t1;
   ```

   The output of the previous query should look like the following example. The value of the mean square error metric should be similar to the `validation:mse` metric shown by the SHOW MODEL operation’s output.

   ```
   +-----+--------------------+
   | mse |        rmse        |
   +-----+--------------------+
   | 5.1 | 2.2600000000000002 |
   +-----+--------------------+
   ```

1. Use the following query to run the EXPLAIN\$1MODEL operation on your prediction function. The operation will return a model explainability report. For more information about the EXPLAIN\$1MODEL operation, see the [EXPLAIN\$1MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html) in the Amazon Redshift Database Developer Guide.

   ```
   SELECT
       EXPLAIN_MODEL ('model_abalone_ring_prediction');
   ```

   The following information is an example of the model explainability report produced by the previous EXPLAIN\$1MODEL operation. The values for each of the inputs are Shapley values. The Shapley values represent the effect each input has on the prediction of your model, with higher-valued inputs having more impact on the prediction. In this example, the higher-valued inputs have more impact on predicting the age of abalone.

   ```
   { 
       "explanations": { 
           "kernel_shap": { 
               "label0": { 
                   "expected_value" :10.290688514709473,
                   "global_shap_values": { 
                       "diameter" :0.6856910187882492,
                       "height" :0.4415323937124035,
                       "length" :0.21507476107609084,
                       "sex" :0.448611774505744,
                       "shell" :1.70426496893776,
                       "shucked" :2.1181392924386994,
                       "viscera" :0.342220754059912,
                       "whole" :0.6711906974084011 
                   } 
               } 
           } 
       },
       "version" :"1.0" 
   };
   ```

1. Use the following query to calculate the percentage of correct predictions that the model makes about abalone that are not yet mature. Abalone that are immature have 10 rings or less, and a correct prediction is accurate to within one ring of the actual number of rings.

   ```
   SELECT
       TRUNC(
           SUM(
               CASE
                   WHEN ROUND(
                       f_abalone_ring_prediction(
                           Sex,
                           Length,
                           Diameter,
                           Height,
                           Whole,
                           Shucked,
                           Viscera,
                           Shell
                       ),
                       0
                   ) BETWEEN Rings - 1
                   AND Rings + 1 THEN 1
                   ELSE 0
               END
           ) / CAST(COUNT(SHELL) AS FLOAT),
           4
       ) AS prediction_pct
   FROM
       abalone_validation
   WHERE
       Rings <= 10;
   ```

## Related topics
<a name="tutorial_linear_learner_regression_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN\$1MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building multi-class classification models with linear learner
<a name="tutorial_linear_learner_multi-class_classification"></a>

In this tutorial, you create a linear learner model with data from Amazon S3, and then run prediction queries with the model using Amazon Redshift ML. The SageMaker AI linear learner algorithm solves either regression or classification problems. To learn more about regression and multi-class classification problems, see [Problem types for the machine learning paradigms](https://docs.aws.amazon.com/sagemaker/latest/dg/algorithms-choose.html#basic-machine-learning-paradigms) in the Amazon SageMaker AI Developer Guide. In this tutorial, you solve a multi-class classification problem. The linear learner algorithm trains many models in parallel, and automatically determines the most optimized model. You use the CREATE MODEL operation in Amazon Redshift, which creates your linear learner model using SageMaker AI and sends the prediction function to Amazon Redshift. For more information about the linear learner algorithm, see the [Linear Learner Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) in the Amazon SageMaker AI Developer Guide.

You can use a CREATE MODEL command to export training data, train a model, import the model, and prepare an Amazon Redshift prediction function. Use the CREATE MODEL operation to specify training data either as a table or a SELECT statement.

Linear learner models optimize either continuous objectives or discrete objectives. Continuous objectives are used for regression, while discrete variables are used for classification. Some methods provide a solution for only continuous objectives, such as a regression method. The linear learner algorithm provides an increase in speed over naive hyperparameter optimization techniques, such as the Naive Bayes technique. A naive optimization technique assumes that each input variable is independent. The linear learner algorithm trains many models in parallel and selects the most optimized model. A similar algorithm is XGBoost, which combines estimates from a set of simpler and weaker models to make predictions. To learn more about XGBoost, see [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) in the Amazon SageMaker AI Developer Guide.

To use the linear learner algorithm, you must provide columns representing the dimensions of the inputs, and rows representing the observations. For more information about the linear learner algorithm, see the [Linear Learner Algorithm ](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html)in the Amazon SageMaker AI Developer Guide. 

In this tutorial, you build a linear learner model that predicts the types of cover for a given area. You use the CREATE MODEL command on the [Covertype dataset](https://archive.ics.uci.edu/ml/datasets/covertype) from the UCI Machine Learning Repository. Then, you use the prediction function created by the command to determine the types of cover in a wilderness area. A forest cover type is usually a type of tree. The inputs that Redshift ML will use to create the model include soil type, distance to roadways, and wilderness area designation. For more information about the dataset, see the [Covertype Dataset](https://archive.ics.uci.edu/ml/datasets/covertype) from the UCI Machine Learning Repository.

## Use case examples
<a name="tutorial_linear_learner_multi-class_classification_tasks"></a>

You can solve other multi-class classification problems with linear learner with Amazon Redshift ML, such as predicting the species of a plant from an image. You could also predict the quantity of a product that a customer will purchase.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Validate the model

## Prerequisites
<a name="tutorial_linear_learner_multi-class_classification_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_linear_learner_multi-class_classification_step_load"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following queries. These queries load the sample data into Redshift and divide the data into a training set and a validation set.

1. The following query creates the `covertype_data` table.

   ```
   CREATE TABLE public.covertype_data (
       elevation bigint ENCODE az64,
       aspect bigint ENCODE az64,
       slope bigint ENCODE az64,
       horizontal_distance_to_hydrology bigint ENCODE az64,
       vertical_distance_to_hydrology bigint ENCODE az64,
       horizontal_distance_to_roadways bigint ENCODE az64,
       hillshade_9am bigint ENCODE az64,
       hillshade_noon bigint ENCODE az64,
       hillshade_3pm bigint ENCODE az64,
       horizontal_distance_to_fire_points bigint ENCODE az64,
       wilderness_area1 bigint ENCODE az64,
       wilderness_area2 bigint ENCODE az64,
       wilderness_area3 bigint ENCODE az64,
       wilderness_area4 bigint ENCODE az64,
       soil_type1 bigint ENCODE az64,
       soil_type2 bigint ENCODE az64,
       soil_type3 bigint ENCODE az64,
       soil_type4 bigint ENCODE az64,
       soil_type5 bigint ENCODE az64,
       soil_type6 bigint ENCODE az64,
       soil_type7 bigint ENCODE az64,
       soil_type8 bigint ENCODE az64,
       soil_type9 bigint ENCODE az64,
       soil_type10 bigint ENCODE az64,
       soil_type11 bigint ENCODE az64,
       soil_type12 bigint ENCODE az64,
       soil_type13 bigint ENCODE az64,
       soil_type14 bigint ENCODE az64,
       soil_type15 bigint ENCODE az64,
       soil_type16 bigint ENCODE az64,
       soil_type17 bigint ENCODE az64,
       soil_type18 bigint ENCODE az64,
       soil_type19 bigint ENCODE az64,
       soil_type20 bigint ENCODE az64,
       soil_type21 bigint ENCODE az64,
       soil_type22 bigint ENCODE az64,
       soil_type23 bigint ENCODE az64,
       soil_type24 bigint ENCODE az64,
       soil_type25 bigint ENCODE az64,
       soil_type26 bigint ENCODE az64,
       soil_type27 bigint ENCODE az64,
       soil_type28 bigint ENCODE az64,
       soil_type29 bigint ENCODE az64,
       soil_type30 bigint ENCODE az64,
       soil_type31 bigint ENCODE az64,
       soil_type32 bigint ENCODE az64,
       soil_type33 bigint ENCODE az64,
       soil_type34 bigint ENCODE az64,
       soil_type35 bigint ENCODE az64,
       soil_type36 bigint ENCODE az64,
       soil_type37 bigint ENCODE az64,
       soil_type38 bigint ENCODE az64,
       soil_type39 bigint ENCODE az64,
       soil_type40 bigint ENCODE az64,
       cover_type bigint ENCODE az64
   ) DISTSTYLE AUTO;
   ```

1. The following query copies the sample data from the [Covertype dataset ](https://archive.ics.uci.edu/ml/datasets/covertype)in Amazon S3 to the `covertype_data` table you created previously in Amazon Redshift.

   ```
   COPY public.covertype_data
   FROM
       's3://redshift-ml-multiclass/covtype.data.gz' IAM_ROLE DEFAULT gzip DELIMITER ',' REGION 'us-east-1';
   ```

1. By manually splitting the data, you will be able to verify the accuracy of the model by allocating an additional testing set. The following query splits the data into three sets. The `covertype_training` table is for training, the `covertype_validation` table is for validation, and the `covertype_test` table is for testing your model. You will use the training set to train your model and the validation set to validate the model’s development. Then, you use the testing set to test the performance of the model and see if the model is overfitting or underfitting the dataset.

   ```
   CREATE TABLE public.covertype_data_prep AS
   SELECT
       a.*,
       CAST (random() * 100 AS int) AS data_group_id
   FROM
       public.covertype_data a;
   
   --training dataset
    CREATE TABLE public.covertype_training as
   SELECT
       *
   FROM
       public.covertype_data_prep
   WHERE
       data_group_id < 80;
   
   --validation dataset
    CREATE TABLE public.covertype_validation AS
   SELECT
       *
   FROM
       public.covertype_data_prep
   WHERE
       data_group_id BETWEEN 80
       AND 89;
   
   --test dataset
    CREATE TABLE public.covertype_test AS
   SELECT
       *
   FROM
       public.covertype_data_prep
   WHERE
       data_group_id > 89;
   ```

## Step 2: Create the machine learning model
<a name="tutorial_linear_learner_multi-class_classification_step_create_model"></a>

In this step, you use the CREATE MODEL statement to create your machine learning model with the linear learner algorithm. 

The following query creates the linear learner model with the CREATE MODEL operation using your S3 bucket. Replace amzn-s3-demo-bucket with your own S3 bucket.

```
CREATE MODEL forest_cover_type_model
FROM
    (
        SELECT
            Elevation,
            Aspect,
            Slope,
            Horizontal_distance_to_hydrology,
            Vertical_distance_to_hydrology,
            Horizontal_distance_to_roadways,
            HIllshade_9am,
            Hillshade_noon,
            Hillshade_3pm,
            Horizontal_Distance_To_Fire_Points,
            Wilderness_Area1,
            Wilderness_Area2,
            Wilderness_Area3,
            Wilderness_Area4,
            soil_type1,
            Soil_Type2,
            Soil_Type3,
            Soil_Type4,
            Soil_Type5,
            Soil_Type6,
            Soil_Type7,
            Soil_Type8,
            Soil_Type9,
            Soil_Type10,
            Soil_Type11,
            Soil_Type12,
            Soil_Type13,
            Soil_Type14,
            Soil_Type15,
            Soil_Type16,
            Soil_Type17,
            Soil_Type18,
            Soil_Type19,
            Soil_Type20,
            Soil_Type21,
            Soil_Type22,
            Soil_Type23,
            Soil_Type24,
            Soil_Type25,
            Soil_Type26,
            Soil_Type27,
            Soil_Type28,
            Soil_Type29,
            Soil_Type30,
            Soil_Type31,
            Soil_Type32,
            Soil_Type33,
            Soil_Type34,
            Soil_Type36,
            Soil_Type37,
            Soil_Type38,
            Soil_Type39,
            Soil_Type40,
            Cover_type
        from
            public.covertype_training
    ) TARGET cover_type FUNCTION predict_cover_type IAM_ROLE default MODEL_TYPE LINEAR_LEARNER PROBLEM_TYPE MULTICLASS_CLASSIFICATION OBJECTIVE 'Accuracy' SETTINGS (
        S3_BUCKET 'amzn-s3-demo-bucket',
        S3_GARBAGE_COLLECT OFF,
        MAX_RUNTIME 15000
    );
```

### Show the status of model training (optional)
<a name="tutorial_linear_learner_multi-class_classification_show_status"></a>

You can use the SHOW MODEL command to know when your model is ready.

Use the following query to monitor the progress of the model training.

```
SHOW MODEL forest_cover_type_model;
```

When the model is ready, the output of the previous operation should look similar to the following example. Note that the output provides the `validation:multiclass_accuracy` metric, which you can view on the righthand side of the following example. Multi-class accuracy measures the percentage of data points that are classified correctly by the model. You will use multi-class accuracy to validate the accuracy of the model in the next step.

```
+--------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|              Key               |                                                                                                                                                                                                                                                                                                                                                                                                             Value                                                                                                                                                                                                                                                                                                                                                                                                              |
+--------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model Name                     | forest_cover_type_model                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Schema Name                    | public                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| Owner                          | awsuser                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Creation Time                  | Tue, 12.07.2022 20:24:32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| Model State                    | READY                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| validation:multiclass_accuracy |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0.724952 |
| Estimated Cost                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       5.341750 |
|                                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| TRAINING DATA:                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Query                          | SELECT ELEVATION, ASPECT, SLOPE, HORIZONTAL_DISTANCE_TO_HYDROLOGY, VERTICAL_DISTANCE_TO_HYDROLOGY, HORIZONTAL_DISTANCE_TO_ROADWAYS, HILLSHADE_9AM, HILLSHADE_NOON, HILLSHADE_3PM , HORIZONTAL_DISTANCE_TO_FIRE_POINTS, WILDERNESS_AREA1, WILDERNESS_AREA2, WILDERNESS_AREA3, WILDERNESS_AREA4, SOIL_TYPE1, SOIL_TYPE2, SOIL_TYPE3, SOIL_TYPE4, SOIL_TYPE5, SOIL_TYPE6, SOIL_TYPE7, SOIL_TYPE8, SOIL_TYPE9, SOIL_TYPE10 , SOIL_TYPE11, SOIL_TYPE12 , SOIL_TYPE13 , SOIL_TYPE14, SOIL_TYPE15, SOIL_TYPE16, SOIL_TYPE17, SOIL_TYPE18, SOIL_TYPE19, SOIL_TYPE20, SOIL_TYPE21, SOIL_TYPE22, SOIL_TYPE23, SOIL_TYPE24, SOIL_TYPE25, SOIL_TYPE26, SOIL_TYPE27, SOIL_TYPE28, SOIL_TYPE29, SOIL_TYPE30, SOIL_TYPE31, SOIL_TYPE32, SOIL_TYPE33, SOIL_TYPE34, SOIL_TYPE36, SOIL_TYPE37, SOIL_TYPE38, SOIL_TYPE39, SOIL_TYPE40, COVER_TYPE |
|                                | FROM PUBLIC.COVERTYPE_TRAINING                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| Target Column                  | COVER_TYPE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|                                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| PARAMETERS:                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Model Type                     | linear_learner                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| Problem Type                   | MulticlassClassification                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| Objective                      | Accuracy                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| AutoML Job Name                | redshiftml-20220712202432187659                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Function Name                  | predict_cover_type                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| Function Parameters            | elevation aspect slope horizontal_distance_to_hydrology vertical_distance_to_hydrology horizontal_distance_to_roadways hillshade_9am hillshade_noon hillshade_3pm horizontal_distance_to_fire_points wilderness_area1 wilderness_area2 wilderness_area3 wilderness_area4 soil_type1 soil_type2 soil_type3 soil_type4 soil_type5 soil_type6 soil_type7 soil_type8 soil_type9 soil_type10 soil_type11 soil_type12 soil_type13 soil_type14 soil_type15 soil_type16 soil_type17 soil_type18 soil_type19 soil_type20 soil_type21 soil_type22 soil_type23 soil_type24 soil_type25 soil_type26 soil_type27 soil_type28 soil_type29 soil_type30 soil_type31 soil_type32 soil_type33 soil_type34 soil_type36 soil_type37 soil_type38 soil_type39 soil_type40                                                                            |
| Function Parameter Types       | int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| IAM Role                       | default-aws-iam-role                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| S3 Bucket                      | amzn-s3-demo-bucket                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| Max Runtime                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          15000 |
+--------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

## Step 3: Validate the model
<a name="tutorial_linear_learner_multi-class_classification_step_validate"></a>

1. The following prediction query validates the accuracy of the model on the `covertype_validation` dataset by calculating multi-class accuracy. Multi-class accuracy is the percentage of the model’s predictions that are correct.

   ```
   SELECT
       CAST(sum(t1.match) AS decimal(7, 2)) AS predicted_matches,
       CAST(sum(t1.nonmatch) AS decimal(7, 2)) AS predicted_non_matches,
       CAST(sum(t1.match + t1.nonmatch) AS decimal(7, 2)) AS total_predictions,
       predicted_matches / total_predictions AS pct_accuracy
   FROM
       (
           SELECT
               Elevation,
               Aspect,
               Slope,
               Horizontal_distance_to_hydrology,
               Vertical_distance_to_hydrology,
               Horizontal_distance_to_roadways,
               HIllshade_9am,
               Hillshade_noon,
               Hillshade_3pm,
               Horizontal_Distance_To_Fire_Points,
               Wilderness_Area1,
               Wilderness_Area2,
               Wilderness_Area3,
               Wilderness_Area4,
               soil_type1,
               Soil_Type2,
               Soil_Type3,
               Soil_Type4,
               Soil_Type5,
               Soil_Type6,
               Soil_Type7,
               Soil_Type8,
               Soil_Type9,
               Soil_Type10,
               Soil_Type11,
               Soil_Type12,
               Soil_Type13,
               Soil_Type14,
               Soil_Type15,
               Soil_Type16,
               Soil_Type17,
               Soil_Type18,
               Soil_Type19,
               Soil_Type20,
               Soil_Type21,
               Soil_Type22,
               Soil_Type23,
               Soil_Type24,
               Soil_Type25,
               Soil_Type26,
               Soil_Type27,
               Soil_Type28,
               Soil_Type29,
               Soil_Type30,
               Soil_Type31,
               Soil_Type32,
               Soil_Type33,
               Soil_Type34,
               Soil_Type36,
               Soil_Type37,
               Soil_Type38,
               Soil_Type39,
               Soil_Type40,
               Cover_type AS actual_cover_type,
               predict_cover_type(
                   Elevation,
                   Aspect,
                   Slope,
                   Horizontal_distance_to_hydrology,
                   Vertical_distance_to_hydrology,
                   Horizontal_distance_to_roadways,
                   HIllshade_9am,
                   Hillshade_noon,
                   Hillshade_3pm,
                   Horizontal_Distance_To_Fire_Points,
                   Wilderness_Area1,
                   Wilderness_Area2,
                   Wilderness_Area3,
                   Wilderness_Area4,
                   soil_type1,
                   Soil_Type2,
                   Soil_Type3,
                   Soil_Type4,
                   Soil_Type5,
                   Soil_Type6,
                   Soil_Type7,
                   Soil_Type8,
                   Soil_Type9,
                   Soil_Type10,
                   Soil_Type11,
                   Soil_Type12,
                   Soil_Type13,
                   Soil_Type14,
                   Soil_Type15,
                   Soil_Type16,
                   Soil_Type17,
                   Soil_Type18,
                   Soil_Type19,
                   Soil_Type20,
                   Soil_Type21,
                   Soil_Type22,
                   Soil_Type23,
                   Soil_Type24,
                   Soil_Type25,
                   Soil_Type26,
                   Soil_Type27,
                   Soil_Type28,
                   Soil_Type29,
                   Soil_Type30,
                   Soil_Type31,
                   Soil_Type32,
                   Soil_Type33,
                   Soil_Type34,
                   Soil_Type36,
                   Soil_Type37,
                   Soil_Type38,
                   Soil_Type39,
                   Soil_Type40
               ) AS predicted_cover_type,
               CASE
                   WHEN actual_cover_type = predicted_cover_type THEN 1
                   ELSE 0
               END AS match,
               CASE
                   WHEN actual_cover_type <> predicted_cover_type THEN 1
                   ELSE 0
               END AS nonmatch
           FROM
               public.covertype_validation
       ) t1;
   ```

   The output of the previous query should look like the following example. The value of the multi-class accuracy metric should be similar to the `validation:multiclass_accuracy` metric shown by the SHOW MODEL operation’s output.

   ```
   +-------------------+-----------------------+-------------------+--------------+
   | predicted_matches | predicted_non_matches | total_predictions | pct_accuracy |
   +-------------------+-----------------------+-------------------+--------------+
   |             41211 |                 16324 |             57535 |   0.71627704 |
   +-------------------+-----------------------+-------------------+--------------+
   ```

1. The following query predicts the most common cover type for `wilderness_area2`. This dataset includes four wilderness areas and seven cover types. A wilderness area can have multiple cover types.

   ```
   SELECT t1. predicted_cover_type, COUNT(*) 
   FROM 
   (     
   SELECT 
      Elevation, 
      Aspect,  
      Slope,  
      Horizontal_distance_to_hydrology,  
      Vertical_distance_to_hydrology,   
      Horizontal_distance_to_roadways,  
      HIllshade_9am,  
      Hillshade_noon,  
      Hillshade_3pm , 
      Horizontal_Distance_To_Fire_Points,  
      Wilderness_Area1,  
      Wilderness_Area2,   
      Wilderness_Area3,  
      Wilderness_Area4,  
      soil_type1,
      Soil_Type2,
      Soil_Type3, 
      Soil_Type4, 
      Soil_Type5,  
      Soil_Type6, 
      Soil_Type7,  
      Soil_Type8, 
      Soil_Type9, 
      Soil_Type10 , 
      Soil_Type11,  
      Soil_Type12 , 
      Soil_Type13 ,
      Soil_Type14, 
      Soil_Type15, 
      Soil_Type16,  
      Soil_Type17, 
      Soil_Type18,  
      Soil_Type19,  
      Soil_Type20,  
      Soil_Type21,  
      Soil_Type22,  
      Soil_Type23,  
      Soil_Type24, 
      Soil_Type25,  
      Soil_Type26, 
      Soil_Type27,  
      Soil_Type28,  
      Soil_Type29,  
      Soil_Type30,  
      Soil_Type31, 
      Soil_Type32, 
      Soil_Type33,  
      Soil_Type34, 
      Soil_Type36, 
      Soil_Type37,  
      Soil_Type38,  
      Soil_Type39, 
      Soil_Type40,
      predict_cover_type( Elevation, 
      Aspect,  
      Slope,  
      Horizontal_distance_to_hydrology,  
      Vertical_distance_to_hydrology,   
      Horizontal_distance_to_roadways,  
      HIllshade_9am,  
      Hillshade_noon,  
      Hillshade_3pm , 
      Horizontal_Distance_To_Fire_Points,  
      Wilderness_Area1,  
      Wilderness_Area2,   
      Wilderness_Area3,  
      Wilderness_Area4,  
      soil_type1,
      Soil_Type2,
      Soil_Type3, 
      Soil_Type4, 
      Soil_Type5,  
      Soil_Type6, 
      Soil_Type7,  
      Soil_Type8, 
      Soil_Type9, 
      Soil_Type10,
      Soil_Type11,  
      Soil_Type12, 
      Soil_Type13,
      Soil_Type14, 
      Soil_Type15, 
      Soil_Type16,  
      Soil_Type17, 
      Soil_Type18,  
      Soil_Type19,  
      Soil_Type20,  
      Soil_Type21,  
      Soil_Type22,  
      Soil_Type23,  
      Soil_Type24, 
      Soil_Type25,  
      Soil_Type26, 
      Soil_Type27,  
      Soil_Type28,  
      Soil_Type29,  
      Soil_Type30,  
      Soil_Type31, 
      Soil_Type32, 
      Soil_Type33,  
      Soil_Type34, 
      Soil_Type36, 
      Soil_Type37,  
      Soil_Type38,  
      Soil_Type39, 
      Soil_Type40) AS predicted_cover_type 
   
   FROM public.covertype_test
   WHERE wilderness_area2 = 1)
   t1
   GROUP BY 1;
   ```

   The output of the previous operation should look similar to the following example. This output means that the model predicted that the majority of cover is cover type 1, and there is some cover of cover types 2 and 7.

   ```
   +----------------------+-------+
   | predicted_cover_type | count |
   +----------------------+-------+
   |                    2 |   564 |
   |                    7 |    97 |
   |                    1 |  2309 |
   +----------------------+-------+
   ```

1. The following query shows the most common cover type in a single wilderness area. The query displays the amount of that cover type and the cover type’s wilderness area.

   ```
   SELECT t1. predicted_cover_type, COUNT(*), wilderness_area 
   FROM 
   (     
   SELECT 
      Elevation, 
      Aspect,  
      Slope,  
      Horizontal_distance_to_hydrology,  
      Vertical_distance_to_hydrology,   
      Horizontal_distance_to_roadways,  
      HIllshade_9am,  
      Hillshade_noon,  
      Hillshade_3pm , 
      Horizontal_Distance_To_Fire_Points,  
      Wilderness_Area1,  
      Wilderness_Area2,   
      Wilderness_Area3,  
      Wilderness_Area4,  
      soil_type1,
      Soil_Type2,
      Soil_Type3, 
      Soil_Type4, 
      Soil_Type5,  
      Soil_Type6, 
      Soil_Type7,  
      Soil_Type8, 
      Soil_Type9, 
      Soil_Type10 , 
      Soil_Type11,  
      Soil_Type12 , 
      Soil_Type13 ,
      Soil_Type14, 
      Soil_Type15, 
      Soil_Type16,  
      Soil_Type17, 
      Soil_Type18,  
      Soil_Type19,  
      Soil_Type20,  
      Soil_Type21,  
      Soil_Type22,  
      Soil_Type23,  
      Soil_Type24, 
      Soil_Type25,  
      Soil_Type26, 
      Soil_Type27,  
      Soil_Type28,  
      Soil_Type29,  
      Soil_Type30,  
      Soil_Type31, 
      Soil_Type32, 
      Soil_Type33,  
      Soil_Type34, 
      Soil_Type36, 
      Soil_Type37,  
      Soil_Type38,  
      Soil_Type39, 
      Soil_Type40,
      predict_cover_type( Elevation, 
      Aspect,  
      Slope,  
      Horizontal_distance_to_hydrology,  
      Vertical_distance_to_hydrology,   
      Horizontal_distance_to_roadways,  
      HIllshade_9am,  
      Hillshade_noon,  
      Hillshade_3pm , 
      Horizontal_Distance_To_Fire_Points,  
      Wilderness_Area1,  
      Wilderness_Area2,   
      Wilderness_Area3,  
      Wilderness_Area4,  
      soil_type1,
      Soil_Type2,
      Soil_Type3, 
      Soil_Type4, 
      Soil_Type5,  
      Soil_Type6, 
      Soil_Type7,  
      Soil_Type8, 
      Soil_Type9, 
      Soil_Type10,
      Soil_Type11,  
      Soil_Type12, 
      Soil_Type13,
      Soil_Type14, 
      Soil_Type15, 
      Soil_Type16,  
      Soil_Type17, 
      Soil_Type18,  
      Soil_Type19,  
      Soil_Type20,  
      Soil_Type21,  
      Soil_Type22,  
      Soil_Type23,  
      Soil_Type24, 
      Soil_Type25,  
      Soil_Type26, 
      Soil_Type27,  
      Soil_Type28,  
      Soil_Type29,  
      Soil_Type30,  
      Soil_Type31, 
      Soil_Type32, 
      Soil_Type33,  
      Soil_Type34, 
      Soil_Type36, 
      Soil_Type37,  
      Soil_Type38,  
      Soil_Type39, 
      Soil_Type40) AS predicted_cover_type,
      CASE WHEN Wilderness_Area1 = 1 THEN 1
           WHEN Wilderness_Area2 = 1 THEN 2
           WHEN Wilderness_Area3 = 1 THEN 3
           WHEN Wilderness_Area4 = 1 THEN 4
           ELSE 0
      END AS wilderness_area
   
   FROM public.covertype_test)
   t1
   GROUP BY 1, 3
   ORDER BY 2 DESC
   LIMIT 1;
   ```

   The output of the previous operation should look similar to the following example.

   ```
   +----------------------+-------+-----------------+
   | predicted_cover_type | count | wilderness_area |
   +----------------------+-------+-----------------+
   |                    2 | 15738 |               1 |
   +----------------------+-------+-----------------+
   ```

## Related topics
<a name="tutorial_linear_learner_multi-class_classification_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN\$1MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)