

 Amazon Redshift will no longer support the creation of new Python UDFs starting Patch 198. Existing Python UDFs will continue to function until June 30, 2026. For more information, see the [ blog post ](https://aws.amazon.com/blogs/big-data/amazon-redshift-python-user-defined-functions-will-reach-end-of-support-after-june-30-2026/). 

# Machine learning
<a name="machine_learning"></a>

Amazon Redshift machine learning (Amazon Redshift ML) is a robust, cloud-based service that makes it easier for analysts and data scientists of all skill levels to use machine learning technology. Amazon Redshift ML uses a model to generate results. You can use models in the following ways:
+ You can provide the data that you want to train a model, and metadata associated with data inputs to Amazon Redshift. Then Amazon Redshift ML creates models in Amazon SageMaker AI that capture patterns in the input data. By using your own data for the model, you can use Amazon Redshift ML to identify trends in the data, such as churn prediction, customer lifetime value, or revenue prediction. You can use these models to generate predictions for new input data without incurring additional costs.
+ You can use one of the foundation models (FMs) provided by Amazon Bedrock, such as Claude or Amazon Titan. Using Amazon Bedrock, you can combine the power of large language models (LLMs) with your analytics data in Amazon Redshift in a few steps. By using an external large language model (LLM), you can use Amazon Redshift to perform natural language processing (NLP) on your data. You can use NLP for applications such as text generation, sentiment analysis, or translation. For information about using Amazon Bedrock with Amazon Redshift, see [Amazon Redshift ML integration with Amazon Bedrock](machine-learning-br.md).
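
For example, a Bedrock-backed model can be created and invoked directly from SQL. The following is a minimal sketch, assuming your IAM role can invoke the chosen Bedrock model in your Region; the model ID, function name, and table are illustrative, not prescriptive.

```
-- Minimal sketch: define a Bedrock-backed model (model ID and names are illustrative).
CREATE EXTERNAL MODEL claude_summarizer
FUNCTION claude_summarize
IAM_ROLE default
MODEL_TYPE BEDROCK
SETTINGS (
    MODEL_ID 'anthropic.claude-v2:1',
    PROMPT 'Summarize the following support ticket:');

-- Invoke the LLM on a hypothetical table of support tickets.
SELECT ticket_id, claude_summarize(ticket_text) AS summary
FROM support_tickets
LIMIT 10;
```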

**Note**  
**Opting out of using your data for service improvement**  
If you are using Amazon Bedrock models, we encourage you to read the AWS policies about how the Amazon Bedrock service handles your data. You should determine if you need to use an opt-out policy to prevent the service from using your data for model or service improvements, should Amazon Bedrock implement such functionality in the future. To ensure that the service doesn't use your data for such purposes, use the general AWS opt-out policy.   
For more information, see the following:  
[AI services opt-out policies](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_ai-opt-out.html)
[Amazon Bedrock FAQs](https://aws.amazon.com/bedrock/faqs/)

**Note**  
 LLMs can generate inaccurate or incomplete information. We recommend verifying the information that LLMs produce to ensure that it is accurate and complete.

**How Amazon Redshift ML works with Amazon SageMaker AI**

Amazon Redshift works with Amazon SageMaker AI Autopilot to automatically obtain the best model and make the prediction function available in Amazon Redshift.

The following diagram illustrates how Amazon Redshift ML works.

![\[Workflow for Amazon Redshift ML integrating with Amazon SageMaker AI Autopilot.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/machine_learning_overview.png)


The general workflow is as follows:

1. Amazon Redshift exports the training data into Amazon S3. 

1. Amazon SageMaker AI Autopilot preprocesses the training data. *Preprocessing* performs important functions, such as imputing missing values. It recognizes that certain columns are categorical (such as the postal code), properly formats them for training, and performs numerous other tasks. Choosing the best preprocessors to apply on the training dataset is a problem in itself, and Amazon SageMaker AI Autopilot automates its solution.

1. Amazon SageMaker AI Autopilot finds the algorithm and algorithm hyperparameters that deliver the model with the most accurate predictions.

1. Amazon Redshift registers the prediction function as a SQL function in your Amazon Redshift cluster.

1. When you run CREATE MODEL statements, Amazon Redshift uses Amazon SageMaker AI for training. Therefore, there is an associated cost for training your model. This appears as a separate line item for Amazon SageMaker AI in your AWS bill. You also pay for the Amazon S3 storage used to stage your training data. There is no charge for inference using models created with CREATE MODEL that are compiled and run on your Redshift cluster, and there are no additional Amazon Redshift charges for using Amazon Redshift ML.
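
In SQL terms, this workflow reduces to two statements: a CREATE MODEL that exports the training data and starts the SageMaker AI Autopilot job, and a SELECT that calls the resulting prediction function. The following is a minimal sketch; the table, columns, function, and bucket names are illustrative.

```
-- Train: export the query results to Amazon S3 and start a SageMaker AI Autopilot job.
CREATE MODEL example_churn_model
FROM (SELECT age, postal_code, monthly_spend, churn FROM customer_history)
TARGET churn
FUNCTION predict_example_churn
IAM_ROLE default
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');

-- Predict: after the model is ready, call the registered SQL function locally on the cluster.
SELECT customer_id, predict_example_churn(age, postal_code, monthly_spend) AS churn_prediction
FROM current_customers;
```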

**Topics**
+ [Machine learning overview](machine_learning_overview.md)
+ [Machine learning for novices and experts](novice_expert.md)
+ [Costs for using Amazon Redshift ML](cost.md)
+ [Getting started with Amazon Redshift ML](getting-started-machine-learning.md)
+ [Tutorials for Amazon Redshift ML](tutorials_for_amazon_redshift_ml.md)
+ [Amazon Redshift ML integration with Amazon Bedrock](machine-learning-br.md)

# Machine learning overview
<a name="machine_learning_overview"></a>

With Amazon Redshift, you can use machine learning capabilities to gain valuable insights from your data. This machine learning (ML) overview shows you how to explore, visualize, and prepare your data for ML model training and deployment. The following sections guide you through the process of using Amazon Redshift ML to unlock the potential of your data through machine learning.

By using Amazon Redshift ML, you can train machine learning models using SQL statements and invoke them in SQL queries for prediction.

To help you learn how to use Amazon Redshift ML, you can watch the following video. 

[![AWS Videos](https://img.youtube.com/vi/pJF2kYGtO4A/0.jpg)](https://www.youtube.com/watch?v=pJF2kYGtO4A)


For information about the prerequisites for setting up your Redshift cluster or Serverless workgroup, permissions, and ownership for using Amazon Redshift ML, read the following sections. These sections also describe how simple training and predictions work in Amazon Redshift ML.

## How machine learning can solve a problem
<a name="solve_problem"></a>

A machine learning model generates predictions by finding patterns in your training data and then applying these patterns to new data. In machine learning, you train these models by learning the patterns that best explain your data. Then you use the models to make predictions (also called inferences) on new data. Machine learning is typically an iterative process where you can continue to improve the accuracy of the predictions by changing parameters and improving your training data. If your data changes, you retrain models with the new dataset.

To address various business goals, there are different fundamental machine learning approaches.

### Supervised learning in Amazon Redshift ML
<a name="supervised_learning"></a>

Amazon Redshift supports supervised learning, which is the most common approach to advanced enterprise analytics. Supervised learning is the preferred machine learning approach when you have an established set of data and an understanding of how specific input data predicts various business outcomes. These outcomes are sometimes called labels. In particular, your dataset is a table with attributes that comprise features (inputs) and targets (outputs). For example, suppose that you have a table that provides the age and postal code for past and present customers. Suppose that you also have a field “active” that is true for present customers and false for customers who have suspended their membership. The goal of supervised machine learning is to spot the patterns of age and postal code leading to customer churn, as represented by customers whose targets are “False.” You can use this model to predict customers who are likely to churn, such as suspending their membership, and potentially offer retention incentives.

Amazon Redshift supports supervised learning that includes regression, binary classification, and multiclass classification. Regression refers to the problem of predicting continuous values, such as the total spending of customers. Binary classification refers to the problem of predicting one of two outcomes, such as predicting whether a customer churns or not. Multiclass classification refers to the problem of predicting one of many outcomes, such as predicting the item a customer might be interested in. Data analysts and data scientists can use supervised learning to tackle problems such as forecasting, personalization, and customer churn prediction. You can also use supervised learning for problems such as predicting which sales will close, revenue prediction, fraud detection, and customer lifetime value prediction.

### Unsupervised learning in Amazon Redshift ML
<a name="unsupervised_learning"></a>

Unsupervised learning uses machine learning algorithms to analyze and group unlabeled training data. The algorithms discover hidden patterns or groupings. The goal is to model the underlying structure or distribution in the data to learn more about the data.

Amazon Redshift supports the K-Means clustering algorithm to solve an unsupervised learning problem. This algorithm solves clustering problems where you want to discover groupings in the data. The K-Means algorithm attempts to find discrete groupings within the data. Unclassified data is grouped and partitioned based on its similarities and differences. By grouping, the K-Means algorithm iteratively determines the best centroids and assigns each member to the closest centroid. Members nearest the same centroid belong to the same group. Members of a group are as similar as possible to other members in the same group, and as different as possible from members of other groups. For example, the K-Means clustering algorithm can be used to classify cities impacted by a pandemic or classify cities based on the popularity of consumer products.

When using the K-Means algorithm, you specify an input `k` that specifies the number of clusters to find in the data. The output of this algorithm is a set of k centroids. Each data point belongs to one of the k clusters that is closest to it. Each cluster is described by its centroid. The centroid can be thought of as the multi-dimensional average of the cluster. The K-Means algorithm compares the distances to see how different the clusters are from each other. A larger distance generally indicates a greater difference between the clusters.

Preprocessing the data is important for K-Means, as it ensures that the features of the model stay on the same scale and produce reliable results. Amazon Redshift supports some K-Means preprocessors for the CREATE MODEL statement, such as StandardScaler, MinMax, and NumericPassthrough. If you don't want to apply any preprocessing for K-means, choose NumericPassthrough explicitly as a transformer. For more information about K-Means parameters, see [CREATE MODEL with K-MEANS parameters](r_create_model_use_cases.md#r_k-means-create-model-parameters).
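
The following is a minimal sketch of an AUTO OFF K-Means model that applies one of the supported preprocessors explicitly; the table, columns, cluster count, and bucket name are illustrative.

```
-- Minimal sketch: K-Means clustering with an explicit StandardScaler preprocessor (names are illustrative).
CREATE MODEL city_clusters
FROM (SELECT latitude, longitude, median_income FROM city_stats)
FUNCTION get_city_cluster
IAM_ROLE default
AUTO OFF
MODEL_TYPE KMEANS
PREPROCESSORS '[{"ColumnSet": ["latitude", "longitude", "median_income"], "Transformers": ["StandardScaler"]}]'
HYPERPARAMETERS DEFAULT EXCEPT (K '5')
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');

-- Assign each row to one of the 5 clusters.
SELECT city_name, get_city_cluster(latitude, longitude, median_income) AS cluster_id
FROM city_stats;
```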

To help you learn how to perform unsupervised training with K-Means clustering, you can watch the following video. 

[![AWS Videos](https://img.youtube.com/vi/TFKgl5d0U_0/0.jpg)](https://www.youtube.com/watch?v=TFKgl5d0U_0)


## Terms and concepts for Amazon Redshift ML
<a name="terminology"></a>

The following terms are used to describe some Amazon Redshift ML concepts:
+ *Machine learning* in Amazon Redshift trains a model with one SQL command. Amazon Redshift ML and Amazon SageMaker AI manage all the data conversions, permissions, resource usage, and discovery of the proper model. 
+ *Training* is the phase when Amazon Redshift creates a machine learning model by running a specified subset of data into the model. Amazon Redshift automatically launches a training job in Amazon SageMaker AI and generates a model. 
+ *Prediction* (also called *inference*) is the use of the model in Amazon Redshift SQL queries to predict outcomes. At inference time, Amazon Redshift uses a model-based prediction function as part of a larger query to produce predictions. The predictions are computed locally, at the Redshift cluster, thus providing high throughput, low latency, and zero additional cost.
+ With *bring your own model (BYOM)*, you can use a model trained outside of Amazon Redshift with Amazon SageMaker AI for in-database inference locally in Amazon Redshift. Amazon Redshift ML supports using BYOM in local inference.
+ *Local inference* is used when models are pretrained in Amazon SageMaker AI, compiled by Amazon SageMaker AI Neo, and localized in Amazon Redshift ML. To import models that are supported for local inference to Amazon Redshift, use the CREATE MODEL command. Amazon Redshift imports the pretrained SageMaker AI models by calling Amazon SageMaker AI Neo. You compile the model there and import the compiled model into Amazon Redshift. Use local inference for faster speed and lower costs.
+ *Remote inference* is used when Amazon Redshift invokes a model endpoint deployed in SageMaker AI. Remote inference provides the flexibility to invoke all types of custom models and deep learning models, such as TensorFlow models that you built and deployed in Amazon SageMaker AI.

Also important are the following:
+ *Amazon SageMaker AI* is a fully managed machine learning service. With Amazon SageMaker AI, data scientists and developers can easily build, train, and directly deploy models into a production-ready hosted environment. For information about Amazon SageMaker AI, see [What is Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html) in the *Amazon SageMaker AI Developer Guide*.
+ *Amazon SageMaker AI Autopilot* is a feature set that automatically trains and tunes the best machine learning models for classification or regression, based on your data. You maintain full control and visibility. Amazon SageMaker AI Autopilot supports input data in tabular format. Amazon SageMaker AI Autopilot provides automatic data cleaning and preprocessing, automatic algorithm selection for linear regression, binary classification, and multiclass classification. It also supports automatic hyperparameter optimization (HPO), distributed training, automatic instance, and cluster size selection. For information about Amazon SageMaker AI Autopilot, see [Automate model development with Amazon SageMaker AI Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html) in the *Amazon SageMaker AI Developer Guide*.
+ *Amazon Bedrock* is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon via a single API, along with a broad set of capabilities needed to build generative AI applications.

# Machine learning for novices and experts
<a name="novice_expert"></a>

With Amazon Redshift, you can use machine learning (ML) capabilities to gain insights from your data, whether you are a novice or an expert in ML. Amazon Redshift ML enables you to create, train, and deploy ML models using SQL commands, without the need for extensive ML expertise or complex data engineering.

The following sections guide you through the process of utilizing Machine Learning, empowering you to unlock the full potential of your data with Amazon Redshift.

Amazon Redshift ML enables you to train models with one single SQL CREATE MODEL command. The CREATE MODEL command creates a model that Amazon Redshift uses to generate model-based predictions with familiar SQL constructs.

Amazon Redshift ML is especially useful when you don't have expertise in machine learning, tools, languages, algorithms, and APIs. With Amazon Redshift ML, you don't have to perform the undifferentiated heavy lifting required for integrating with an external machine learning service. Amazon Redshift saves you the time to format and move data, manage permission controls, or build custom integrations, workflows, and scripts. You can easily use popular machine learning algorithms and simplify training needs that require frequent iteration from training to prediction. Amazon Redshift automatically discovers the best algorithm and tunes the best model for your problem. You can make predictions from within the Amazon Redshift cluster without the need to move data out of Amazon Redshift nor to interface with and pay for another service.

Amazon Redshift ML supports data analysts and data scientists in using machine learning. It also makes it possible for machine learning experts to use their knowledge to guide the CREATE MODEL statement to use only the aspects that they specify. By doing so, you can speed up the time that CREATE MODEL needs to find the best candidate, improve the accuracy of the model, or both. 

The CREATE MODEL statement offers flexibility in how you can specify the parameters of the training job. Using this flexibility, both machine learning novices and experts can choose their preferred preprocessors, algorithms, problem types, and hyperparameters. For example, a user interested in customer churn might specify for the CREATE MODEL statement that the problem type is a binary classification, which works well for customer churn. Then the CREATE MODEL statement narrows down its search for the best model to binary classification models. Even with the user choice of the problem type, there are still many options that the CREATE MODEL statement can work with. For example, the CREATE MODEL statement discovers and applies the best preprocessing transformations and discovers the best hyperparameter settings.
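
For example, the churn scenario described above might guide training as follows. This is a sketch only; the model, function, and bucket names are placeholders, and `customer_activity` is the sample table used later in this guide.

```
-- Minimal sketch: guide CREATE MODEL with a problem type and objective (names are placeholders).
CREATE MODEL guided_churn_model
FROM customer_activity
TARGET churn
FUNCTION predict_churn_guided
IAM_ROLE default
PROBLEM_TYPE BINARY_CLASSIFICATION  -- restrict the search to binary classification models
OBJECTIVE 'F1'                      -- optimize F1 rather than raw accuracy
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');
```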

Amazon Redshift ML makes training easier by automatically finding the best model using Amazon SageMaker AI Autopilot. Behind the scenes, Amazon SageMaker AI Autopilot automatically trains and tunes the best machine learning model based on your supplied data. Amazon SageMaker AI Neo then compiles the training model and makes it available for prediction in your Redshift cluster. When you run a machine learning inference query using a trained model, the query can use the massively parallel processing capabilities of Amazon Redshift. At the same time, the query can use machine learning–based prediction. 
+ As a *machine learning beginner*, with general knowledge of different aspects of machine learning such as preprocessors, algorithms, and hyperparameters, use the CREATE MODEL statement for only the aspects that you specify. Then you can shorten the time that CREATE MODEL needs to find the best candidate or improve the accuracy of the model. Also, you can increase the business value of the predictions by introducing additional domain knowledge such as the problem type or the objective. For example, in a customer churn scenario, if the outcome “customer is not active” is rare, then the F1 objective is often preferred to the Accuracy objective. Because high Accuracy models might predict “customer is active” all the time, this results in high accuracy but little business value. For information about F1 objectives, see [AutoMLJobObjective](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_AutoMLJobObjective.html) in the *Amazon SageMaker AI API Reference*.

  For more information about the basic options for the CREATE MODEL statement, see [Simple CREATE MODEL](r_create_model_use_cases.md#r_simple_create_model).
+ As a *machine learning advanced practitioner*, you can specify the problem type and preprocessors for certain (but not all) features. Then CREATE MODEL follows your suggestions on the specified aspects. At the same time, CREATE MODEL still discovers the best preprocessors for the remaining features and the best hyperparameters. For more information about how you can constrain one or more aspects of the training pipeline, see [CREATE MODEL with user guidance](r_create_model_use_cases.md#r_user_guidance_create_model).
+ As a *machine learning expert*, you can take full control of training and hyperparameter tuning. Then the CREATE MODEL statement doesn't attempt to discover the optimal preprocessors, algorithms, and hyperparameters because you make all the choices. For more information about how to use CREATE MODEL with AUTO OFF, see [CREATE XGBoost models with AUTO OFF](r_create_model_use_cases.md#r_auto_off_create_model).
+ As a *data engineer*, you can bring a pretrained XGBoost model in Amazon SageMaker AI and import it into Amazon Redshift for local inference. With bring your own model (BYOM), you can use a model trained outside of Amazon Redshift with Amazon SageMaker AI for in-database inference locally in Amazon Redshift. Amazon Redshift ML supports using BYOM in either local or remote inference.

  For more information about how to use the CREATE MODEL statement for local or remote inference, see [Bring your own model (BYOM) - local inference](r_create_model_use_cases.md#r_byom_create_model). 
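
The following is a minimal sketch of the local-inference BYOM pattern; the SageMaker AI training job name, the function signature, and the bucket name are placeholders.

```
-- Minimal sketch: import a model pretrained in SageMaker AI for local inference (names are placeholders).
CREATE MODEL imported_churn_model
FROM 'sagemaker-training-job-name'            -- SageMaker AI training job name (or model ARN)
FUNCTION predict_churn_local(varchar, int, float, float)
RETURNS varchar
IAM_ROLE default
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');
```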

As an Amazon Redshift ML user, you can choose any of the following options to train and deploy your model:
+ Problem types, see [CREATE MODEL with user guidance](r_create_model_use_cases.md#r_user_guidance_create_model).
+ Objectives, see [CREATE MODEL with user guidance](r_create_model_use_cases.md#r_user_guidance_create_model) or [CREATE XGBoost models with AUTO OFF](r_create_model_use_cases.md#r_auto_off_create_model).
+ Model types, see [CREATE XGBoost models with AUTO OFF](r_create_model_use_cases.md#r_auto_off_create_model).
+ Preprocessors, see [CREATE MODEL with user guidance](r_create_model_use_cases.md#r_user_guidance_create_model).
+ Hyperparameters, see [CREATE XGBoost models with AUTO OFF](r_create_model_use_cases.md#r_auto_off_create_model).
+ Bring your own model (BYOM), see [Bring your own model (BYOM) - local inference](r_create_model_use_cases.md#r_byom_create_model).

# Costs for using Amazon Redshift ML
<a name="cost"></a>

With Amazon Redshift, you can leverage machine learning capabilities to gain insights from your data without the need for extensive data engineering or machine learning expertise. The following sections describe the costs associated with using Amazon Redshift ML, helping you plan and optimize your expenses while leveraging this powerful machine learning integration.

## Costs for using Amazon Redshift ML with SageMaker AI
<a name="cost_sm"></a>

Amazon Redshift ML for SageMaker AI uses your existing cluster resources for prediction so you can avoid additional Amazon Redshift charges. There is no additional Amazon Redshift charge for creating or using a model. Prediction happens locally in your Redshift cluster, so you don't have to pay extra unless you need to resize your cluster. Amazon Redshift ML uses Amazon SageMaker AI for training your model, which does have an additional associated cost.

There is no additional charge for prediction functions that run within your Amazon Redshift cluster. The CREATE MODEL statement uses Amazon SageMaker AI and incurs an additional cost. The cost increases with the number of cells in your training data. The number of cells is the product of the number of records (in the training query or table) and the number of columns. For example, when the SELECT query of the CREATE MODEL statement produces 10,000 records and 5 columns, the number of cells is 50,000.

In some cases, the training data produced by the SELECT query of the CREATE MODEL exceeds the MAX\_CELLS limit that you provided (or the default 1 million if you didn't provide a limit). In these cases, CREATE MODEL randomly chooses approximately MAX\_CELLS divided by "number of columns" records from the training dataset. CREATE MODEL then performs training using these randomly chosen tuples. The random sampling ensures that the reduced training dataset doesn't have any bias. Thus, by setting MAX\_CELLS, you can control your training costs.

When using the CREATE MODEL statement, you can use the MAX\_CELLS and MAX\_RUNTIME options to control the costs, time, and potential model accuracy. 
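
For example, both options can be set in the SETTINGS clause of CREATE MODEL. The values below are arbitrary, and the other names are placeholders.

```
-- Minimal sketch: cap training cost and time (values and names are illustrative).
CREATE MODEL cost_capped_model
FROM (SELECT age, postal_code, monthly_spend, churn FROM customer_history)
TARGET churn
FUNCTION predict_churn_capped
IAM_ROLE default
SETTINGS (
    S3_BUCKET 'amzn-s3-demo-bucket',
    MAX_CELLS 500000,   -- sample the training data down to at most about 500,000 cells
    MAX_RUNTIME 3600    -- allow SageMaker AI at most one hour of training time
);
```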

MAX\_RUNTIME specifies the maximum amount of time the training can take in SageMaker AI when the AUTO ON or OFF option is used. Training jobs often complete sooner than MAX\_RUNTIME, depending on the size of the dataset. After a model is trained, Amazon Redshift does additional work in the background to compile and install your models in your cluster. Thus, CREATE MODEL can take longer than MAX\_RUNTIME to complete. However, MAX\_RUNTIME limits the amount of computation and time used in SageMaker AI to train your model. You can check the status of your model at any time using SHOW MODEL.

When you run CREATE MODEL with AUTO ON, Amazon Redshift ML uses SageMaker AI Autopilot to automatically and intelligently explore different models (or candidates) to find the best one. MAX\_RUNTIME limits the amount of time and computation spent. If MAX\_RUNTIME is set too low, there might not be enough time to explore even one candidate. If you see the error "Autopilot candidate has no models," rerun the CREATE MODEL with a larger MAX\_RUNTIME value. For more information about this parameter, see [MaxAutoMLJobRuntimeInSeconds](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AutoMLJobCompletionCriteria.html) in the *Amazon SageMaker AI API Reference*.

When you run CREATE MODEL with AUTO OFF, MAX\_RUNTIME corresponds to a limit on how long the training job is run in SageMaker AI. Training jobs often complete sooner, depending on the size of the dataset and other parameters used, such as num\_rounds in MODEL\_TYPE XGBOOST.

You can also control costs or reduce training time by specifying a smaller MAX\_CELLS value when you run CREATE MODEL. A *cell* is an entry in the database. Each row corresponds to as many cells as there are columns, which can be of fixed or varying width. MAX\_CELLS limits the number of cells, and thus the number of training examples used to train your model. By default, MAX\_CELLS is set to 1 million cells. Reducing MAX\_CELLS reduces the number of rows from the result of the SELECT query in CREATE MODEL that Amazon Redshift exports and sends to SageMaker AI to train a model. Reducing MAX\_CELLS thus reduces the size of the dataset used to train models both with AUTO ON and AUTO OFF. This approach helps reduce the costs and time to train models. To see information about training and billing times of a specific training job, choose **Training jobs** in Amazon SageMaker AI.

Increasing MAX\_RUNTIME and MAX\_CELLS often improves model quality by allowing SageMaker AI to explore more candidates. This way, SageMaker AI can take more time to train each candidate and use more data to train better models. If you want faster iteration or exploration of your dataset, use lower MAX\_RUNTIME and MAX\_CELLS. If you want improved accuracy of models, use higher MAX\_RUNTIME and MAX\_CELLS.

For more information about costs associated with various cell numbers and free trial details, see [Amazon Redshift pricing](https://aws.amazon.com/redshift/pricing).

## Costs for using Amazon Redshift ML with Amazon Bedrock
<a name="cost_sm"></a>

Using Amazon Redshift ML with Amazon Bedrock incurs additional costs. For more information, see [Amazon Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/).

# Getting started with Amazon Redshift ML
<a name="getting-started-machine-learning"></a>

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy machine learning models using familiar SQL commands. With Amazon Redshift ML, you can use your data in your Redshift cluster to train models with Amazon SageMaker AI. Later, the models are localized and predictions can be made within an Amazon Redshift database. Amazon Redshift ML currently supports the following machine learning algorithms: XGBoost (AUTO ON and OFF), multilayer perceptron (AUTO ON), K-Means (AUTO OFF), and Linear Learner.

**Topics**
+ [Cluster and configure setup for Amazon Redshift ML administration](#admin-setup)
+ [Using model explainability with Amazon Redshift ML](#clarify)
+ [Amazon Redshift ML probability metrics](#probability_metrics)

## Cluster and configure setup for Amazon Redshift ML administration
<a name="admin-setup"></a>

Before you work with Amazon Redshift ML, complete the cluster setup and configure permissions for using Amazon Redshift ML.

### Cluster setup for using Amazon Redshift ML
<a name="cluster-setup"></a>

Before you work with Amazon Redshift ML, complete the following prerequisites.

As an Amazon Redshift administrator, do the following one-time setup for using Amazon Redshift provisioned clusters. For using Amazon Redshift ML with Amazon Redshift Serverless, see [Getting started with Amazon Redshift Serverless](https://docs.aws.amazon.com/redshift/latest/gsg/new-user-serverless.html).

**To perform one-time cluster setup for Amazon Redshift ML**

1. Create a Redshift cluster using the AWS Management Console or the AWS Command Line Interface (AWS CLI). Make sure to attach the AWS Identity and Access Management (IAM) policy while creating the cluster. For more information about permissions required to use Amazon Redshift ML with Amazon SageMaker AI, see [Permissions required to use Amazon Redshift machine learning (ML)](https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-iam-access-control-identity-based.html#iam-permission-ml).

1. Create the IAM role required for using Amazon Redshift ML in one of the following ways:
   + To use SageMaker AI with Amazon Redshift ML, create an IAM role with `AmazonS3FullAccess` and `AmazonSageMakerFullAccess` policies. If you plan to also create Forecast models, attach the `AmazonForecastFullAccess` policy to your role as well.
   + To use Amazon Bedrock with Amazon Redshift ML, create an IAM role with `AmazonS3FullAccess` and `AmazonBedrockFullAccess` policies.
   + We recommend that you create an IAM role through the Amazon Redshift console that has the `AmazonRedshiftAllCommandsFullAccess` policy with permissions to run SQL commands, such as CREATE MODEL. Amazon Redshift uses a seamless API-based mechanism to programmatically create IAM roles in your AWS account on your behalf. Amazon Redshift automatically attaches existing AWS managed policies to the IAM role. This approach means that you can stay within the Amazon Redshift console and don't have to switch to the IAM console for role creation. For more information, see [Creating an IAM role as default for Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/mgmt/default-iam-role.html).

     When an IAM role is created as the default for your cluster, include `redshift` as part of the resource name or use a Redshift-specific tag to tag those resources.

     To use Amazon Bedrock foundation models, add the following section:

     ```
     // Required section if you use Bedrock models.
     {
        "Effect": "Allow",
        "Action": "bedrock:InvokeModel",
        "Resource": [
            "arn:aws:bedrock:<region>::foundation-model/*"
        ]
     }
     ```
   + If you want to create an IAM role with a more restrictive policy, you can use the following policy. You can also modify this policy to meet your needs.

     The Amazon S3 bucket `redshift-downloads/redshift-ml/` is the location where the sample data used for other steps and examples is stored. You can remove it if you don't need to load data from Amazon S3. Or, replace it with other Amazon S3 buckets that you use to load data into Amazon Redshift. 

     The *`your-account-id`*, *`your-role`*, and `amzn-s3-demo-bucket` values are the ones that you specify as part of your CREATE MODEL command.

     (Optional) Use the AWS KMS keys section of the sample policy if you specify an AWS KMS key while using Amazon Redshift ML. The *`your-kms-key`* value is the key that you use as part of your CREATE MODEL command.

------
#### [ JSON ]

****  

     ```
     {
         "Version":"2012-10-17",		 	 	 
         "Statement": [
             {
                 "Effect": "Allow",
                 "Action": [
                     "cloudwatch:PutMetricData",
                     "ecr:BatchCheckLayerAvailability",
                     "ecr:BatchGetImage",
                     "ecr:GetAuthorizationToken",
                     "ecr:GetDownloadUrlForLayer",
                     "logs:CreateLogGroup",
                     "logs:CreateLogStream",
                     "logs:DescribeLogStreams",
                     "logs:PutLogEvents",
                     "sagemaker:*Job*",
                     "sagemaker:AddTags",
                     "sagemaker:CreateModel",
                     "sagemaker:CreateEndpoint",
                     "sagemaker:CreateEndpointConfig",
                     "sagemaker:DeleteEndpoint",
                     "sagemaker:DeleteEndpointConfig",
                     "sagemaker:DeleteModel"
                 ],
                 "Resource": "*"
             },
             {
                 "Effect": "Allow",
                 "Action": [
                     "iam:PassRole",
                     "s3:AbortMultipartUpload",
                     "s3:GetObject",
                     "s3:DeleteObject",
                     "s3:PutObject"
                 ],
                 "Resource": [
                     "arn:aws:iam::111122223333:role/<your-role>",
                     "arn:aws:s3:::amzn-s3-demo-bucket/*",
                     "arn:aws:s3:::redshift-downloads/*"
                 ]
             },
             {
                 "Effect": "Allow",
                 "Action": [
                     "s3:GetBucketLocation",
                     "s3:ListBucket"
                 ],
                 "Resource": [
                     "arn:aws:s3:::amzn-s3-demo-bucket",
                     "arn:aws:s3:::redshift-downloads"
                 ]
             },
             {
                 "Effect": "Allow",
                 "Action": [
                     "kms:CreateGrant",
                     "kms:Decrypt",
                     "kms:DescribeKey",
                     "kms:Encrypt",
                     "kms:GenerateDataKey*"
                  ],
                  "Resource": [
                     "arn:aws:kms:us-east-1:111122223333:key/<your-kms-key>"
                  ]
             }
         ]
     }
     ```

------

1. To allow Amazon Redshift and SageMaker AI to assume the role to interact with other services, add the following trust policy to the IAM role.

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": [
             "redshift.amazonaws.com",
             "sagemaker.amazonaws.com",
             "forecast.amazonaws.com"
           ]
         },
         "Action": "sts:AssumeRole"
       }
     ]
   }
   ```

------

1. (Optional) Create an Amazon S3 bucket and an AWS KMS key. These are for Amazon Redshift to use to store the training data sent to Amazon SageMaker AI and receive the trained model from Amazon SageMaker AI.

1. (Optional) Create different combinations of IAM roles and Amazon S3 buckets for controlling access to different user groups.

1. When you turn on enhanced VPC routing, traffic between Redshift ML and your S3 bucket goes through your private VPC. For more information about VPC routing, see [Enhanced VPC routing in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/mgmt/enhanced-vpc-routing.html).

   For more information about permissions required to specify a private VPC for your hyperparameter tuning job, see [Permissions required to use Amazon Redshift ML with Amazon SageMaker AI](https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-iam-access-control-identity-based.html).
**Note**  
Inference calls made to remote SageMaker AI models do not go through your VPC.

For information on how to use the CREATE MODEL statement to start creating models for different use cases, see [CREATE MODEL](r_CREATE_MODEL.md).

### Managing permissions and ownership
<a name="permissions-ownership"></a>

Just as with other database objects, such as tables or functions, Amazon Redshift binds creating and using ML models to access control mechanisms. There are separate permissions for creating a model and for using a model to run prediction functions. 

The following examples use two user groups, `retention_analyst_grp` (model creator) and `marketing_analyst_grp` (model user) to illustrate how Amazon Redshift manages access control. The retention analyst creates machine learning models that the other set of users can use through acquired permissions. 

A superuser can GRANT USER or GROUP permission to create machine learning models using the following statement.

```
GRANT CREATE MODEL TO GROUP retention_analyst_grp;
```

Users or groups with this permission can create a model in any schema in the cluster if a user has the usual CREATE permission on the SCHEMA. The machine learning model is part of the schema hierarchy in a similar way to tables, views, procedures, and user-defined functions. 

Assuming a schema `demo_ml` already exists, grant the two user groups the permission on the schema as follows.

```
GRANT CREATE, USAGE ON SCHEMA demo_ml TO GROUP retention_analyst_grp;
```

```
GRANT USAGE ON SCHEMA demo_ml TO GROUP marketing_analyst_grp;
```

To let other users use your machine learning inference function, grant the EXECUTE permission. The following example uses the EXECUTE permission to grant the marketing\_analyst\_grp GROUP the permission to use the model.

```
GRANT EXECUTE ON MODEL demo_ml.customer_churn_auto_model TO GROUP marketing_analyst_grp;
```

Use the REVOKE statement with CREATE MODEL and EXECUTE to revoke those permissions from users or groups. For more information on permission control commands, see [GRANT](r_GRANT.md) and [REVOKE](r_REVOKE.md).
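
For example, a superuser could later revoke the permissions granted above. This sketch uses the same groups and model as the preceding examples.

```
REVOKE CREATE MODEL FROM GROUP retention_analyst_grp;

REVOKE EXECUTE ON MODEL demo_ml.customer_churn_auto_model FROM GROUP marketing_analyst_grp;
```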

## Using model explainability with Amazon Redshift ML
<a name="clarify"></a>

With model explainability in Amazon Redshift ML, you use feature importance values to help understand how each attribute in your training data contributes to the predicted result. 

Model explainability helps improve your machine learning (ML) models by explaining the predictions that your models make. Model explainability helps explain how these models make predictions using a feature attribution approach. 

Amazon Redshift ML incorporates model explainability to provide model explanation functionality to Amazon Redshift ML users. For more information about model explainability, see [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html) in the *Amazon SageMaker AI Developer Guide*.

Model explainability also monitors the inferences that models make in production for feature attribution drift. It also provides tools to help you generate model governance reports that you can use to inform risk and compliance teams, and external regulators.

When you specify the AUTO ON or AUTO OFF option when using the CREATE MODEL statement, after the model training job finishes, SageMaker AI creates the explanation output. You can use the EXPLAIN\_MODEL function to query the explainability report in a JSON format. For more information, see [Machine learning functions](ml-function.md).
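
For example, the following query retrieves the explainability report for a model named `customer_churn_auto_model`; the report is returned as JSON.

```
SELECT EXPLAIN_MODEL('customer_churn_auto_model');
```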

## Amazon Redshift ML probability metrics
<a name="probability_metrics"></a>

 In supervised learning problems, class labels are outcomes of predictions that use the input data. For example, if you're using a model to predict whether a customer would resubscribe to a streaming service, possible labels are likely and unlikely. Redshift ML provides the capability of probability metrics, which assign a probability to each label to indicate its likelihood. This helps you make more informed decisions based on the predicted outcomes. In Amazon Redshift ML, probability metrics are available when creating AUTO ON models with a problem type of either binary classification or multiclass classification. If you omit the AUTO ON parameter, Redshift ML assumes that the model should have AUTO ON. 

### Create the model
<a name="probability_metrics_create_model"></a>

 When creating a model, Amazon Redshift automatically detects the model type and problem type. If it is a classification problem, Redshift automatically creates a second inference function that you can use to output probabilities relative to each label. This second inference function's name is your specified inference function name followed by the string `_probabilities`. For example, if you name your inference function as `customer_churn_predict`, then the second inference function's name is `customer_churn_predict_probabilities`. You can then query this function to get the probabilities of each label. 

```
CREATE MODEL customer_churn_model
FROM customer_activity
PROBLEM_TYPE BINARY_CLASSIFICATION
TARGET churn
FUNCTION customer_churn_predict
IAM_ROLE default
AUTO ON
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');
```

### Get probabilities
<a name="probability_metrics_create_model_get_probability"></a>

 Once the probability function is ready, running the command returns a [SUPER type](https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html) that contains arrays of the returned probabilities and their associated labels. For example, the result `"probabilities" : [0.7, 0.3], "labels" : ["False.", "True."]` means that the False label has a probability of 0.7, and the True label has a probability of 0.3. 

```
SELECT customer_churn_predict_probabilities(Account_length, Area_code, 
            VMail_message, Day_mins, Day_calls, Day_charge,Eve_mins, Eve_calls, 
            Eve_charge, Night_mins, Night_calls, Night_charge,Intl_mins, Intl_calls, 
            Intl_charge, Cust_serv_calls) 
FROM customer_activity;
            
customer_churn_predict_probabilities
 --------------------
 {"probabilities" : [0.7, 0.3], "labels" : ["False.", "True."]} 
 {"probabilities" : [0.8, 0.2], "labels" : ["False.", "True."]}
 {"probabilities" : [0.75, 0.25], "labels" : ["True.", "False"]}
```

 The probabilities and labels arrays are always sorted by their probabilities in descending order. You can write a query to return just the predicted label with the highest probability by unnesting the SUPER returned results of the probability function. 

```
SELECT prediction.labels[0], prediction.probabilities[0]
            FROM (SELECT customer_churn_predict_probabilities(Account_length, Area_code, 
            VMail_message, Day_mins, Day_calls, Day_charge,Eve_mins, Eve_calls, 
            Eve_charge, Night_mins, Night_calls, Night_charge,Intl_mins, Intl_calls, 
            Intl_charge, Cust_serv_calls) AS prediction
FROM customer_activity);

  labels   | probabilities
-----------+--------------
 "False."  | 0.7
 "False."  | 0.8
 "True."   | 0.75
```

To make the queries simpler, you can store the results of the prediction function in a table.

```
CREATE TABLE churn_auto_predict_probabilities AS 
             (SELECT customer_churn_predict_probabilities(Account_length, Area_code, 
             VMail_message, Day_mins, Day_calls, Day_charge,Eve_mins, Eve_calls, 
             Eve_charge, Night_mins, Night_calls, Night_charge,Intl_mins,
             Intl_calls, Intl_charge, Cust_serv_calls) AS prediction
FROM customer_activity);
```

You can query the table with the results to return only predictions that have a probability higher than 0.7.

```
SELECT prediction.labels[0], prediction.probabilities[0]
FROM churn_auto_predict_probabilities
WHERE prediction.probabilities[0] > 0.7;

  labels   | probabilities
-----------+--------------
 "False."  | 0.8
 "True."   | 0.75
```

Using index notation, you can get the probability of a specific label. The following example returns probabilities of all the `True.` labels.

```
SELECT label, index, p.prediction.probabilities[index]
FROM churn_auto_predict_probabilities p, p.prediction.labels AS label AT index
WHERE label='True.';

  label  | index | probabilities
---------+-------+---------------
 "True." |     0 | 0.3
 "True." |     0 | 0.2
 "True." |     0 | 0.75
```

 The following example returns all rows that have a `True.` label with a probability greater than 0.7, indicating that the customer is likely to churn. 

```
SELECT prediction.labels[0], prediction.probabilities[0]
FROM churn_auto_predict_probabilities
WHERE prediction.probabilities[0] > 0.7 AND prediction.labels[0] = 'True.';

labels     | probabilities
-----------+--------------
 "True."   | 0.75
```

# Tutorials for Amazon Redshift ML
<a name="tutorials_for_amazon_redshift_ml"></a>

You can use Amazon Redshift ML to train machine learning models using SQL statements, and then invoke the models in SQL queries for prediction. Machine learning in Amazon Redshift trains a model with one SQL command. Amazon Redshift automatically launches a training job in Amazon SageMaker AI and generates a model. Once a model is created, you can perform predictions in Amazon Redshift using the model’s prediction function.

Follow the steps in these tutorials to learn about Amazon Redshift ML features:
+ [Tutorial: Building customer churn models](tutorial_customer_churn.md) – In this tutorial, you use Amazon Redshift ML to create a customer churn model with the CREATE MODEL command, and run prediction queries for user scenarios. Then, you implement queries using the SQL function that the CREATE MODEL command generates.
+ [Tutorial: Building K-means clustering models](tutorial_k-means_clustering.md) – In this tutorial, you use Amazon Redshift ML to create, train, and deploy a machine learning model based on the [K-means algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/k-means.html).
+ [Tutorial: Building multi-class classification models](tutorial_multi-class_classification.md) – In this tutorial, you use Amazon Redshift ML to create a machine learning model that solves multi-class classification problems. The multi-class classification algorithm classifies data points into one of three or more classes. Then, you implement queries using the SQL function that the CREATE MODEL command generates. 
+ [Tutorial: Building XGBoost models](tutorial_xgboost.md) – In this tutorial, you create a model with data from Amazon S3 and run prediction queries with the model using Amazon Redshift ML. The XGBoost algorithm is an optimized implementation of the gradient boosted trees algorithm. 
+ [Tutorial: Building regression models](tutorial_regression.md) – In this tutorial, you use Amazon Redshift ML to create a machine learning regression model and run prediction queries on the model. Regression models allow you to predict numerical outcomes, such as the price of a house, or how many people will use a city’s bike rental service. 
+ [Tutorial: Building regression models with linear learner](tutorial_linear_learner_regression.md) – In this tutorial, you create a linear learner model with data from Amazon S3 and run prediction queries with the model using Amazon Redshift ML. The SageMaker AI linear learner algorithm solves either regression or multi-class classification problems. 
+ [Tutorial: Building multi-class classification models with linear learner](tutorial_linear_learner_multi-class_classification.md) – In this tutorial, you create a linear learner model with data from Amazon S3, and then run prediction queries with the model using Amazon Redshift ML. The SageMaker AI linear learner algorithm solves either regression or classification problems. 

# Tutorial: Building customer churn models
<a name="tutorial_customer_churn"></a>

In this tutorial, you use Amazon Redshift ML to create a customer churn model with the CREATE MODEL command, and run prediction queries for user scenarios. Then, you implement queries using the SQL function that the CREATE MODEL command generates.

You can use a simple CREATE MODEL command to export training data, train a model, import the model, and prepare an Amazon Redshift prediction function. Use the CREATE MODEL statement to specify training data either as a table or SELECT statement.

This example uses historical information to construct a machine learning model of a mobile operator’s customer churn. First, SageMaker AI trains your machine learning model and then tests your model using the profile information of an arbitrary customer. After the model is validated, Amazon SageMaker AI deploys the model and the prediction function to Amazon Redshift. You can use the prediction function to predict whether a customer is going to churn or not.

## Use case examples
<a name="tutorial_customer_churn_tasks"></a>

You can solve other binary classification problems using Amazon Redshift ML, such as predicting if a sales lead will close or not. You could also predict whether a financial transaction is fraudulent or not.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Perform predictions with the model

## Prerequisites
<a name="tutorial_customer_churn_prereqs"></a>

To complete this tutorial, you must have the following prerequisites:
+ You must set up an Amazon Redshift cluster for Amazon Redshift ML. To do so, use the documentation for [Cluster and configure setup for Amazon Redshift ML administration.](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html)
+ The Amazon Redshift cluster that you use to create the model, and the Amazon S3 bucket that you use to stage the training data and store the model artifacts must be in the same AWS Region.
+ To download the SQL commands and the sample dataset used in this documentation, do one of the following:
  + Download the [SQL commands](https://s3.amazonaws.com/redshift-downloads/redshift-ml/tutorial-scripts/redshift-ml-tutorial.sql), [Customer activity file](https://s3.amazonaws.com/redshift-downloads/redshift-ml/customer_activity/customer_activity.csv), and [Abalone file](https://s3.amazonaws.com/redshift-downloads/redshift-ml/abalone_xg/abalone.csv).
  + Using the AWS CLI for Amazon S3, run the following command. You can use your own target path.

    ```
    aws s3 cp s3://redshift-downloads/redshift-ml/tutorial-scripts/redshift-ml-tutorial.sql </target/path>
    aws s3 cp s3://redshift-downloads/redshift-ml/customer_activity/customer_activity.csv </target/path>
    aws s3 cp s3://redshift-downloads/redshift-ml/abalone_xgb/abalone_xgb.csv </target/path>
    ```

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_customer_churn_step_load"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to edit and run queries and visualize results.

Running the following queries creates a table named `customer_activity` and ingests the sample dataset from Amazon S3.

```
DROP TABLE IF EXISTS customer_activity;

CREATE TABLE customer_activity (
state varchar(2),
account_length int,
area_code int,
phone varchar(8),
intl_plan varchar(3),
vMail_plan varchar(3),
vMail_message int,
day_mins float,
day_calls int,
day_charge float,
total_charge float,
eve_mins float,
eve_calls int,
eve_charge float,
night_mins float,
night_calls int,
night_charge float,
intl_mins float,
intl_calls int,
intl_charge float,
cust_serv_calls int,
churn varchar(6),
record_date date
);

COPY customer_activity
FROM 's3://redshift-downloads/redshift-ml/customer_activity/'
REGION 'us-east-1' IAM_ROLE default
FORMAT AS CSV IGNOREHEADER 1;
```

## Step 2: Create the machine learning model
<a name="tutorial_customer_churn_step_create_model"></a>

Churn is the target for this model. All other inputs for the model are attributes that help to create a function to predict churn.

The following example uses the CREATE MODEL operation to deliver a model that predicts whether a customer will be active, using inputs such as the customer’s state, account length, area code, average daily spend, and average daily cases. In the following example, replace `amzn-s3-demo-bucket` with your own Amazon S3 bucket.

```
CREATE MODEL customer_churn_auto_model
FROM
    (
      SELECT state,
             account_length,
             area_code,
             total_charge/account_length AS average_daily_spend,
             cust_serv_calls/account_length AS average_daily_cases,
             churn
      FROM customer_activity
      WHERE  record_date < '2020-01-01'
     )
TARGET churn FUNCTION ml_fn_customer_churn_auto
IAM_ROLE default SETTINGS (
  S3_BUCKET 'amzn-s3-demo-bucket'
);
```

The SELECT query in the preceding example creates the training data. The TARGET clause specifies which column is the machine learning label that the CREATE MODEL operation uses to learn how to predict. The target column “churn” indicates whether the customer still has an active membership or has suspended the membership. The S3\_BUCKET field is the name of the Amazon S3 bucket that you previously created. The Amazon S3 bucket is used to share training data and artifacts between Amazon Redshift and Amazon SageMaker AI. The remaining columns are the features that are used for the prediction.

For a summary of the syntax and features of a basic use case of the CREATE MODEL command, see [Simple CREATE MODEL](https://docs.aws.amazon.com/redshift/latest/dg/r_create_model_use_cases.html#r_simple_create_model).

### Add permissions for server-side encryption (optional)
<a name="tutorial_customer_churn_encryption"></a>

Amazon Redshift by default uses Amazon SageMaker AI Autopilot for training. In particular, Amazon Redshift securely exports the training data to the customer-specified Amazon S3 bucket. If you don’t specify a `KMS_KEY_ID`, then the data is encrypted using server-side encryption SSE-S3 by default.

When you encrypt your input using server-side encryption with an AWS KMS managed key (SSE-KMS), add the following permissions:

```
{
    "Effect": "Allow",
    "Action": [
    "kms:Encrypt"
    "kms:Decrypt"
    ]
}
```

For more information about Amazon SageMaker AI roles, see [Amazon SageMaker AI roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) in the *Amazon SageMaker AI Developer Guide*.

### Check the status of model training (optional)
<a name="tutorial_customer_churn_check_status"></a>

You can use the SHOW MODEL command to know when your model is ready.

Use the following operation to check the status of the model.

```
SHOW MODEL customer_churn_auto_model;
```

The following is an example of the output of the previous operation.

```
+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|           Key            |                                                                             Value                                                                             |
+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        Model Name        |                                                                   customer_churn_auto_model                                                                   |
|       Schema Name        |                                                                            public                                                                             |
|          Owner           |                                                                            awsuser                                                                            |
|      Creation Time       |                                                                   Tue, 14.06.2022 17:15:52                                                                    |
|       Model State        |                                                                           TRAINING                                                                            |
|                          |                                                                                                                                                               |
|      TRAINING DATA:      |                                                                                                                                                               |
|          Query           | SELECT STATE, ACCOUNT_LENGTH, AREA_CODE, TOTAL_CHARGE / ACCOUNT_LENGTH AS AVERAGE_DAILY_SPEND, CUST_SERV_CALLS / ACCOUNT_LENGTH AS AVERAGE_DAILY_CASES, CHURN |
|                          |                                                                    FROM CUSTOMER_ACTIVITY                                                                     |
|                          |                                                               WHERE RECORD_DATE < '2020-01-01'                                                                |
|      Target Column       |                                                                             CHURN                                                                             |
|                          |                                                                                                                                                               |
|       PARAMETERS:        |                                                                                                                                                               |
|        Model Type        |                                                                             auto                                                                              |
|       Problem Type       |                                                                                                                                                               |
|        Objective         |                                                                                                                                                               |
|     AutoML Job Name      |                                                                redshiftml-20220614171552640901                                                                |
|      Function Name       |                                                                   ml_fn_customer_churn_auto                                                                   |
|   Function Parameters    |                                            state account_length area_code average_daily_spend average_daily_cases                                             |
| Function Parameter Types |                                                                 varchar int4 int4 float8 int4                                                                 |
|         IAM Role         |                                                                     default-aws-iam-role                                                                      |
|        S3 Bucket         |                                                                        amzn-s3-demo-bucket                                                                     |
|       Max Runtime        |                                                                             5400                                                                              |
+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

When the model training is complete, the `model_state` variable becomes `Model is Ready`, and the prediction function becomes available.
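
As an alternative to SHOW MODEL, you can list your models and their states from the STV_ML_MODEL_INFO system view, assuming you have access to system tables. The following is a minimal sketch.

```
SELECT schema_name, model_name, model_state
FROM stv_ml_model_info;
```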

## Step 3: Perform predictions with the model
<a name="tutorial_customer_churn_step_perform_predictions"></a>

You can use SQL statements to view the predictions made by the prediction model. In this example, the prediction function created by the CREATE MODEL operation is named `ml_fn_customer_churn_auto`. The input arguments for the prediction function correspond to the types of the features, such as varchar for the `state` and integer for `account_length`. The output of the prediction function is the same type as the TARGET column of the CREATE MODEL statement.

1. You trained the model on data from before 2020-01-01, so now you use the prediction function on the testing set. The following query displays predictions of whether customers who signed up after 2020-01-01 will churn.

   ```
   SELECT
       phone,
       ml_fn_customer_churn_auto(
           state,
           account_length,
           area_code,
           total_charge / account_length,
           cust_serv_calls / account_length
       ) AS active
   FROM
       customer_activity
   WHERE
       record_date > '2020-01-01';
   ```

1. The following example uses the same prediction function for a different use case. In this case, Amazon Redshift predicts the proportion of churners and non-churners among customers from different states where the record date is greater than 2020-01-01.

   ```
   WITH predicted AS (
       SELECT
           state,
           ml_fn_customer_churn_auto(
               state,
               account_length,
               area_code,
               total_charge / account_length,
               cust_serv_calls / account_length
           ) :: varchar(6) AS active
       FROM
           customer_activity
       WHERE
           record_date > '2020-01-01'
   )
   SELECT
       state,
       SUM(
           CASE
               WHEN active = 'True.' THEN 1
               ELSE 0
           END
       ) AS churners,
       SUM(
           CASE
               WHEN active = 'False.' THEN 1
               ELSE 0
           END
       ) AS nonchurners,
       COUNT(*) AS total_per_state
   FROM
       predicted
   GROUP BY
       state
   ORDER BY
       state;
   ```

1. The following example uses the prediction function for the use case of predicting the percentage of customers who churn in a state. In this case, Amazon Redshift predicts the churn percentage where the record date is greater than 2020-01-01.

   ```
   WITH predicted AS (
       SELECT
           state,
           ml_fn_customer_churn_auto(
               state,
               account_length,
               area_code,
               total_charge / account_length,
               cust_serv_calls / account_length
           ) :: varchar(6) AS active
       FROM
           customer_activity
       WHERE
           record_date > '2020-01-01'
   )
   SELECT
       state,
       CAST((CAST((SUM(
           CASE
               WHEN active = 'True.' THEN 1
               ELSE 0
           END
       )) AS FLOAT) / CAST(COUNT(*) AS FLOAT)) AS DECIMAL (3, 2)) AS pct_churn,
       COUNT(*) AS total_customers_per_state
   FROM
       predicted
   GROUP BY
       state
   ORDER BY
       3 DESC;
   ```

## Related topics
<a name="tutorial_customer_churn_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL command](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN\_MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building K-means clustering models
<a name="tutorial_k-means_clustering"></a>

In this tutorial, you use Amazon Redshift ML to create, train, and deploy a machine learning model based on the [K-means algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/k-means.html). This algorithm solves clustering problems where you want to discover groupings in the data. K-means helps in grouping data that has not been labeled yet. To learn more about K-means clustering, see [How K-means Clustering Works](https://docs.aws.amazon.com/sagemaker/latest/dg/algo-kmeans-tech-notes.html) in the *Amazon SageMaker AI Developer Guide*.

You will use a CREATE MODEL operation to create a K-means model from an Amazon Redshift cluster. You can use a CREATE MODEL command to export training data, train a model, import the model, and prepare an Amazon Redshift prediction function. Use the CREATE MODEL operation to specify training data either as a table or a SELECT statement.

In this tutorial, you use K-means on the [Global Database of Events, Language, and Tone (GDELT)](https://aws.amazon.com/public-datasets/gdelt/) dataset, which monitors news across the world, and the data is stored for every second of every day. K-means will group events that have similar tone, actors, or locations. The data is stored as multiple files on Amazon Simple Storage Service, in two different folders. The historical folder covers the years 1979–2013, and the daily-updates folder covers the years 2013 and later. For this example, we use the historical format and bring in 1979 data.

## Use case examples
<a name="tutorial_k-means_clustering_tasks"></a>

You can solve other clustering problems with Amazon Redshift ML, such as grouping customers who have similar viewing habits on a streaming service. You could also use Redshift ML to predict the optimal number of shipping centers for a delivery service.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Perform predictions with the model

## Prerequisites
<a name="tutorial_k-means_clustering_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_k-means_clustering_step_load"></a>

1. Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following query. The query drops the `gdelt_data` table in the public schema if it exists and creates a table of the same name in the public schema.

   ```
   DROP TABLE IF EXISTS gdelt_data CASCADE;
   
   CREATE TABLE gdelt_data (
       GlobalEventId bigint,
       SqlDate bigint,
       MonthYear bigint,
       Year bigint,
       FractionDate double precision,
       Actor1Code varchar(256),
       Actor1Name varchar(256),
       Actor1CountryCode varchar(256),
       Actor1KnownGroupCode varchar(256),
       Actor1EthnicCode varchar(256),
       Actor1Religion1Code varchar(256),
       Actor1Religion2Code varchar(256),
       Actor1Type1Code varchar(256),
       Actor1Type2Code varchar(256),
       Actor1Type3Code varchar(256),
       Actor2Code varchar(256),
       Actor2Name varchar(256),
       Actor2CountryCode varchar(256),
       Actor2KnownGroupCode varchar(256),
       Actor2EthnicCode varchar(256),
       Actor2Religion1Code varchar(256),
       Actor2Religion2Code varchar(256),
       Actor2Type1Code varchar(256),
       Actor2Type2Code varchar(256),
       Actor2Type3Code varchar(256),
       IsRootEvent bigint,
       EventCode bigint,
       EventBaseCode bigint,
       EventRootCode bigint,
       QuadClass bigint,
       GoldsteinScale double precision,
       NumMentions bigint,
       NumSources bigint,
       NumArticles bigint,
       AvgTone double precision,
       Actor1Geo_Type bigint,
       Actor1Geo_FullName varchar(256),
       Actor1Geo_CountryCode varchar(256),
       Actor1Geo_ADM1Code varchar(256),
       Actor1Geo_Lat double precision,
       Actor1Geo_Long double precision,
       Actor1Geo_FeatureID bigint,
       Actor2Geo_Type bigint,
       Actor2Geo_FullName varchar(256),
       Actor2Geo_CountryCode varchar(256),
       Actor2Geo_ADM1Code varchar(256),
       Actor2Geo_Lat double precision,
       Actor2Geo_Long double precision,
       Actor2Geo_FeatureID bigint,
       ActionGeo_Type bigint,
       ActionGeo_FullName varchar(256),
       ActionGeo_CountryCode varchar(256),
       ActionGeo_ADM1Code varchar(256),
       ActionGeo_Lat double precision,
       ActionGeo_Long double precision,
       ActionGeo_FeatureID bigint,
       DATEADDED bigint
   );
   ```

1. The following query loads the sample data into the `gdelt_data` table.

   ```
   COPY gdelt_data
   FROM 's3://gdelt-open-data/events/1979.csv' 
   REGION 'us-east-1'
   IAM_ROLE default 
   CSV
   DELIMITER '\t';
   ```

### Examine the training data (optional)
<a name="tutorial_k-means_clustering_examine"></a>

To see what data your model will be trained on, use the following query.

```
SELECT
    AvgTone,
    EventCode,
    NumArticles,
    Actor1Geo_Lat,
    Actor1Geo_Long,
    Actor2Geo_Lat,
    Actor2Geo_Long
FROM
    gdelt_data LIMIT 100;
```

## Step 2: Create the machine learning model
<a name="tutorial_k-means_clustering_create_model"></a>

The following example uses the CREATE MODEL command to create a model that groups the data into seven clusters. The K value is the number of clusters that your data points are divided into. The K-means algorithm iteratively determines the best cluster centers and assigns each data point to the closest cluster center. Members nearest the same cluster center belong to the same group. Members of a group are as similar as possible to other members in the same group, and as different as possible from members of other groups. The K value is subjective and depends on methods that measure the similarities among data points. You can change the K value to smooth out cluster sizes if the clusters are unevenly distributed.

In the following example, replace amzn-s3-demo-bucket with your own Amazon S3 bucket.

```
CREATE MODEL news_data_clusters
FROM
    (
        SELECT
            AvgTone,
            EventCode,
            NumArticles,
            Actor1Geo_Lat,
            Actor1Geo_Long,
            Actor2Geo_Lat,
            Actor2Geo_Long
        FROM
            gdelt_data
    ) FUNCTION news_monitoring_cluster 
    IAM_ROLE default 
    AUTO OFF 
    MODEL_TYPE KMEANS 
    PREPROCESSORS 'none' 
    HYPERPARAMETERS DEFAULT
    EXCEPT
    (K '7') 
    SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');
```

### Check the status of model training (optional)
<a name="tutorial_k-means_clustering_check_status"></a>

You can use the SHOW MODEL command to know when your model is ready.

To check the model status, use the following SHOW MODEL operation and find if the `Model State` is `Ready`.

```
SHOW MODEL NEWS_DATA_CLUSTERS;
```

When the model is ready, the output of the previous operation should show that the `Model State` is `Ready`. The following is an example of the output of the SHOW MODEL operation.

```
+--------------------------+------------------------------------------------------------------------------------------------------+
|        Model Name        |                                          news_data_clusters                                          |
+--------------------------+------------------------------------------------------------------------------------------------------+
|       Schema Name        |                                                public                                                |
|          Owner           |                                               awsuser                                                |
|      Creation Time       |                                       Fri, 17.06.2022 16:32:19                                       |
|       Model State        |                                                READY                                                 |
|        train:msd         |                                             2973.822754                                              |
|      train:progress      |                                              100.000000                                              |
|     train:throughput     |                                            237114.875000                                             |
|      Estimated Cost      |                                               0.004983                                               |
|                          |                                                                                                      |
|      TRAINING DATA:      |                                                                                                      |
|          Query           | SELECT AVGTONE, EVENTCODE, NUMARTICLES, ACTOR1GEO_LAT, ACTOR1GEO_LONG, ACTOR2GEO_LAT, ACTOR2GEO_LONG |
|                          |                                           FROM GDELT_DATA                                            |
|                          |                                                                                                      |
|       PARAMETERS:        |                                                                                                      |
|        Model Type        |                                                kmeans                                                |
|    Training Job Name     |                                redshiftml-20220617163219978978-kmeans                                |
|      Function Name       |                                       news_monitoring_cluster                                        |
|   Function Parameters    |       avgtone eventcode numarticles actor1geo_lat actor1geo_long actor2geo_lat actor2geo_long        |
| Function Parameter Types |                             float8 int8 int8 float8 float8 float8 float8                             |
|         IAM Role         |                                         default-aws-iam-role                                         |
|        S3 Bucket         |                                            amzn-s3-demo-bucket                                       |
|       Max Runtime        |                                                 5400                                                 |
|                          |                                                                                                      |
|     HYPERPARAMETERS:     |                                                                                                      |
|       feature_dim        |                                                  7                                                   |
|            k             |                                                  7                                                   |
+--------------------------+------------------------------------------------------------------------------------------------------+
```

## Step 3: Perform predictions with the model
<a name="tutorial_k-means_clustering_step_perform_predictions"></a>

### Identify the clusters
<a name="tutorial_k-means_clustering_identify_clusters"></a>

You can find discrete groupings identified in the data by your model, otherwise known as clusters. A cluster is the set of data points that is closer to its cluster center than any other cluster center. Since the K value represents the number of clusters in the model, it also represents the number of cluster centers. The following query identifies the clusters by showing the cluster associated with each `globaleventid`.

```
SELECT
    globaleventid,
    news_monitoring_cluster (
        AvgTone,
        EventCode,
        NumArticles,
        Actor1Geo_Lat,
        Actor1Geo_Long,
        Actor2Geo_Lat,
        Actor2Geo_Long
    ) AS cluster
FROM
    gdelt_data;
```

### Check the distribution of data
<a name="tutorial_k-means_clustering_check_distribution"></a>

You can check the distribution of data across clusters to see if the K value that you chose caused the data to be somewhat evenly distributed. Use the following query to determine if the data is evenly distributed across your clusters.

```
SELECT
    events_cluster,
    COUNT(*) AS nbr_events
FROM
    (
        SELECT
            globaleventid,
            news_monitoring_cluster(
                AvgTone,
                EventCode,
                NumArticles,
                Actor1Geo_Lat,
                Actor1Geo_Long,
                Actor2Geo_Lat,
                Actor2Geo_Long
            ) AS events_cluster
        FROM
            gdelt_data
    )
GROUP BY
    1;
```

Note that you can change the K value to smooth out cluster sizes if the clusters are unevenly distributed.
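
For example, if the previous query shows heavily skewed clusters, you might drop the model and train it again with a different K value. The following sketch reuses the training query and Amazon S3 bucket from Step 2; the value `K '9'` is only an illustration, not a recommendation.

```
DROP MODEL news_data_clusters;

CREATE MODEL news_data_clusters
FROM
    (
        SELECT
            AvgTone,
            EventCode,
            NumArticles,
            Actor1Geo_Lat,
            Actor1Geo_Long,
            Actor2Geo_Lat,
            Actor2Geo_Long
        FROM
            gdelt_data
    ) FUNCTION news_monitoring_cluster 
    IAM_ROLE default 
    AUTO OFF 
    MODEL_TYPE KMEANS 
    PREPROCESSORS 'none' 
    HYPERPARAMETERS DEFAULT
    EXCEPT
    (K '9') 
    SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');
```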

### Determine the cluster centers
<a name="tutorial_k-means_clustering_determine_centers"></a>

A data point is closer to its cluster center than it is to any other cluster center. Thus, finding the cluster centers helps you define the clusters.

Run the following query to determine the centers of the clusters based on the number of articles by event code.

```
SELECT
    news_monitoring_cluster (
        AvgTone,
        EventCode,
        NumArticles,
        Actor1Geo_Lat,
        Actor1Geo_Long,
        Actor2Geo_Lat,
        Actor2Geo_Long
    ) AS events_cluster,
    eventcode,
    SUM(numArticles) AS numArticles
FROM
    gdelt_data
GROUP BY
    1,
    2;
```

### Show information about data points in a cluster
<a name="tutorial_k-means_clustering_data_points_info"></a>

Use the following query to return the data for the points assigned to the fifth cluster. The selected articles must have two actors.

```
SELECT
    news_monitoring_cluster (
        AvgTone,
        EventCode,
        NumArticles,
        Actor1Geo_Lat,
        Actor1Geo_Long,
        Actor2Geo_Lat,
        Actor2Geo_Long
    ) AS events_cluster,
    eventcode,
    actor1name,
    actor2name,
    SUM(numarticles) AS totalarticles
FROM
    gdelt_data
WHERE
    events_cluster = 5
    AND actor1name <> ' '
    AND actor2name <> ' '
GROUP BY
    1,
    2,
    3,
    4
ORDER BY
    5 desc;
```

### Show data about events with actors of the same ethnic code
<a name="tutorial_k-means_clustering_show_events_data"></a>

The following query counts the number of articles written about events with a positive tone. The query also requires that the two actors have the same ethnic code and it returns which cluster each event is assigned to.

```
SELECT
    news_monitoring_cluster (
        AvgTone,
        EventCode,
        NumArticles,
        Actor1Geo_Lat,
        Actor1Geo_Long,
        Actor2Geo_Lat,
        Actor2Geo_Long
    ) AS events_cluster,
    SUM(numarticles) AS total_articles,
    eventcode AS event_code,
    Actor1EthnicCode AS ethnic_code
FROM
    gdelt_data
WHERE
    Actor1EthnicCode = Actor2EthnicCode
    AND Actor1EthnicCode <> ' '
    AND Actor2EthnicCode <> ' '
    AND AvgTone > 0
GROUP BY
    1,
    3,
    4
HAVING
    (total_articles) > 4
ORDER BY
    1,
    2 ASC;
```

## Related topics
<a name="tutorial_k-means_clustering_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN\_MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building multi-class classification models
<a name="tutorial_multi-class_classification"></a>

In this tutorial, you use Amazon Redshift ML to create a machine learning model that solves multi-class classification problems. The multi-class classification algorithm classifies data points into one of three or more classes. Then, you implement queries using the SQL function that the CREATE MODEL command generates.

You can use a CREATE MODEL command to export training data, train a model, import the model, and prepare an Amazon Redshift prediction function. Use the CREATE MODEL operation to specify training data either as a table or a SELECT statement.

To follow along with the tutorial, you use the public dataset [E-Commerce Sales Forecast](https://www.kaggle.com/allunia/e-commerce-sales-forecast), which includes sales data of an online UK retailer. The model you generate will target the most active customers for a special customer loyalty program. With multi-class classification, you can use the model to predict how many months a customer will be active over a 13-month period. The prediction function designates customers who are predicted to be active for 7 or more months for admission to the program.

## Use case examples
<a name="tutorial_multi-class_classification_tasks"></a>

You can solve other multi-class classification problems with Amazon Redshift ML, such as predicting the best-selling product from a product line. You could also predict which fruit an image contains, such as selecting apples or pears or oranges.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Perform predictions with the model

## Prerequisites
<a name="tutorial_multi-class_classification_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_multi-class_classification_step_load"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following queries. These queries load the sample data into Amazon Redshift.

1. The following query creates a table named `ecommerce_sales`.

   ```
   CREATE TABLE IF NOT EXISTS ecommerce_sales (
       invoiceno VARCHAR(30),
       stockcode VARCHAR(30),
       description VARCHAR(60),
       quantity DOUBLE PRECISION,
       invoicedate VARCHAR(30),
       unitprice DOUBLE PRECISION,
       customerid BIGINT,
       country VARCHAR(25)
   );
   ```

1. The following query copies the sample data from the [E-Commerce Sales Forecast dataset](https://www.kaggle.com/allunia/e-commerce-sales-forecast) into the `ecommerce_sales` table.

   ```
   COPY ecommerce_sales
   FROM
       's3://redshift-ml-multiclass/ecommerce_data.txt' 
   IAM_ROLE default 
   DELIMITER '\t' 
   IGNOREHEADER 1 
   REGION 'us-east-1' 
   MAXERROR 100;
   ```

### Split the data
<a name="tutorial_multi-class_classification_split_data"></a>

When you create a model in Amazon Redshift ML, SageMaker AI automatically splits your data into training and test sets, so that SageMaker AI can determine the model accuracy. By manually splitting the data at this step, you will be able to verify the accuracy of the model by allocating an additional prediction set. 

Use the following SQL statement to split the data into three sets for training, validation, and prediction.

```
--creates table with all data
CREATE TABLE ecommerce_sales_data AS (
    SELECT
        t1.stockcode,
        t1.description,
        t1.invoicedate,
        t1.customerid,
        t1.country,
        t1.sales_amt,
        CAST(RANDOM() * 100 AS INT) AS data_group_id
    FROM
        (
            SELECT
                stockcode,
                description,
                invoicedate,
                customerid,
                country,
                SUM(quantity * unitprice) AS sales_amt
            FROM
                ecommerce_sales
            GROUP BY
                1,
                2,
                3,
                4,
                5
        ) t1
);

--creates training set
CREATE TABLE ecommerce_sales_training AS (
    SELECT
        a.customerid,
        a.country,
        a.stockcode,
        a.description,
        a.invoicedate,
        a.sales_amt,
        (b.nbr_months_active) AS nbr_months_active
    FROM
        ecommerce_sales_data a
        INNER JOIN (
            SELECT
                customerid,
                COUNT(
                    DISTINCT(
                        DATE_PART(y, CAST(invoicedate AS DATE)) || '-' || LPAD(
                            DATE_PART(mon, CAST(invoicedate AS DATE)),
                            2,
                            '00'
                        )
                    )
                ) AS nbr_months_active
            FROM
                ecommerce_sales_data
            GROUP BY
                1
        ) b ON a.customerid = b.customerid
    WHERE
        a.data_group_id < 80
);

--creates validation set
CREATE TABLE ecommerce_sales_validation AS (
    SELECT
        a.customerid,
        a.country,
        a.stockcode,
        a.description,
        a.invoicedate,
        a.sales_amt,
        (b.nbr_months_active) AS nbr_months_active
    FROM
        ecommerce_sales_data a
        INNER JOIN (
            SELECT
                customerid,
                COUNT(
                    DISTINCT(
                        DATE_PART(y, CAST(invoicedate AS DATE)) || '-' || LPAD(
                            DATE_PART(mon, CAST(invoicedate AS DATE)),
                            2,
                            '00'
                        )
                    )
                ) AS nbr_months_active
            FROM
                ecommerce_sales_data
            GROUP BY
                1
        ) b ON a.customerid = b.customerid
    WHERE
        a.data_group_id BETWEEN 80
        AND 90
);

--creates prediction set
CREATE TABLE ecommerce_sales_prediction AS (
    SELECT
        customerid,
        country,
        stockcode,
        description,
        invoicedate,
        sales_amt
    FROM
        ecommerce_sales_data
    WHERE
        data_group_id > 90);
```
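
Optionally, before you train the model, you can check how the random split distributed the rows across the three tables. The following query is a simple sanity check.

```
SELECT 'training' AS dataset, COUNT(*) AS nbr_rows FROM ecommerce_sales_training
UNION ALL
SELECT 'validation', COUNT(*) FROM ecommerce_sales_validation
UNION ALL
SELECT 'prediction', COUNT(*) FROM ecommerce_sales_prediction;
```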

## Step 2: Create the machine learning model
<a name="tutorial_multi-class_classification_step_create_model"></a>

In this step, you use the CREATE MODEL statement to create your machine learning model using multi-class classification. 

The following query creates the multi-class classification model with the training set using the CREATE MODEL operation. Replace amzn-s3-demo-bucket with your own Amazon S3 bucket.

```
CREATE MODEL ecommerce_customer_activity
FROM
    (
        SELECT
            customerid,
            country,
            stockcode,
            description,
            invoicedate,
            sales_amt,
            nbr_months_active
        FROM
            ecommerce_sales_training
    ) TARGET nbr_months_active FUNCTION predict_customer_activity IAM_ROLE default PROBLEM_TYPE MULTICLASS_CLASSIFICATION SETTINGS (
        S3_BUCKET 'amzn-s3-demo-bucket',
        S3_GARBAGE_COLLECT OFF
    );
```

In this query, you specify the problem type as `Multiclass_Classification`. The target that you predict for the model is `nbr_months_active`. When SageMaker AI finishes training the model, it creates the function `predict_customer_activity`, which you will use to make predictions in Amazon Redshift.

### Show the status of model training (optional)
<a name="tutorial_multi-class_classification_show_status"></a>

You can use the SHOW MODEL command to know when your model is ready.

Use the following query to return various metrics of the model, including model state and accuracy.

```
SHOW MODEL ecommerce_customer_activity;
```

When the model is ready, the output of the previous operation should show that the `Model State` is `Ready`. The following is an example of the output of the SHOW MODEL operation.

```
+--------------------------+-----------------------------------------------------------------------------------------------+
|        Model Name        |                                  ecommerce_customer_activity                                  |
+--------------------------+-----------------------------------------------------------------------------------------------+
|       Schema Name        |                                            public                                             |
|          Owner           |                                            awsuser                                            |
|      Creation Time       |                                   Fri, 17.06.2022 19:02:15                                    |
|       Model State        |                                             READY                                             |
|   Training Job Status    |                                  MaxAutoMLJobRuntimeReached                                   |
|   validation:accuracy    |                                           0.991280                                            |
|      Estimated Cost      |                                           7.897689                                            |
|                          |                                                                                               |
|      TRAINING DATA:      |                                                                                               |
|          Query           | SELECT CUSTOMERID, COUNTRY, STOCKCODE, DESCRIPTION, INVOICEDATE, SALES_AMT, NBR_MONTHS_ACTIVE |
|                          |                                 FROM ECOMMERCE_SALES_TRAINING                                 |
|      Target Column       |                                       NBR_MONTHS_ACTIVE                                       |
|                          |                                                                                               |
|       PARAMETERS:        |                                                                                               |
|        Model Type        |                                            xgboost                                            |
|       Problem Type       |                                   MulticlassClassification                                    |
|        Objective         |                                           Accuracy                                            |
|     AutoML Job Name      |                                redshiftml-20220617190215268770                                |
|      Function Name       |                                   predict_customer_activity                                   |
|   Function Parameters    |                customerid country stockcode description invoicedate sales_amt                 |
| Function Parameter Types |                          int8 varchar varchar varchar varchar float8                          |
|         IAM Role         |                                     default-aws-iam-role                                      |
|        S3 Bucket         |                                         amzn-s3-demo-bucket                                    |
|       Max Runtime        |                                             5400                                              |
+--------------------------+-----------------------------------------------------------------------------------------------+
```

## Step 3: Perform predictions with the model
<a name="tutorial_multi-class_classification_step_perform_predictions"></a>

The following query shows which customers qualify for your customer loyalty program. If the model predicts that the customer will be active for at least seven months, then the model selects the customer for the loyalty program.

```
SELECT
    customerid,
    predict_customer_activity(
        customerid,
        country,
        stockcode,
        description,
        invoicedate,
        sales_amt
    ) AS predicted_months_active
FROM
    ecommerce_sales_prediction
WHERE
    predicted_months_active >= 7
GROUP BY
    1,
    2
LIMIT
    10;
```
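
To see how many customers the model would admit to the program overall, you can wrap the prediction in a subquery and count distinct customers, as in the following sketch. A customer is counted if at least one of their rows is predicted at seven or more active months.

```
SELECT
    COUNT(DISTINCT customerid) AS qualifying_customers
FROM
    (
        SELECT
            customerid,
            predict_customer_activity(
                customerid,
                country,
                stockcode,
                description,
                invoicedate,
                sales_amt
            ) AS predicted_months_active
        FROM
            ecommerce_sales_prediction
    ) t
WHERE
    predicted_months_active >= 7;
```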

### Run prediction queries against the validation data (optional)
<a name="tutorial_multi-class_classification_run_validation_prediction"></a>

Run the following prediction queries against the validation data to see the model’s level of accuracy.

```
SELECT
    CAST(SUM(t1.match) AS decimal(7, 2)) AS predicted_matches,
    CAST(SUM(t1.nonmatch) AS decimal(7, 2)) AS predicted_non_matches,
    CAST(SUM(t1.match + t1.nonmatch) AS decimal(7, 2)) AS total_predictions,
    predicted_matches / total_predictions AS pct_accuracy
FROM
    (
        SELECT
            customerid,
            country,
            stockcode,
            description,
            invoicedate,
            sales_amt,
            nbr_months_active,
            predict_customer_activity(
                customerid,
                country,
                stockcode,
                description,
                invoicedate,
                sales_amt
            ) AS predicted_months_active,
            CASE
                WHEN nbr_months_active = predicted_months_active THEN 1
                ELSE 0
            END AS match,
            CASE
                WHEN nbr_months_active <> predicted_months_active THEN 1
                ELSE 0
            END AS nonmatch
        FROM
            ecommerce_sales_validation
    )t1;
```
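
To see where the model is more or less accurate, you can also break the validation results down by the actual number of active months. The following sketch counts correct predictions per class.

```
SELECT
    nbr_months_active,
    SUM(
        CASE
            WHEN nbr_months_active = predict_customer_activity(
                customerid,
                country,
                stockcode,
                description,
                invoicedate,
                sales_amt
            ) THEN 1
            ELSE 0
        END
    ) AS correct_predictions,
    COUNT(*) AS total_rows
FROM
    ecommerce_sales_validation
GROUP BY
    nbr_months_active
ORDER BY
    nbr_months_active;
```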

### Predict how many customers miss entry (optional)
<a name="tutorial_multi-class_classification_run_missed_entries"></a>

The following query counts the customers that are predicted to be active for only 5 or 6 months. The model predicts that these customers will miss out on the loyalty program. The query then compares the number of customers that barely miss the program to the number that are predicted to be eligible for the loyalty program. This query could be used to inform a decision on whether to lower the threshold for the loyalty program. You can also determine whether a significant number of customers are predicted to barely miss out on the program. You could then encourage those customers to increase their activity to earn a loyalty program membership.

```
(SELECT
    predict_customer_activity(
        customerid,
        country,
        stockcode,
        description,
        invoicedate,
        sales_amt
    ) AS predicted_months_active,
    COUNT(customerid)
FROM
    ecommerce_sales_prediction
WHERE
    predicted_months_active BETWEEN 5 AND 6
GROUP BY
    1
ORDER BY
    1 ASC
LIMIT
    10)
UNION
(SELECT
    NULL AS predicted_months_active,
    COUNT(customerid)
FROM
    ecommerce_sales_prediction
WHERE
    predict_customer_activity(
        customerid,
        country,
        stockcode,
        description,
        invoicedate,
        sales_amt
    ) >= 7);
```

## Related topics
<a name="tutorial_multi-class_classification_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN\_MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building XGBoost models
<a name="tutorial_xgboost"></a>

In this tutorial, you create a model with data from Amazon S3 and run prediction queries with the model using Amazon Redshift ML. The XGBoost algorithm is an optimized implementation of the gradient boosted trees algorithm. XGBoost handles more data types, relationships, and distributions than other gradient boosted trees algorithms. You can use XGBoost for regression, binary classification, multi-class classification, and ranking problems. For more information about the XGBoost algorithm, see [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) in the Amazon SageMaker AI Developer Guide.

The Amazon Redshift ML `CREATE MODEL` operation with the `AUTO OFF` option currently supports XGBoost as the `MODEL_TYPE`. You can provide relevant information such as the objective and hyperparameters as part of the `CREATE MODEL` command, based on your use case.
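
For example, a CREATE MODEL statement with `AUTO OFF` can keep the default hyperparameters and override only the ones that matter for your use case. The following is an illustrative sketch only; the model, table, and function names are placeholders, and the `MAX_DEPTH` and `ETA` values are arbitrary.

```
CREATE MODEL sample_xgboost_model
FROM sample_training_table 
    TARGET target_column 
    FUNCTION sample_prediction_fn 
    IAM_ROLE default 
    AUTO OFF 
    MODEL_TYPE xgboost 
    OBJECTIVE 'binary:logistic' 
    PREPROCESSORS 'none' 
    HYPERPARAMETERS DEFAULT
EXCEPT(NUM_ROUND '100', MAX_DEPTH '5', ETA '0.2') 
SETTINGS(S3_BUCKET 'amzn-s3-demo-bucket');
```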

In this tutorial, you use the [banknote authentication dataset](https://archive.ics.uci.edu/ml/datasets/banknote+authentication), which poses a binary classification problem: predicting whether a given banknote is genuine or forged.

## Use case examples
<a name="tutorial_xgboost_tasks"></a>

You can solve other binary classification problems using Amazon Redshift ML, such as predicting whether a patient is healthy or has a disease. You could also predict whether an email is spam or not spam.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Perform predictions with the model

## Prerequisites
<a name="tutorial_xgboost_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_xgboost_step_load"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following queries.

The following query creates two tables, loads the data from Amazon S3, and splits the data into a training set and a testing set. You will use the training set to train your model and create the prediction function. Then, you will test the prediction function on the testing set.

```
--create training set table
CREATE TABLE banknoteauthentication_train(
    variance FLOAT,
    skewness FLOAT,
    curtosis FLOAT,
    entropy FLOAT,
    class INT
);

--Load into training table
COPY banknoteauthentication_train
FROM
    's3://redshiftbucket-ml-sagemaker/banknote_authentication/train_data/' IAM_ROLE default REGION 'us-west-2' IGNOREHEADER 1 CSV;

--create testing set table
CREATE TABLE banknoteauthentication_test(
    variance FLOAT,
    skewness FLOAT,
    curtosis FLOAT,
    entropy FLOAT,
    class INT
);

--Load data into testing table
COPY banknoteauthentication_test
FROM
    's3://redshiftbucket-ml-sagemaker/banknote_authentication/test_data/' 
    IAM_ROLE default 
    REGION 'us-west-2' 
    IGNOREHEADER 1 
    CSV;
```

## Step 2: Create the machine learning model
<a name="tutorial_xgboost_step_create_model"></a>

The following query creates the XGBoost model in Amazon Redshift ML from the training set you created in the previous step. Replace `amzn-s3-demo-bucket` with your own `S3_BUCKET`, which will store your input datasets and other Redshift ML artifacts.

```
CREATE MODEL model_banknoteauthentication_xgboost_binary
FROM
    banknoteauthentication_train 
    TARGET class 
    FUNCTION func_model_banknoteauthentication_xgboost_binary 
    IAM_ROLE default 
    AUTO OFF 
    MODEL_TYPE xgboost 
    OBJECTIVE 'binary:logistic' 
    PREPROCESSORS 'none' 
    HYPERPARAMETERS DEFAULT
EXCEPT(NUM_ROUND '100') 
SETTINGS(S3_BUCKET 'amzn-s3-demo-bucket');
```

### Show the status of model training (optional)
<a name="tutorial_xgboost_show_status"></a>

You can use the SHOW MODEL command to know when your model is ready.

Use the following query to monitor the progress of the model training.

```
SHOW MODEL model_banknoteauthentication_xgboost_binary;
```

If the model is `READY`, the SHOW MODEL operation also provides the `train:error` metric, as shown in the following example of the output. The `train:error` metric is a measure of your model's accuracy, reported to six decimal places. A value of 0 is most accurate and a value of 1 is least accurate.

```
+--------------------------+--------------------------------------------------+
|        Model Name        |   model_banknoteauthentication_xgboost_binary    |
+--------------------------+--------------------------------------------------+
| Schema Name              | public                                           |
| Owner                    | awsuser                                          |
| Creation Time            | Tue, 21.06.2022 19:07:35                         |
| Model State              | READY                                            |
| train:error              |                                         0.000000 |
| Estimated Cost           |                                         0.006197 |
|                          |                                                  |
| TRAINING DATA:           |                                                  |
| Query                    | SELECT *                                         |
|                          | FROM "BANKNOTEAUTHENTICATION_TRAIN"              |
| Target Column            | CLASS                                            |
|                          |                                                  |
| PARAMETERS:              |                                                  |
| Model Type               | xgboost                                          |
| Training Job Name        | redshiftml-20220621190735686935-xgboost          |
| Function Name            | func_model_banknoteauthentication_xgboost_binary |
| Function Parameters      | variance skewness curtosis entropy               |
| Function Parameter Types | float8 float8 float8 float8                      |
| IAM Role                 | default-aws-iam-role                             |
| S3 Bucket                | amzn-s3-demo-bucket                              |
| Max Runtime              |                                             5400 |
|                          |                                                  |
| HYPERPARAMETERS:         |                                                  |
| num_round                |                                              100 |
| objective                | binary:logistic                                  |
+--------------------------+--------------------------------------------------+
```

## Step 3: Perform predictions with the model
<a name="tutorial_xgboost_step_perform_predictions"></a>

### Check the accuracy of the model
<a name="tutorial_xgboost_check_accuracy"></a>

The following prediction query uses the prediction function created in the previous step to check the accuracy of your model. Run this query on the testing set to make sure the model doesn't fit the training set too closely. Fitting the training set too closely is also known as overfitting, and overfitting can cause the model to make unreliable predictions.

```
WITH predict_data AS (
    SELECT
        class AS label,
        func_model_banknoteauthentication_xgboost_binary (variance, skewness, curtosis, entropy) AS predicted,
        CASE
            WHEN label IS NULL THEN 0
            ELSE label
        END AS actual,
        CASE
            WHEN actual = predicted THEN 1 :: INT
            ELSE 0 :: INT
        END AS correct
    FROM
        banknoteauthentication_test
),
aggr_data AS (
    SELECT
        SUM(correct) AS num_correct,
        COUNT(*) AS total
    FROM
        predict_data
)
SELECT
    (num_correct :: FLOAT / total :: FLOAT) AS accuracy
FROM
    aggr_data;
```
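
Beyond a single accuracy number, you can also look at how the predictions break down by actual and predicted class, similar to a confusion matrix. The following query groups the test rows by both values.

```
SELECT
    class AS actual,
    func_model_banknoteauthentication_xgboost_binary(variance, skewness, curtosis, entropy) AS predicted,
    COUNT(*) AS nbr_rows
FROM
    banknoteauthentication_test
GROUP BY
    1,
    2
ORDER BY
    1,
    2;
```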

### Predict the number of original and counterfeit banknotes
<a name="tutorial_xgboost_predict_amount"></a>

The following prediction query returns the predicted number of original and counterfeit banknotes in the testing set.

```
WITH predict_data AS (
    SELECT
        func_model_banknoteauthentication_xgboost_binary(variance, skewness, curtosis, entropy) AS predicted
    FROM
        banknoteauthentication_test
)
SELECT
    CASE
        WHEN predicted = '0' THEN 'Original banknote'
        WHEN predicted = '1' THEN 'Counterfeit banknote'
        ELSE 'NA'
    END AS banknote_authentication,
    COUNT(1) AS count
FROM
    predict_data
GROUP BY
    1;
```
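
To put the predicted counts in context, you can compare them with the actual labels in the testing set. The following query counts the actual classes using the same labeling.

```
SELECT
    CASE
        WHEN class = 0 THEN 'Original banknote'
        WHEN class = 1 THEN 'Counterfeit banknote'
        ELSE 'NA'
    END AS banknote_authentication,
    COUNT(1) AS count
FROM
    banknoteauthentication_test
GROUP BY
    1;
```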

### Find the average observation for an original and a counterfeit banknote
<a name="tutorial_xgboost_find_average_observation"></a>

The following prediction query returns the average value of each feature for banknotes that are predicted to be original and counterfeit in the testing set.

```
WITH predict_data AS (
    SELECT
        func_model_banknoteauthentication_xgboost_binary(variance, skewness, curtosis, entropy) AS predicted,
          variance,
          skewness,
          curtosis,
          entropy
    FROM
        banknoteauthentication_test
)
SELECT
    CASE
        WHEN predicted = '0' THEN 'Original banknote'
        WHEN predicted = '1' THEN 'Counterfeit banknote'
        ELSE 'NA'
    END AS banknote_authentication,
    TRUNC(AVG(variance), 2) AS avg_variance,
    TRUNC(AVG(skewness), 2) AS avg_skewness,
    TRUNC(AVG(curtosis), 2) AS avg_curtosis,
    TRUNC(AVG(entropy), 2) AS avg_entropy
FROM
    predict_data
GROUP BY
    1
ORDER BY
    2;
```

## Related topics
<a name="tutorial_xgboost_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN\_MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building regression models
<a name="tutorial_regression"></a>

In this tutorial, you use Amazon Redshift ML to create a machine learning regression model and run prediction queries on the model. Regression models allow you to predict numerical outcomes, such as the price of a house, or how many people will use a city’s bike rental service. You use the CREATE MODEL command in Amazon Redshift with your training data. Then, Amazon Redshift ML compiles the model, imports the trained model to Redshift, and prepares a SQL prediction function. You can use the prediction function in SQL queries in Amazon Redshift.

In this tutorial, you will use Amazon Redshift ML to build a regression model that predicts the number of people that use the city of Toronto’s bike sharing service at any given hour of a day. The inputs for the model include holidays and weather conditions. You will use a regression model, because you want a numerical outcome for this problem.

You can use the CREATE MODEL command to export training data, train a model, and make the model available in Amazon Redshift as a SQL function. Use the CREATE MODEL operation to specify training data either as a table or a SELECT statement.

## Use case examples
<a name="tutorial_regression_tasks"></a>

You can solve other regression problems with Amazon Redshift ML, such as predicting a customer’s lifetime value. You could also use Redshift ML to predict the most profitable price and the resulting revenue of a product.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Validate the model

## Prerequisites
<a name="tutorial_regression_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_regression_step_load"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following queries.

1. You must create three tables to load the three public datasets into Amazon Redshift. The datasets are [Toronto Bike Ridership Data](https://open.toronto.ca/dataset/bike-share-toronto-ridership-data/), [historical weather data](https://climate.weather.gc.ca/historical_data/search_historic_data_e.html), and [historical holidays data](https://github.com/uWaterloo/Datasets/blob/master/Holidays/holidays.csv). Run the following query in the Amazon Redshift query editor to create tables named `ridership`, `weather`, and `holiday`.

   ```
   CREATE TABLE IF NOT EXISTS ridership (
       trip_id INT,
       trip_duration_seconds INT,
       trip_start_time timestamp,
       trip_stop_time timestamp,
       from_station_name VARCHAR(50),
       to_station_name VARCHAR(50),
       from_station_id SMALLINT,
       to_station_id SMALLINT,
       user_type VARCHAR(20)
   );
   
   CREATE TABLE IF NOT EXISTS weather (
       longitude_x DECIMAL(5, 2),
       latitude_y DECIMAL(5, 2),
       station_name VARCHAR(20),
       climate_id BIGINT,
       datetime_utc TIMESTAMP,
       weather_year SMALLINT,
       weather_month SMALLINT,
       weather_day SMALLINT,
       time_utc VARCHAR(5),
       temp_c DECIMAL(5, 2),
       temp_flag VARCHAR(1),
       dew_point_temp_c DECIMAL(5, 2),
       dew_point_temp_flag VARCHAR(1),
       rel_hum SMALLINT,
       rel_hum_flag VARCHAR(1),
       precip_amount_mm DECIMAL(5, 2),
       precip_amount_flag VARCHAR(1),
       wind_dir_10s_deg VARCHAR(10),
       wind_dir_flag VARCHAR(1),
       wind_spd_kmh VARCHAR(10),
       wind_spd_flag VARCHAR(1),
       visibility_km VARCHAR(10),
       visibility_flag VARCHAR(1),
       stn_press_kpa DECIMAL(5, 2),
       stn_press_flag VARCHAR(1),
       hmdx SMALLINT,
       hmdx_flag VARCHAR(1),
       wind_chill VARCHAR(10),
       wind_chill_flag VARCHAR(1),
       weather VARCHAR(10)
   );
   
   CREATE TABLE IF NOT EXISTS holiday (holiday_date DATE, description VARCHAR(100));
   ```

1. The following query loads the sample data into the tables that you created in the previous step.

   ```
   COPY ridership
   FROM
       's3://redshift-ml-bikesharing-data/bike-sharing-data/ridership/' 
       IAM_ROLE default 
       FORMAT CSV 
       IGNOREHEADER 1 
       DATEFORMAT 'auto' 
       TIMEFORMAT 'auto' 
       REGION 'us-west-2' 
       gzip;
   
   COPY weather
   FROM
       's3://redshift-ml-bikesharing-data/bike-sharing-data/weather/' 
       IAM_ROLE default 
       FORMAT csv 
       IGNOREHEADER 1 
       DATEFORMAT 'auto' 
       TIMEFORMAT 'auto' 
       REGION 'us-west-2' 
       gzip;
   
   COPY holiday
   FROM
       's3://redshift-ml-bikesharing-data/bike-sharing-data/holiday/' 
       IAM_ROLE default 
       FORMAT csv 
       IGNOREHEADER 1 
       DATEFORMAT 'auto' 
       TIMEFORMAT 'auto' 
       REGION 'us-west-2' 
       gzip;
   ```

1. The following query performs transformations on the `ridership` and `weather` datasets to remove bias or anomalies. Removing bias and anomalies results in improved model accuracy. The query simplifies the tables by creating two new views called `ridership_view` and `weather_view`.

   ```
   CREATE
   OR REPLACE VIEW ridership_view AS
   SELECT
       trip_time,
       trip_count,
       TO_CHAR(trip_time, 'hh24') :: INT trip_hour,
       TO_CHAR(trip_time, 'dd') :: INT trip_day,
       TO_CHAR(trip_time, 'mm') :: INT trip_month,
       TO_CHAR(trip_time, 'yy') :: INT trip_year,
       TO_CHAR(trip_time, 'q') :: INT trip_quarter,
       TO_CHAR(trip_time, 'w') :: INT trip_month_week,
       TO_CHAR(trip_time, 'd') :: INT trip_week_day
   FROM
       (
           SELECT
               CASE
                   WHEN TRUNC(r.trip_start_time) < '2017-07-01' :: DATE THEN CONVERT_TIMEZONE(
                       'US/Eastern',
                       DATE_TRUNC('hour', r.trip_start_time)
                   )
                   ELSE DATE_TRUNC('hour', r.trip_start_time)
               END trip_time,
               COUNT(1) trip_count
           FROM
               ridership r
           WHERE
               r.trip_duration_seconds BETWEEN 60
               AND 60 * 60 * 24
           GROUP BY
               1
       );
   
   CREATE
   OR REPLACE VIEW weather_view AS
   SELECT
       CONVERT_TIMEZONE(
           'US/Eastern',
           DATE_TRUNC('hour', datetime_utc)
       ) daytime,
       ROUND(AVG(temp_c)) temp_c,
       ROUND(AVG(precip_amount_mm)) precip_amount_mm
   FROM
       weather
   GROUP BY
       1;
   ```

1. The following query creates a table that combines all the relevant input attributes from `ridership_view` and `weather_view` into the `trip_data` table.

   ```
   CREATE TABLE trip_data AS
   SELECT
       r.trip_time,
       r.trip_count,
       r.trip_hour,
       r.trip_day,
       r.trip_month,
       r.trip_year,
       r.trip_quarter,
       r.trip_month_week,
       r.trip_week_day,
       w.temp_c,
       w.precip_amount_mm,
       CASE
           WHEN h.holiday_date IS NOT NULL THEN 1
           WHEN TO_CHAR(r.trip_time, 'D') :: INT IN (1, 7) THEN 1
           ELSE 0
       END is_holiday,
       ROW_NUMBER() OVER (
           ORDER BY
               RANDOM()
       ) serial_number
   FROM
       ridership_view r
       JOIN weather_view w ON (r.trip_time = w.daytime)
       LEFT OUTER JOIN holiday h ON (TRUNC(r.trip_time) = h.holiday_date);
   ```

### View the sample data (optional)
<a name="tutorial_regression_view_data"></a>

The following query shows sample rows from the `trip_data` table. You can run it to make sure the table was created correctly.

```
SELECT * 
FROM trip_data 
LIMIT 5;
```

The following is an example of the output of the previous operation.

```
+---------------------+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
|      trip_time      | trip_count | trip_hour | trip_day | trip_month | trip_year | trip_quarter | trip_month_week | trip_week_day | temp_c | precip_amount_mm | is_holiday | serial_number |
+---------------------+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
| 2017-03-21 22:00:00 |         47 |        22 |       21 |          3 |        17 |            1 |               3 |             3 |      1 |                0 |          0 |             1 |
| 2018-05-04 01:00:00 |         19 |         1 |        4 |          5 |        18 |            2 |               1 |             6 |     12 |                0 |          0 |             3 |
| 2018-01-11 10:00:00 |         93 |        10 |       11 |          1 |        18 |            1 |               2 |             5 |      9 |                0 |          0 |             5 |
| 2017-10-28 04:00:00 |         20 |         4 |       28 |         10 |        17 |            4 |               4 |             7 |     11 |                0 |          1 |             7 |
| 2017-12-31 21:00:00 |         11 |        21 |       31 |         12 |        17 |            4 |               5 |             1 |    -15 |                0 |          1 |             9 |
+---------------------+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
```

### Show the correlation between attributes (optional)
<a name="tutorial_regression_show_correlation"></a>

Determining correlation helps you measure the strength of association between attributes. The level of association can help you determine what affects your target output. In this tutorial, the target output is `trip_count`.

The following query creates or replaces the `sp_correlation` procedure. You use the stored procedure called `sp_correlation` to show the correlation between an attribute and other attributes in a table in Amazon Redshift.

```
CREATE OR REPLACE PROCEDURE sp_correlation(source_schema_name in varchar(255), source_table_name in varchar(255), target_column_name in varchar(255), output_temp_table_name inout varchar(255)) AS $$
DECLARE
  v_sql varchar(max);
  v_generated_sql varchar(max);
  v_source_schema_name varchar(255)=lower(source_schema_name);
  v_source_table_name varchar(255)=lower(source_table_name);
  v_target_column_name varchar(255)=lower(target_column_name);
BEGIN
  EXECUTE 'DROP TABLE IF EXISTS ' || output_temp_table_name;
  v_sql = '
SELECT
  ''CREATE temp table '|| output_temp_table_name||' AS SELECT ''|| outer_calculation||
  '' FROM (SELECT COUNT(1) number_of_items, SUM('||v_target_column_name||') sum_target, SUM(POW('||v_target_column_name||',2)) sum_square_target, POW(SUM('||v_target_column_name||'),2) square_sum_target,''||
  inner_calculation||
  '' FROM (SELECT ''||
  column_name||
  '' FROM '||v_source_table_name||'))''
FROM
  (
  SELECT
    DISTINCT
    LISTAGG(outer_calculation,'','') OVER () outer_calculation
    ,LISTAGG(inner_calculation,'','') OVER () inner_calculation
    ,LISTAGG(column_name,'','') OVER () column_name
  FROM
    (
    SELECT
      CASE WHEN atttypid=16 THEN ''DECODE(''||column_name||'',true,1,0)'' ELSE column_name END column_name
      ,atttypid
      ,''CAST(DECODE(number_of_items * sum_square_''||rn||'' - square_sum_''||rn||'',0,null,(number_of_items*sum_target_''||rn||'' - sum_target * sum_''||rn||
        '')/SQRT((number_of_items * sum_square_target - square_sum_target) * (number_of_items * sum_square_''||rn||
        '' - square_sum_''||rn||''))) AS numeric(5,2)) ''||column_name outer_calculation
      ,''sum(''||column_name||'') sum_''||rn||'',''||
            ''SUM(trip_count*''||column_name||'') sum_target_''||rn||'',''||
            ''SUM(POW(''||column_name||'',2)) sum_square_''||rn||'',''||
            ''POW(SUM(''||column_name||''),2) square_sum_''||rn inner_calculation
    FROM
      (
      SELECT
        row_number() OVER (order by a.attnum) rn
        ,a.attname::VARCHAR column_name
        ,a.atttypid
      FROM pg_namespace AS n
        INNER JOIN pg_class AS c ON n.oid = c.relnamespace
        INNER JOIN pg_attribute AS a ON c.oid = a.attrelid
      WHERE a.attnum > 0
        AND n.nspname = '''||v_source_schema_name||'''
        AND c.relname = '''||v_source_table_name||'''
        AND a.atttypid IN (16,20,21,23,700,701,1700)
      )
    )
)';
  EXECUTE v_sql INTO v_generated_sql;
  EXECUTE  v_generated_sql;
END;
$$ LANGUAGE plpgsql;
```

The following query shows the correlation between the target column, `trip_count`, and other numeric attributes in our dataset.

```
call sp_correlation(
    'public',
    'trip_data',
    'trip_count',
    'tmp_corr_table'
);

SELECT
    *
FROM
    tmp_corr_table;
```

The following example shows the output of the previous `sp_correlation` operation.

```
+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
| trip_count | trip_hour | trip_day | trip_month | trip_year | trip_quarter | trip_month_week | trip_week_day | temp_c | precip_amount_mm | is_holiday | serial_number |
+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
|          1 |      0.32 |     0.01 |       0.18 |      0.12 |         0.18 |               0 |          0.02 |   0.53 |            -0.07 |      -0.13 |             0 |
+------------+-----------+----------+------------+-----------+--------------+-----------------+---------------+--------+------------------+------------+---------------+
```
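
The values that `sp_correlation` computes are Pearson correlation coefficients. As a point of reference, the following query is a minimal sketch, separate from the stored procedure, that computes the same coefficient for a single pair of columns, `temp_c` and `trip_count`, directly from the raw sums. Its result should be close to the `temp_c` value in the preceding output.

```
-- Pearson correlation between temp_c and trip_count, computed from raw sums.
SELECT
    CAST(
        (COUNT(1) * SUM(temp_c * trip_count) - SUM(temp_c) * SUM(trip_count))
        / SQRT(
            (COUNT(1) * SUM(POWER(temp_c, 2)) - POWER(SUM(temp_c), 2))
            * (COUNT(1) * SUM(POWER(trip_count, 2)) - POWER(SUM(trip_count), 2))
        ) AS NUMERIC(5, 2)
    ) AS corr_temp_c
FROM
    trip_data;
```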

## Step 2: Create the machine learning model
<a name="tutorial_regression_create_model"></a>

1. The following query splits your data into a training set and a validation set by designating 80% of the dataset for training and 20% for validation. The training set is the input that Amazon Redshift ML uses to identify the best possible algorithm for the model. After the model is created, you use the validation set to validate the model accuracy. An optional row-count check of the split is shown after the query.

   ```
   CREATE TABLE training_data AS
   SELECT
       trip_count,
       trip_hour,
       trip_day,
       trip_month,
       trip_year,
       trip_quarter,
       trip_month_week,
       trip_week_day,
       temp_c,
       precip_amount_mm,
       is_holiday
   FROM
       trip_data
   WHERE
       serial_number > (
           SELECT
               COUNT(1) * 0.2
           FROM
               trip_data
       );
   
   CREATE TABLE validation_data AS
   SELECT
       trip_count,
       trip_hour,
       trip_day,
       trip_month,
       trip_year,
       trip_quarter,
       trip_month_week,
       trip_week_day,
       temp_c,
       precip_amount_mm,
       is_holiday,
       trip_time
   FROM
       trip_data
   WHERE
       serial_number <= (
           SELECT
               COUNT(1) * 0.2
           FROM
               trip_data
       );
   ```
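
   Optionally, you can confirm the split. The following query is a minimal sketch that counts the rows in each table; `training_data` should hold roughly 80 percent of the rows and `validation_data` roughly 20 percent.

   ```
   SELECT
       (SELECT COUNT(*) FROM training_data) AS training_rows,
       (SELECT COUNT(*) FROM validation_data) AS validation_rows;
   ```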

1. The following query creates a regression model to predict the `trip_count` value for any input date and time. In the following example, replace amzn-s3-demo-bucket with your own S3 bucket. An optional query for monitoring training progress follows the CREATE MODEL statement.

   ```
   CREATE MODEL predict_rental_count
   FROM
       training_data TARGET trip_count FUNCTION predict_rental_count 
       IAM_ROLE default 
       PROBLEM_TYPE regression 
       OBJECTIVE 'mse' 
       SETTINGS (
           s3_bucket 'amzn-s3-demo-bucket',
           s3_garbage_collect off,
           max_runtime 5000
       );
   ```
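
   Model training runs in SageMaker AI and can take up to the `max_runtime` value that you specified (5,000 seconds in this example). Optionally, use the SHOW MODEL command to monitor progress; the `Model State` field changes to `READY` when the model and the `predict_rental_count` function are available.

   ```
   SHOW MODEL predict_rental_count;
   ```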

## Step 3: Validate the model
<a name="tutorial_regression_step_validate"></a>

1. Use the following query to output aspects of the model, and find the mean square error metric in the output. Mean square error is a typical accuracy metric for regression problems.

   ```
   show model predict_rental_count;
   ```

1. Run the following prediction queries against the validation data to compare the predicted trip count to the actual trip count.

   ```
   SELECT
       trip_time,
       actual_count,
       predicted_count,
       (actual_count - predicted_count) difference
   FROM
       (
           SELECT
               trip_time,
               trip_count AS actual_count,
               PREDICT_RENTAL_COUNT (
                   trip_hour,
                   trip_day,
                   trip_month,
                   trip_year,
                   trip_quarter,
                   trip_month_week,
                   trip_week_day,
                   temp_c,
                   precip_amount_mm,
                   is_holiday
               ) predicted_count
           FROM
               validation_data
       )
   LIMIT
       5;
   ```

1. The following query calculates the mean square error and root mean square error based on your validation data. You use mean square error and root mean square error to measure the distance between the predicted numeric target and the actual numeric answer. A good model has a low score in both metrics. The following query returns the value of both metrics.

   ```
   SELECT
       ROUND(
           AVG(POWER((actual_count - predicted_count), 2)),
           2
       ) mse,
       ROUND(
           SQRT(AVG(POWER((actual_count - predicted_count), 2))),
           2
       ) rmse
   FROM
       (
           SELECT
               trip_time,
               trip_count AS actual_count,
               PREDICT_RENTAL_COUNT (
                   trip_hour,
                   trip_day,
                   trip_month,
                   trip_year,
                   trip_quarter,
                   trip_month_week,
                   trip_week_day,
                   temp_c,
                   precip_amount_mm,
                   is_holiday
               ) predicted_count
           FROM
               validation_data
       );
   ```

1. The following query calculates the percent error in trip count for each trip time on 2017-01-01. The query orders the trip times from the time with the lowest percent error to the time with the highest percent error.

   ```
   SELECT
       trip_time,
       CAST(ABS(((actual_count - predicted_count) / actual_count)) * 100 AS DECIMAL (7,2)) AS pct_error
   FROM
       (
           SELECT
               trip_time,
               trip_count AS actual_count,
               PREDICT_RENTAL_COUNT (
                   trip_hour,
                   trip_day,
                   trip_month,
                   trip_year,
                   trip_quarter,
                   trip_month_week,
                   trip_week_day,
                   temp_c,
                   precip_amount_mm,
                   is_holiday
               ) predicted_count
           FROM
               validation_data
       )
   WHERE
      trip_time LIKE '2017-01-01 %%:%%:%%'
   ORDER BY
      2 ASC;
   ```

## Related topics
<a name="tutorial_regression_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN_MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building regression models with linear learner
<a name="tutorial_linear_learner_regression"></a>

In this tutorial, you create a linear learner model with data from Amazon S3 and run prediction queries with the model using Amazon Redshift ML. The SageMaker AI linear learner algorithm solves either regression or multi-class classification problems. To learn more about regression and multi-class classification problems, see [Problem types for the machine learning paradigms](https://docs.aws.amazon.com/sagemaker/latest/dg/algorithms-choose.html#basic-machine-learning-paradigms) in the Amazon SageMaker AI Developer Guide. In this tutorial, you solve a regression problem. The linear learner algorithm trains many models in parallel, and automatically determines the most optimized model. You use the CREATE MODEL operation in Amazon Redshift, which creates your linear learner model using SageMaker AI and sends a prediction function to Amazon Redshift. For more information about the linear learner algorithm, see [Linear Learner Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) in the Amazon SageMaker AI Developer Guide.

You can use a CREATE MODEL command to export training data, train a model, import the model, and prepare an Amazon Redshift prediction function. Use the CREATE MODEL operation to specify training data either as a table or a SELECT statement.
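
For example, the following minimal sketch shows both forms. The table, column, and function names are placeholders for illustration only; the actual CREATE MODEL statement for this tutorial appears in Step 2.

```
-- Training data supplied as a table (placeholder names).
CREATE MODEL example_model_from_table
FROM example_training_table
TARGET example_label
FUNCTION predict_example_label
IAM_ROLE default
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');

-- Training data supplied as a SELECT statement, the form used in Step 2 of this tutorial.
CREATE MODEL example_model_from_query
FROM (SELECT feature_1, feature_2, example_label FROM example_training_table)
TARGET example_label
FUNCTION predict_example_label_from_query
IAM_ROLE default
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');
```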

Linear learner models optimize either continuous objectives or discrete objectives. Continuous objectives are used for regression, while discrete objectives are used for classification. Some methods provide a solution for only continuous objectives, such as the regression method. The linear learner algorithm provides an increase in speed over naive hyperparameter optimization techniques, such as the Naive Bayes technique. A naive optimization technique assumes that each input variable is independent. To use the linear learner algorithm, you must provide columns representing the dimensions of the inputs, and rows representing the observations. For more information about the linear learner algorithm, see [Linear Learner Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) in the Amazon SageMaker AI Developer Guide.

In this tutorial, you build a linear learner model that predicts the age of abalone. You use the CREATE MODEL command on the [Abalone dataset](http://archive.ics.uci.edu/ml/datasets/Abalone) to determine the relationship between the physical measurements of abalone and their age. Then, you use the model to predict the age of abalone.

## Use case examples
<a name="tutorial_linear_learner_regression_tasks"></a>

You can solve other regression problems with linear learner and Amazon Redshift ML, such as predicting the price of a house. You could also use Redshift ML to predict the number of people who will use a city’s bike rental service.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Validate the model

## Prerequisites
<a name="tutorial_linear_learner_regression_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_linear_learner_regression_step_load_data"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following queries. These queries load the sample data into Redshift and divide the data into a training set and a validation set.

1. The following query creates the `abalone_dataset` table.

   ```
   CREATE TABLE abalone_dataset (
       id INT IDENTITY(1, 1),
       Sex CHAR(1),
       Length float,
       Diameter float,
       Height float,
       Whole float,
       Shucked float,
       Viscera float,
       Shell float,
       Rings integer
   );
   ```

1. The following query copies the sample data from the [Abalone dataset](http://archive.ics.uci.edu/ml/datasets/Abalone) in Amazon S3 to the `abalone_dataset` table you created previously in Amazon Redshift.

   ```
   COPY abalone_dataset
   FROM
       's3://redshift-ml-multiclass/abalone.csv' REGION 'us-east-1' IAM_ROLE default CSV IGNOREHEADER 1 NULL AS 'NULL';
   ```

1. By manually splitting the data, you can set aside a separate set to verify the accuracy of the model later. The following query splits the data into two sets. The `abalone_training` table is for training and the `abalone_validation` table is for validation. An optional check of the split ratio is shown after the query.

   ```
   CREATE TABLE abalone_training as 
   SELECT 
       *
   FROM
       abalone_dataset
   WHERE
       mod(id, 10) < 8;
   
   CREATE TABLE abalone_validation as 
   SELECT 
       *
   FROM
       abalone_dataset
   WHERE
       mod(id, 10) >= 8;
   ```
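
   Optionally, you can confirm the split ratio. Because `mod(id, 10) < 8` selects id values whose remainder is 0 through 7, the following minimal sketch should report that close to 80 percent of the rows are in `abalone_training`.

   ```
   -- Percentage of rows assigned to the training table.
   SELECT
       CAST(
           100.0 * (SELECT COUNT(*) FROM abalone_training) / COUNT(*) AS DECIMAL(5, 1)
       ) AS pct_training_rows
   FROM
       abalone_dataset;
   ```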

## Step 2: Create the machine learning model
<a name="tutorial_linear_learner_regression_step_create_model"></a>

In this step, you use the CREATE MODEL statement to create your machine learning model with the linear learner algorithm. 

The following query creates the linear learner model with the CREATE MODEL operation using your S3 bucket. Replace amzn-s3-demo-bucket with your own S3 bucket.

```
CREATE MODEL model_abalone_ring_prediction
FROM
    (
        SELECT
            Sex,
            Length,
            Diameter,
            Height,
            Whole,
            Shucked,
            Viscera,
            Shell,
            Rings AS target_label
        FROM
            abalone_training
    ) TARGET target_label FUNCTION f_abalone_ring_prediction IAM_ROLE default MODEL_TYPE LINEAR_LEARNER PROBLEM_TYPE REGRESSION OBJECTIVE 'MSE' SETTINGS (
        S3_BUCKET 'amzn-s3-demo-bucket',
        MAX_RUNTIME 15000
    );
```

### Show the status of model training (optional)
<a name="tutorial_linear_learner_regression_show_status"></a>

You can use the SHOW MODEL command to check when your model is ready.

Use the following query to monitor the progress of the model training.

```
SHOW MODEL model_abalone_ring_prediction;
```

When the model is ready, the output of the previous operation should look similar to the following example. Note that the output provides the `validation:mse` metric, which is the mean square error. You will use the mean square error to validate the accuracy of the model in the next step.

```
+--------------------------+----------------------------------------------------------------------------------------------------+
|        Model Name        |                                   model_abalone_ring_prediction                                    |
+--------------------------+----------------------------------------------------------------------------------------------------+
| Schema Name              | public                                                                                             |
| Owner                    | awsuser                                                                                            |
| Creation Time            | Thu, 30.06.2022 18:00:10                                                                           |
| Model State              | READY                                                                                              |
| validation:mse           |                                                                                           4.168633 |
| Estimated Cost           |                                                                                           4.291608 |
|                          |                                                                                                    |
| TRAINING DATA:           |                                                                                                    |
| Query                    | SELECT SEX , LENGTH , DIAMETER , HEIGHT , WHOLE , SHUCKED , VISCERA , SHELL, RINGS AS TARGET_LABEL |
|                          | FROM ABALONE_TRAINING                                                                              |
| Target Column            | TARGET_LABEL                                                                                       |
|                          |                                                                                                    |
| PARAMETERS:              |                                                                                                    |
| Model Type               | linear_learner                                                                                     |
| Problem Type             | Regression                                                                                         |
| Objective                | MSE                                                                                                |
| AutoML Job Name          | redshiftml-20220630180010947843                                                                    |
| Function Name            | f_abalone_ring_prediction                                                                          |
| Function Parameters      | sex length diameter height whole shucked viscera shell                                             |
| Function Parameter Types | bpchar float8 float8 float8 float8 float8 float8 float8                                            |
| IAM Role                 | default-aws-iam-role                                                                               |
| S3 Bucket                | amzn-s3-demo-bucket                                                                                |
| Max Runtime              |                                                                                              15000 |
+--------------------------+----------------------------------------------------------------------------------------------------+
```

## Step 3: Validate the model
<a name="tutorial_linear_learner_regression_step_validate"></a>

1. The following prediction query validates the accuracy of the model on the `abalone_validation` dataset by calculating mean square error and root mean square error.

   ```
   SELECT
       ROUND(AVG(POWER((tgt_label - predicted), 2)), 2) mse,
       ROUND(SQRT(AVG(POWER((tgt_label - predicted), 2))), 2) rmse
   FROM
       (
           SELECT
               Sex,
               Length,
               Diameter,
               Height,
               Whole,
               Shucked,
               Viscera,
               Shell,
               Rings AS tgt_label,
               f_abalone_ring_prediction(
                   Sex,
                   Length,
                   Diameter,
                   Height,
                   Whole,
                   Shucked,
                   Viscera,
                   Shell
               ) AS predicted,
               CASE
                   WHEN tgt_label = predicted then 1
                   ELSE 0
               END AS match,
               CASE
                   WHEN tgt_label <> predicted then 1
                   ELSE 0
               END AS nonmatch
           FROM
               abalone_validation
       ) t1;
   ```

   The output of the previous query should look like the following example. The value of the mean square error metric should be similar to the `validation:mse` metric shown by the SHOW MODEL operation’s output.

   ```
   +-----+--------------------+
   | mse |        rmse        |
   +-----+--------------------+
   | 5.1 | 2.2600000000000002 |
   +-----+--------------------+
   ```

1. Use the following query to run the EXPLAIN_MODEL operation on your model. The operation returns a model explainability report. For more information about the EXPLAIN_MODEL operation, see the [EXPLAIN_MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html) in the Amazon Redshift Database Developer Guide.

   ```
   SELECT
       EXPLAIN_MODEL ('model_abalone_ring_prediction');
   ```

   The following information is an example of the model explainability report produced by the previous EXPLAIN_MODEL operation. The values for each of the inputs are Shapley values, which represent the effect each input has on the prediction of your model; inputs with higher values have more impact on the prediction. In this example, `shucked` and `shell` have the largest impact on predicting the age of abalone.

   ```
   { 
       "explanations": { 
           "kernel_shap": { 
               "label0": { 
                   "expected_value" :10.290688514709473,
                   "global_shap_values": { 
                       "diameter" :0.6856910187882492,
                       "height" :0.4415323937124035,
                       "length" :0.21507476107609084,
                       "sex" :0.448611774505744,
                       "shell" :1.70426496893776,
                       "shucked" :2.1181392924386994,
                       "viscera" :0.342220754059912,
                       "whole" :0.6711906974084011 
                   } 
               } 
           } 
       },
       "version" :"1.0" 
   };
   ```

1. Use the following query to calculate the percentage of correct predictions that the model makes for abalone that are not yet mature. Immature abalone have 10 or fewer rings, and a prediction counts as correct when it is within one ring of the actual number of rings.

   ```
   SELECT
       TRUNC(
           SUM(
               CASE
                   WHEN ROUND(
                       f_abalone_ring_prediction(
                           Sex,
                           Length,
                           Diameter,
                           Height,
                           Whole,
                           Shucked,
                           Viscera,
                           Shell
                       ),
                       0
                   ) BETWEEN Rings - 1
                   AND Rings + 1 THEN 1
                   ELSE 0
               END
           ) / CAST(COUNT(SHELL) AS FLOAT),
           4
       ) AS prediction_pct
   FROM
       abalone_validation
   WHERE
       Rings <= 10;
   ```

## Related topics
<a name="tutorial_linear_learner_regression_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN_MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Tutorial: Building multi-class classification models with linear learner
<a name="tutorial_linear_learner_multi-class_classification"></a>

In this tutorial, you create a linear learner model with data from Amazon S3, and then run prediction queries with the model using Amazon Redshift ML. The SageMaker AI linear learner algorithm solves either regression or classification problems. To learn more about regression and multi-class classification problems, see [Problem types for the machine learning paradigms](https://docs.aws.amazon.com/sagemaker/latest/dg/algorithms-choose.html#basic-machine-learning-paradigms) in the Amazon SageMaker AI Developer Guide. In this tutorial, you solve a multi-class classification problem. The linear learner algorithm trains many models in parallel, and automatically determines the most optimized model. You use the CREATE MODEL operation in Amazon Redshift, which creates your linear learner model using SageMaker AI and sends the prediction function to Amazon Redshift. For more information about the linear learner algorithm, see the [Linear Learner Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) in the Amazon SageMaker AI Developer Guide.

You can use a CREATE MODEL command to export training data, train a model, import the model, and prepare an Amazon Redshift prediction function. Use the CREATE MODEL operation to specify training data either as a table or a SELECT statement.

Linear learner models optimize either continuous objectives or discrete objectives. Continuous objectives are used for regression, while discrete objectives are used for classification. Some methods provide a solution for only continuous objectives, such as the regression method. The linear learner algorithm provides an increase in speed over naive hyperparameter optimization techniques, such as the Naive Bayes technique. A naive optimization technique assumes that each input variable is independent. The linear learner algorithm trains many models in parallel and selects the most optimized model. A similar algorithm is XGBoost, which combines estimates from a set of simpler and weaker models to make predictions. To learn more about XGBoost, see [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) in the Amazon SageMaker AI Developer Guide.
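
If you want to try XGBoost instead of linear learner, the CREATE MODEL statement accepts `MODEL_TYPE XGBOOST` with the same overall syntax. The following is a minimal sketch with placeholder table, column, and function names; this tutorial itself uses `MODEL_TYPE LINEAR_LEARNER`.

```
-- Hedged sketch with placeholder names; not one of this tutorial's steps.
CREATE MODEL example_xgboost_model
FROM example_training_table
TARGET example_label
FUNCTION predict_example_label_xgb
IAM_ROLE default
MODEL_TYPE XGBOOST
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');
```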

To use the linear learner algorithm, you must provide columns representing the dimensions of the inputs, and rows representing the observations. For more information about the linear learner algorithm, see [Linear Learner Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) in the Amazon SageMaker AI Developer Guide.

In this tutorial, you build a linear learner model that predicts the types of cover for a given area. You use the CREATE MODEL command on the [Covertype dataset](https://archive.ics.uci.edu/ml/datasets/covertype) from the UCI Machine Learning Repository. Then, you use the prediction function created by the command to determine the types of cover in a wilderness area. A forest cover type is usually a type of tree. The inputs that Redshift ML will use to create the model include soil type, distance to roadways, and wilderness area designation. For more information about the dataset, see the [Covertype Dataset](https://archive.ics.uci.edu/ml/datasets/covertype) from the UCI Machine Learning Repository.

## Use case examples
<a name="tutorial_linear_learner_multi-class_classification_tasks"></a>

You can solve other multi-class classification problems with linear learner and Amazon Redshift ML, such as predicting the species of a plant from an image. You could also predict the quantity of a product that a customer will purchase.

**Tasks**
+ Prerequisites
+ Step 1: Load the data from Amazon S3 to Amazon Redshift
+ Step 2: Create the machine learning model
+ Step 3: Validate the model

## Prerequisites
<a name="tutorial_linear_learner_multi-class_classification_prereqs"></a>

To complete this tutorial, you must complete the [Administrative setup](https://docs.aws.amazon.com/redshift/latest/dg/admin-setup.html) for Amazon Redshift ML.

## Step 1: Load the data from Amazon S3 to Amazon Redshift
<a name="tutorial_linear_learner_multi-class_classification_step_load"></a>

Use the [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html) to run the following queries. These queries load the sample data into Redshift and divide the data into a training set and a validation set.

1. The following query creates the `covertype_data` table.

   ```
   CREATE TABLE public.covertype_data (
       elevation bigint ENCODE az64,
       aspect bigint ENCODE az64,
       slope bigint ENCODE az64,
       horizontal_distance_to_hydrology bigint ENCODE az64,
       vertical_distance_to_hydrology bigint ENCODE az64,
       horizontal_distance_to_roadways bigint ENCODE az64,
       hillshade_9am bigint ENCODE az64,
       hillshade_noon bigint ENCODE az64,
       hillshade_3pm bigint ENCODE az64,
       horizontal_distance_to_fire_points bigint ENCODE az64,
       wilderness_area1 bigint ENCODE az64,
       wilderness_area2 bigint ENCODE az64,
       wilderness_area3 bigint ENCODE az64,
       wilderness_area4 bigint ENCODE az64,
       soil_type1 bigint ENCODE az64,
       soil_type2 bigint ENCODE az64,
       soil_type3 bigint ENCODE az64,
       soil_type4 bigint ENCODE az64,
       soil_type5 bigint ENCODE az64,
       soil_type6 bigint ENCODE az64,
       soil_type7 bigint ENCODE az64,
       soil_type8 bigint ENCODE az64,
       soil_type9 bigint ENCODE az64,
       soil_type10 bigint ENCODE az64,
       soil_type11 bigint ENCODE az64,
       soil_type12 bigint ENCODE az64,
       soil_type13 bigint ENCODE az64,
       soil_type14 bigint ENCODE az64,
       soil_type15 bigint ENCODE az64,
       soil_type16 bigint ENCODE az64,
       soil_type17 bigint ENCODE az64,
       soil_type18 bigint ENCODE az64,
       soil_type19 bigint ENCODE az64,
       soil_type20 bigint ENCODE az64,
       soil_type21 bigint ENCODE az64,
       soil_type22 bigint ENCODE az64,
       soil_type23 bigint ENCODE az64,
       soil_type24 bigint ENCODE az64,
       soil_type25 bigint ENCODE az64,
       soil_type26 bigint ENCODE az64,
       soil_type27 bigint ENCODE az64,
       soil_type28 bigint ENCODE az64,
       soil_type29 bigint ENCODE az64,
       soil_type30 bigint ENCODE az64,
       soil_type31 bigint ENCODE az64,
       soil_type32 bigint ENCODE az64,
       soil_type33 bigint ENCODE az64,
       soil_type34 bigint ENCODE az64,
       soil_type35 bigint ENCODE az64,
       soil_type36 bigint ENCODE az64,
       soil_type37 bigint ENCODE az64,
       soil_type38 bigint ENCODE az64,
       soil_type39 bigint ENCODE az64,
       soil_type40 bigint ENCODE az64,
       cover_type bigint ENCODE az64
   ) DISTSTYLE AUTO;
   ```

1. The following query copies the sample data from the [Covertype dataset](https://archive.ics.uci.edu/ml/datasets/covertype) in Amazon S3 to the `covertype_data` table you created previously in Amazon Redshift.

   ```
   COPY public.covertype_data
   FROM
       's3://redshift-ml-multiclass/covtype.data.gz' IAM_ROLE DEFAULT gzip DELIMITER ',' REGION 'us-east-1';
   ```

1. By manually splitting the data, you can set aside an additional testing set to verify the accuracy of the model. The following query splits the data into three sets. The `covertype_training` table is for training, the `covertype_validation` table is for validation, and the `covertype_test` table is for testing your model. You use the training set to train your model and the validation set to validate the model’s development. Then, you use the testing set to measure the performance of the model and check whether it is overfitting or underfitting the dataset. An optional row-count check of the three sets is shown after the query.

   ```
   CREATE TABLE public.covertype_data_prep AS
   SELECT
       a.*,
       CAST (random() * 100 AS int) AS data_group_id
   FROM
       public.covertype_data a;
   
   --training dataset
    CREATE TABLE public.covertype_training as
   SELECT
       *
   FROM
       public.covertype_data_prep
   WHERE
       data_group_id < 80;
   
   --validation dataset
    CREATE TABLE public.covertype_validation AS
   SELECT
       *
   FROM
       public.covertype_data_prep
   WHERE
       data_group_id BETWEEN 80
       AND 89;
   
   --test dataset
    CREATE TABLE public.covertype_test AS
   SELECT
       *
   FROM
       public.covertype_data_prep
   WHERE
       data_group_id > 89;
   ```
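
   Optionally, you can check how the rows were distributed. Because `data_group_id` is assigned randomly, the following minimal sketch should show that roughly 80 percent of the rows are in the training table and about 10 percent are in each of the other two tables.

   ```
   -- Row counts for the approximate 80/10/10 split.
   SELECT 'training' AS dataset, COUNT(*) AS row_count FROM public.covertype_training
   UNION ALL
   SELECT 'validation', COUNT(*) FROM public.covertype_validation
   UNION ALL
   SELECT 'test', COUNT(*) FROM public.covertype_test;
   ```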

## Step 2: Create the machine learning model
<a name="tutorial_linear_learner_multi-class_classification_step_create_model"></a>

In this step, you use the CREATE MODEL statement to create your machine learning model with the linear learner algorithm. 

The following query creates the linear learner model with the CREATE MODEL operation using your S3 bucket. Replace amzn-s3-demo-bucket with your own S3 bucket.

```
CREATE MODEL forest_cover_type_model
FROM
    (
        SELECT
            Elevation,
            Aspect,
            Slope,
            Horizontal_distance_to_hydrology,
            Vertical_distance_to_hydrology,
            Horizontal_distance_to_roadways,
            Hillshade_9am,
            Hillshade_noon,
            Hillshade_3pm,
            Horizontal_Distance_To_Fire_Points,
            Wilderness_Area1,
            Wilderness_Area2,
            Wilderness_Area3,
            Wilderness_Area4,
            soil_type1,
            Soil_Type2,
            Soil_Type3,
            Soil_Type4,
            Soil_Type5,
            Soil_Type6,
            Soil_Type7,
            Soil_Type8,
            Soil_Type9,
            Soil_Type10,
            Soil_Type11,
            Soil_Type12,
            Soil_Type13,
            Soil_Type14,
            Soil_Type15,
            Soil_Type16,
            Soil_Type17,
            Soil_Type18,
            Soil_Type19,
            Soil_Type20,
            Soil_Type21,
            Soil_Type22,
            Soil_Type23,
            Soil_Type24,
            Soil_Type25,
            Soil_Type26,
            Soil_Type27,
            Soil_Type28,
            Soil_Type29,
            Soil_Type30,
            Soil_Type31,
            Soil_Type32,
            Soil_Type33,
            Soil_Type34,
            Soil_Type36,
            Soil_Type37,
            Soil_Type38,
            Soil_Type39,
            Soil_Type40,
            Cover_type
        from
            public.covertype_training
    ) TARGET cover_type FUNCTION predict_cover_type IAM_ROLE default MODEL_TYPE LINEAR_LEARNER PROBLEM_TYPE MULTICLASS_CLASSIFICATION OBJECTIVE 'Accuracy' SETTINGS (
        S3_BUCKET 'amzn-s3-demo-bucket',
        S3_GARBAGE_COLLECT OFF,
        MAX_RUNTIME 15000
    );
```

### Show the status of model training (optional)
<a name="tutorial_linear_learner_multi-class_classification_show_status"></a>

You can use the SHOW MODEL command to check when your model is ready.

Use the following query to monitor the progress of the model training.

```
SHOW MODEL forest_cover_type_model;
```

When the model is ready, the output of the previous operation should look similar to the following example. Note that the output provides the `validation:multiclass_accuracy` metric, which you can see on the right-hand side of the following example. Multi-class accuracy measures the percentage of data points that the model classifies correctly. You will use multi-class accuracy to validate the accuracy of the model in the next step.

```
+--------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|              Key               |                                                                                                                                                                                                                                                                                                                                                                                                             Value                                                                                                                                                                                                                                                                                                                                                                                                              |
+--------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model Name                     | forest_cover_type_model                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Schema Name                    | public                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| Owner                          | awsuser                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Creation Time                  | Tue, 12.07.2022 20:24:32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| Model State                    | READY                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| validation:multiclass_accuracy |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0.724952 |
| Estimated Cost                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       5.341750 |
|                                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| TRAINING DATA:                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Query                          | SELECT ELEVATION, ASPECT, SLOPE, HORIZONTAL_DISTANCE_TO_HYDROLOGY, VERTICAL_DISTANCE_TO_HYDROLOGY, HORIZONTAL_DISTANCE_TO_ROADWAYS, HILLSHADE_9AM, HILLSHADE_NOON, HILLSHADE_3PM , HORIZONTAL_DISTANCE_TO_FIRE_POINTS, WILDERNESS_AREA1, WILDERNESS_AREA2, WILDERNESS_AREA3, WILDERNESS_AREA4, SOIL_TYPE1, SOIL_TYPE2, SOIL_TYPE3, SOIL_TYPE4, SOIL_TYPE5, SOIL_TYPE6, SOIL_TYPE7, SOIL_TYPE8, SOIL_TYPE9, SOIL_TYPE10 , SOIL_TYPE11, SOIL_TYPE12 , SOIL_TYPE13 , SOIL_TYPE14, SOIL_TYPE15, SOIL_TYPE16, SOIL_TYPE17, SOIL_TYPE18, SOIL_TYPE19, SOIL_TYPE20, SOIL_TYPE21, SOIL_TYPE22, SOIL_TYPE23, SOIL_TYPE24, SOIL_TYPE25, SOIL_TYPE26, SOIL_TYPE27, SOIL_TYPE28, SOIL_TYPE29, SOIL_TYPE30, SOIL_TYPE31, SOIL_TYPE32, SOIL_TYPE33, SOIL_TYPE34, SOIL_TYPE36, SOIL_TYPE37, SOIL_TYPE38, SOIL_TYPE39, SOIL_TYPE40, COVER_TYPE |
|                                | FROM PUBLIC.COVERTYPE_TRAINING                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| Target Column                  | COVER_TYPE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|                                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| PARAMETERS:                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Model Type                     | linear_learner                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| Problem Type                   | MulticlassClassification                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| Objective                      | Accuracy                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| AutoML Job Name                | redshiftml-20220712202432187659                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Function Name                  | predict_cover_type                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| Function Parameters            | elevation aspect slope horizontal_distance_to_hydrology vertical_distance_to_hydrology horizontal_distance_to_roadways hillshade_9am hillshade_noon hillshade_3pm horizontal_distance_to_fire_points wilderness_area1 wilderness_area2 wilderness_area3 wilderness_area4 soil_type1 soil_type2 soil_type3 soil_type4 soil_type5 soil_type6 soil_type7 soil_type8 soil_type9 soil_type10 soil_type11 soil_type12 soil_type13 soil_type14 soil_type15 soil_type16 soil_type17 soil_type18 soil_type19 soil_type20 soil_type21 soil_type22 soil_type23 soil_type24 soil_type25 soil_type26 soil_type27 soil_type28 soil_type29 soil_type30 soil_type31 soil_type32 soil_type33 soil_type34 soil_type36 soil_type37 soil_type38 soil_type39 soil_type40                                                                            |
| Function Parameter Types       | int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8 int8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| IAM Role                       | default-aws-iam-role                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| S3 Bucket                      | amzn-s3-demo-bucket                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| Max Runtime                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          15000 |
+--------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

## Step 3: Validate the model
<a name="tutorial_linear_learner_multi-class_classification_step_validate"></a>

1. The following prediction query validates the accuracy of the model on the `covertype_validation` dataset by calculating multi-class accuracy. Multi-class accuracy is the percentage of the model’s predictions that are correct.

   ```
   SELECT
       CAST(sum(t1.match) AS decimal(7, 2)) AS predicted_matches,
       CAST(sum(t1.nonmatch) AS decimal(7, 2)) AS predicted_non_matches,
       CAST(sum(t1.match + t1.nonmatch) AS decimal(7, 2)) AS total_predictions,
       predicted_matches / total_predictions AS pct_accuracy
   FROM
       (
           SELECT
               Elevation,
               Aspect,
               Slope,
               Horizontal_distance_to_hydrology,
               Vertical_distance_to_hydrology,
               Horizontal_distance_to_roadways,
               HIllshade_9am,
               Hillshade_noon,
               Hillshade_3pm,
               Horizontal_Distance_To_Fire_Points,
               Wilderness_Area1,
               Wilderness_Area2,
               Wilderness_Area3,
               Wilderness_Area4,
               soil_type1,
               Soil_Type2,
               Soil_Type3,
               Soil_Type4,
               Soil_Type5,
               Soil_Type6,
               Soil_Type7,
               Soil_Type8,
               Soil_Type9,
               Soil_Type10,
               Soil_Type11,
               Soil_Type12,
               Soil_Type13,
               Soil_Type14,
               Soil_Type15,
               Soil_Type16,
               Soil_Type17,
               Soil_Type18,
               Soil_Type19,
               Soil_Type20,
               Soil_Type21,
               Soil_Type22,
               Soil_Type23,
               Soil_Type24,
               Soil_Type25,
               Soil_Type26,
               Soil_Type27,
               Soil_Type28,
               Soil_Type29,
               Soil_Type30,
               Soil_Type31,
               Soil_Type32,
               Soil_Type33,
               Soil_Type34,
               Soil_Type36,
               Soil_Type37,
               Soil_Type38,
               Soil_Type39,
               Soil_Type40,
               Cover_type AS actual_cover_type,
               predict_cover_type(
                   Elevation,
                   Aspect,
                   Slope,
                   Horizontal_distance_to_hydrology,
                   Vertical_distance_to_hydrology,
                   Horizontal_distance_to_roadways,
                   HIllshade_9am,
                   Hillshade_noon,
                   Hillshade_3pm,
                   Horizontal_Distance_To_Fire_Points,
                   Wilderness_Area1,
                   Wilderness_Area2,
                   Wilderness_Area3,
                   Wilderness_Area4,
                   soil_type1,
                   Soil_Type2,
                   Soil_Type3,
                   Soil_Type4,
                   Soil_Type5,
                   Soil_Type6,
                   Soil_Type7,
                   Soil_Type8,
                   Soil_Type9,
                   Soil_Type10,
                   Soil_Type11,
                   Soil_Type12,
                   Soil_Type13,
                   Soil_Type14,
                   Soil_Type15,
                   Soil_Type16,
                   Soil_Type17,
                   Soil_Type18,
                   Soil_Type19,
                   Soil_Type20,
                   Soil_Type21,
                   Soil_Type22,
                   Soil_Type23,
                   Soil_Type24,
                   Soil_Type25,
                   Soil_Type26,
                   Soil_Type27,
                   Soil_Type28,
                   Soil_Type29,
                   Soil_Type30,
                   Soil_Type31,
                   Soil_Type32,
                   Soil_Type33,
                   Soil_Type34,
                   Soil_Type36,
                   Soil_Type37,
                   Soil_Type38,
                   Soil_Type39,
                   Soil_Type40
               ) AS predicted_cover_type,
               CASE
                   WHEN actual_cover_type = predicted_cover_type THEN 1
                   ELSE 0
               END AS match,
               CASE
                   WHEN actual_cover_type <> predicted_cover_type THEN 1
                   ELSE 0
               END AS nonmatch
           FROM
               public.covertype_validation
       ) t1;
   ```

   The output of the previous query should look like the following example. The value of the multi-class accuracy metric should be similar to the `validation:multiclass_accuracy` metric shown by the SHOW MODEL operation’s output.

   ```
   +-------------------+-----------------------+-------------------+--------------+
   | predicted_matches | predicted_non_matches | total_predictions | pct_accuracy |
   +-------------------+-----------------------+-------------------+--------------+
   |             41211 |                 16324 |             57535 |   0.71627704 |
   +-------------------+-----------------------+-------------------+--------------+
   ```

1. The following query predicts the most common cover type for `wilderness_area2`. This dataset includes four wilderness areas and seven cover types. A wilderness area can have multiple cover types.

   ```
   SELECT t1.predicted_cover_type, COUNT(*)
   FROM 
   (     
   SELECT 
      Elevation, 
      Aspect,  
      Slope,  
      Horizontal_distance_to_hydrology,  
      Vertical_distance_to_hydrology,   
      Horizontal_distance_to_roadways,  
      HIllshade_9am,  
      Hillshade_noon,  
      Hillshade_3pm , 
      Horizontal_Distance_To_Fire_Points,  
      Wilderness_Area1,  
      Wilderness_Area2,   
      Wilderness_Area3,  
      Wilderness_Area4,  
      soil_type1,
      Soil_Type2,
      Soil_Type3, 
      Soil_Type4, 
      Soil_Type5,  
      Soil_Type6, 
      Soil_Type7,  
      Soil_Type8, 
      Soil_Type9, 
      Soil_Type10 , 
      Soil_Type11,  
      Soil_Type12 , 
      Soil_Type13 ,
      Soil_Type14, 
      Soil_Type15, 
      Soil_Type16,  
      Soil_Type17, 
      Soil_Type18,  
      Soil_Type19,  
      Soil_Type20,  
      Soil_Type21,  
      Soil_Type22,  
      Soil_Type23,  
      Soil_Type24, 
      Soil_Type25,  
      Soil_Type26, 
      Soil_Type27,  
      Soil_Type28,  
      Soil_Type29,  
      Soil_Type30,  
      Soil_Type31, 
      Soil_Type32, 
      Soil_Type33,  
      Soil_Type34, 
      Soil_Type36, 
      Soil_Type37,  
      Soil_Type38,  
      Soil_Type39, 
      Soil_Type40,
      predict_cover_type( Elevation, 
      Aspect,  
      Slope,  
      Horizontal_distance_to_hydrology,  
      Vertical_distance_to_hydrology,   
      Horizontal_distance_to_roadways,  
      HIllshade_9am,  
      Hillshade_noon,  
      Hillshade_3pm , 
      Horizontal_Distance_To_Fire_Points,  
      Wilderness_Area1,  
      Wilderness_Area2,   
      Wilderness_Area3,  
      Wilderness_Area4,  
      soil_type1,
      Soil_Type2,
      Soil_Type3, 
      Soil_Type4, 
      Soil_Type5,  
      Soil_Type6, 
      Soil_Type7,  
      Soil_Type8, 
      Soil_Type9, 
      Soil_Type10,
      Soil_Type11,  
      Soil_Type12, 
      Soil_Type13,
      Soil_Type14, 
      Soil_Type15, 
      Soil_Type16,  
      Soil_Type17, 
      Soil_Type18,  
      Soil_Type19,  
      Soil_Type20,  
      Soil_Type21,  
      Soil_Type22,  
      Soil_Type23,  
      Soil_Type24, 
      Soil_Type25,  
      Soil_Type26, 
      Soil_Type27,  
      Soil_Type28,  
      Soil_Type29,  
      Soil_Type30,  
      Soil_Type31, 
      Soil_Type32, 
      Soil_Type33,  
      Soil_Type34, 
      Soil_Type36, 
      Soil_Type37,  
      Soil_Type38,  
      Soil_Type39, 
      Soil_Type40) AS predicted_cover_type 
   
   FROM public.covertype_test
   WHERE wilderness_area2 = 1)
   t1
   GROUP BY 1;
   ```

   The output of the previous operation should look similar to the following example. This output shows that the model predicted cover type 1 for the majority of the area, with smaller amounts of cover types 2 and 7.

   ```
   +----------------------+-------+
   | predicted_cover_type | count |
   +----------------------+-------+
   |                    2 |   564 |
   |                    7 |    97 |
   |                    1 |  2309 |
   +----------------------+-------+
   ```

1. The following query finds the most frequently predicted cover type within any single wilderness area. The query displays the number of predictions for that cover type and the wilderness area where it occurs.

   ```
   SELECT t1.predicted_cover_type, COUNT(*), wilderness_area
   FROM 
   (     
   SELECT 
      Elevation, 
      Aspect,  
      Slope,  
      Horizontal_distance_to_hydrology,  
      Vertical_distance_to_hydrology,   
      Horizontal_distance_to_roadways,  
      HIllshade_9am,  
      Hillshade_noon,  
      Hillshade_3pm , 
      Horizontal_Distance_To_Fire_Points,  
      Wilderness_Area1,  
      Wilderness_Area2,   
      Wilderness_Area3,  
      Wilderness_Area4,  
      soil_type1,
      Soil_Type2,
      Soil_Type3, 
      Soil_Type4, 
      Soil_Type5,  
      Soil_Type6, 
      Soil_Type7,  
      Soil_Type8, 
      Soil_Type9, 
      Soil_Type10 , 
      Soil_Type11,  
      Soil_Type12 , 
      Soil_Type13 ,
      Soil_Type14, 
      Soil_Type15, 
      Soil_Type16,  
      Soil_Type17, 
      Soil_Type18,  
      Soil_Type19,  
      Soil_Type20,  
      Soil_Type21,  
      Soil_Type22,  
      Soil_Type23,  
      Soil_Type24, 
      Soil_Type25,  
      Soil_Type26, 
      Soil_Type27,  
      Soil_Type28,  
      Soil_Type29,  
      Soil_Type30,  
      Soil_Type31, 
      Soil_Type32, 
      Soil_Type33,  
      Soil_Type34, 
      Soil_Type36, 
      Soil_Type37,  
      Soil_Type38,  
      Soil_Type39, 
      Soil_Type40,
      predict_cover_type( Elevation, 
      Aspect,  
      Slope,  
      Horizontal_distance_to_hydrology,  
      Vertical_distance_to_hydrology,   
      Horizontal_distance_to_roadways,  
      HIllshade_9am,  
      Hillshade_noon,  
      Hillshade_3pm , 
      Horizontal_Distance_To_Fire_Points,  
      Wilderness_Area1,  
      Wilderness_Area2,   
      Wilderness_Area3,  
      Wilderness_Area4,  
      soil_type1,
      Soil_Type2,
      Soil_Type3, 
      Soil_Type4, 
      Soil_Type5,  
      Soil_Type6, 
      Soil_Type7,  
      Soil_Type8, 
      Soil_Type9, 
      Soil_Type10,
      Soil_Type11,  
      Soil_Type12, 
      Soil_Type13,
      Soil_Type14, 
      Soil_Type15, 
      Soil_Type16,  
      Soil_Type17, 
      Soil_Type18,  
      Soil_Type19,  
      Soil_Type20,  
      Soil_Type21,  
      Soil_Type22,  
      Soil_Type23,  
      Soil_Type24, 
      Soil_Type25,  
      Soil_Type26, 
      Soil_Type27,  
      Soil_Type28,  
      Soil_Type29,  
      Soil_Type30,  
      Soil_Type31, 
      Soil_Type32, 
      Soil_Type33,  
      Soil_Type34, 
      Soil_Type36, 
      Soil_Type37,  
      Soil_Type38,  
      Soil_Type39, 
      Soil_Type40) AS predicted_cover_type,
      CASE WHEN Wilderness_Area1 = 1 THEN 1
           WHEN Wilderness_Area2 = 1 THEN 2
           WHEN Wilderness_Area3 = 1 THEN 3
           WHEN Wilderness_Area4 = 1 THEN 4
           ELSE 0
      END AS wilderness_area
   
   FROM public.covertype_test)
   t1
   GROUP BY 1, 3
   ORDER BY 2 DESC
   LIMIT 1;
   ```

   The output of the previous operation should look similar to the following example.

   ```
   +----------------------+-------+-----------------+
   | predicted_cover_type | count | wilderness_area |
   +----------------------+-------+-----------------+
   |                    2 | 15738 |               1 |
   +----------------------+-------+-----------------+
   ```

## Related topics
<a name="tutorial_linear_learner_multi-class_classification_related_topics"></a>

For more information about Amazon Redshift ML, see the following documentation:
+ [Costs for using Amazon Redshift ML](https://docs.aws.amazon.com/redshift/latest/dg/cost.html)
+ [CREATE MODEL operation](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html)
+ [EXPLAIN_MODEL function](https://docs.aws.amazon.com/redshift/latest/dg/r_explain_model_function.html)

For more information about machine learning, see the following documentation:
+ [Machine learning overview](https://docs.aws.amazon.com/redshift/latest/dg/machine_learning_overview.html)
+ [Machine learning for novices and experts](https://docs.aws.amazon.com/redshift/latest/dg/novice_expert.html)
+ [What Is Fairness and Model Explainability for Machine Learning Predictions?](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html)

# Amazon Redshift ML integration with Amazon Bedrock
<a name="machine-learning-br"></a>

This section describes how to use Amazon Redshift ML integration with Amazon Bedrock. With this feature, you can invoke an Amazon Bedrock model using SQL, and you can use your data from an Amazon Redshift data warehouse to build generative AI applications such as text generation, sentiment analysis, or translation.

**Topics**
+ [Creating or updating an IAM role for Amazon Redshift ML integration with Amazon Bedrock](#machine-learning-br-iam)
+ [Creating an external model for Amazon Redshift ML integration with Amazon Bedrock](#machine-learning-br-create)
+ [Using an external model for Amazon Redshift ML integration with Amazon Bedrock](#machine-learning-br-use)
+ [Prompt engineering for Amazon Redshift ML integration with Amazon Bedrock](#machine-learning-br-prompt)

## Creating or updating an IAM role for Amazon Redshift ML integration with Amazon Bedrock
<a name="machine-learning-br-iam"></a>

This section demonstrates how to create an IAM role to use with Amazon Redshift ML integration with Amazon Bedrock.

Add the following policy to the IAM role you use with Amazon Redshift ML integration with Amazon Bedrock: 
+ `AmazonBedrockFullAccess`

To allow Amazon Redshift to assume a role to interact with other services, add the following trust policy to the IAM role:

------
#### [ JSON ]


```
{
   "Version":"2012-10-17",		 	 	 
   "Statement": [
      {
         "Effect": "Allow",
         "Principal": {
            "Service": [
               "redshift.amazonaws.com"
            ]
         },
         "Action": "sts:AssumeRole"
      }
   ]
}
```

------

If the cluster or namespace is in a VPC, follow the instructions in [Cluster and configure setup for Amazon Redshift ML administration](getting-started-machine-learning.md#admin-setup). 

If you need a more restrictive policy, you can create one that includes only the Amazon Bedrock permissions specified in the following pages (a minimal example policy follows this list):
+ [Cluster and configure setup for Amazon Redshift ML administration](getting-started-machine-learning.md#admin-setup)
+ [Permissions required to use Amazon Redshift machine learning (ML)](https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-iam-access-control-identity-based.html#iam-permission-ml)
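
For example, the following is a minimal sketch of a more restrictive identity-based policy that allows only model invocation in Amazon Bedrock. The actions and resource shown here are illustrative; confirm the exact permissions you need against the pages listed above.

```
{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Allow",
         "Action": [
            "bedrock:InvokeModel",
            "bedrock:InvokeModelWithResponseStream"
         ],
         "Resource": "arn:aws:bedrock:*::foundation-model/*"
      }
   ]
}
```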

For information about creating an IAM role, see [IAM Role Creation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html) in the *AWS Identity and Access Management User Guide*.

## Creating an external model for Amazon Redshift ML integration with Amazon Bedrock
<a name="machine-learning-br-create"></a>

This section shows how to create an external model to use as an interface for Amazon Bedrock within your Amazon Redshift data warehouse.

To invoke an Amazon Bedrock model from Amazon Redshift, you must first run the `CREATE EXTERNAL MODEL` command. This command creates an external model object in the database, and an associated user function that you use to generate text content with Amazon Bedrock.

The following code example shows a basic `CREATE EXTERNAL MODEL` command:

```
CREATE EXTERNAL MODEL llm_claude
FUNCTION llm_claude_func
IAM_ROLE '<IAM role arn>'
MODEL_TYPE BEDROCK
SETTINGS (
   MODEL_ID 'anthropic.claude-v2:1',
   PROMPT 'Summarize the following text:');
```

The `CREATE EXTERNAL MODEL` command provides a unified, consistent interface to Amazon Bedrock for all foundation models (FMs) that support messages. This unified interface is the default when you use the `CREATE EXTERNAL MODEL` command, and it is also used when you explicitly set the request type to `UNIFIED`. For more information, see the [Converse API documentation](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_Converse.html) in the *Amazon Bedrock API Reference*.

If an FM doesn't support messages, then you must set the `REQUEST_TYPE` setting to `RAW`, as the following example demonstrates. When you set `REQUEST_TYPE` to `RAW`, you must construct the request that is sent to Amazon Bedrock yourself when you use the inference function, in the format that the selected FM expects. Make sure that you enable access to the Titan Text G1 – Express model in Amazon Bedrock before running the following example.

```
CREATE EXTERNAL MODEL titan_raw
FUNCTION func_titan_raw
IAM_ROLE '<IAM role arn>'
MODEL_TYPE BEDROCK
SETTINGS (
   MODEL_ID 'amazon.titan-text-express-v1',
   REQUEST_TYPE RAW,
   RESPONSE_TYPE SUPER);
```

If you need more information about a request, such as the total number of tokens used, you can set `RESPONSE_TYPE` to `SUPER` when you create the model.

```
CREATE EXTERNAL MODEL patient_recommendations_v2
FUNCTION patient_recommendations_func_v2
IAM_ROLE '<IAM role arn>'
MODEL_TYPE BEDROCK
SETTINGS (
   MODEL_ID 'anthropic.claude-v2',
   PROMPT 'Generate personalized diet plan for following patient:',
   RESPONSE_TYPE SUPER);
```
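
With `RESPONSE_TYPE` set to `SUPER`, the inference function returns the full response object rather than only the generated text, so you can inspect metadata such as token counts by navigating the SUPER value. The following is a minimal sketch; the table name, column name, and response field paths are assumptions that depend on the model's response shape.

```
-- Store the full SUPER responses, then inspect them with PartiQL-style
-- navigation (for example, fields under response.usage, if present).
CREATE TABLE patient_plan_responses AS
SELECT patient_id,
       patient_recommendations_func_v2(patient_notes) AS response
FROM patient_notes;
```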

The `PROMPT` parameter for the `CREATE EXTERNAL MODEL` command is a static prompt. If you need a dynamic prompt for your application, you must specify it when using the inference function. For more details, see [Prompt engineering for Amazon Redshift ML integration with Amazon Bedrock](#machine-learning-br-prompt).

For more information about the `CREATE EXTERNAL MODEL` statement and its parameters and settings, see [CREATE EXTERNAL MODEL](r_create_external_model.md).

## Using an external model for Amazon Redshift ML integration with Amazon Bedrock
<a name="machine-learning-br-use"></a>

This section shows how to invoke an external model to generate text in response to provided prompts. To invoke an external model, use the inference function that you create with `CREATE EXTERNAL MODEL`.

**Topics**
+ [Inference with `UNIFIED` request type models](#machine-learning-br-use-unified)
+ [Inference with `RAW` request type models](#machine-learning-br-use-raw)
+ [Inference functions as leader-only functions](#machine-learning-br-use-leader)
+ [Inference function usage notes](#machine-learning-br-use-usage)

### Inference with `UNIFIED` request type models
<a name="machine-learning-br-use-unified"></a>

The inference function for models with request type `UNIFIED` has the following three parameters that are passed to the function in order:
+ **Input text** (required): This parameter specifies the input text that Amazon Redshift passes to Amazon Bedrock.
+ **Inference configuration** and **Additional model request fields** (optional): Amazon Redshift passes these parameters to the corresponding parameters for the Converse model API. 

The following code example shows how to use a `UNIFIED` type inference function:

```
SELECT llm_claude_func(input_text, object('temperature', 0.7, 'maxTokens', 500))
   FROM some_data;
```
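
If the model accepts request fields beyond the standard inference configuration, you can pass them as the third parameter. The following is a minimal sketch; `top_k` is an Anthropic-specific field and is included here only as an illustrative assumption.

```
SELECT llm_claude_func(
    input_text,
    object('temperature', 0.7, 'maxTokens', 500),
    object('top_k', 250)
)
FROM some_data;
```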

### Inference with `RAW` request type models
<a name="machine-learning-br-use-raw"></a>

The inference function for models with request type `RAW` has only one parameter of data type `SUPER`. The syntax of this parameter depends on the Amazon Bedrock model used.

The following code example shows how to use a `RAW` type inference function:

```
SELECT llm_titan_func(
    object(
        'inputText', 'Summarize the following text: ' || input_text,
        'textGenerationConfig', object('temperature', 0.5, 'maxTokenCount', 500)
    )
)
FROM some_data;
```

### Inference functions as leader-only functions
<a name="machine-learning-br-use-leader"></a>

Inference functions for Amazon Bedrock models can run as leader node-only functions when the query that uses them doesn't reference any tables. This can be helpful if you want to quickly ask an LLM a question.

The following code example shows how to use a leader-only inference function:

```
SELECT general_titan_llm_func('Summarize the benefits of LLM on data analytics in 100 words');
```

### Inference function usage notes
<a name="machine-learning-br-use-usage"></a>

Note the following when using inference functions with Amazon Redshift ML integration with Amazon Bedrock:
+ The names of the parameters for all Amazon Bedrock models are case sensitive. If your parameters do not match the ones required by the model, Amazon Bedrock might quietly ignore them.
+ The throughput of inference queries is limited by the runtime quotas of the different models offered by Amazon Bedrock in different regions. For more information, see [Quotas for Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html) in the *Amazon Bedrock User Guide*.
+ If you need guaranteed and consistent throughput, consider getting provisioned throughput for the model you need from Amazon Bedrock. For more information, see [ Increase model invocation capacity with Provisioned Throughput in Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html) in the *Amazon Bedrock User Guide*.
+ Inference queries with large amounts of data might get throttling exceptions because of the limited runtime quotas for Amazon Bedrock. Amazon Redshift retries requests multiple times, but queries can still get throttled because throughput for non-provisioned models might be variable. To reduce the chance of throttling, you can process your data in smaller batches, as shown in the sketch after this list.
+ If you encounter throttling exceptions coming from Amazon Bedrock such as `Too many requests, please wait before trying again.` even with small amounts of data, check the quotas under **Service Quotas** in your Amazon Bedrock account. Check that the applied account-level quota is at least the same as the AWS default quota value for the **InvokeModel** requests for the model you are using.
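
The following is a minimal sketch of batching inference work to stay within Amazon Bedrock runtime quotas by materializing results for a limited range of rows at a time. The table and column names are illustrative assumptions.

```
-- Process one batch of rows and store the generated summaries; repeat
-- with the next range of IDs to work through the table gradually.
CREATE TABLE review_summaries AS
SELECT review_id,
       llm_claude_func(review_text) AS summary
FROM product_reviews
WHERE review_id BETWEEN 1 AND 1000;
```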

## Prompt engineering for Amazon Redshift ML integration with Amazon Bedrock
<a name="machine-learning-br-prompt"></a>

This section shows how to use static prompts with an external model.

To use static prefix and suffix prompts for your external model, provide them using the `PROMPT` and `SUFFIX` parameters of the `CREATE EXTERNAL MODEL` statement. These prompts are added to every query using the external model.

The following example shows how to add prefix and suffix prompts to an external model:

```
CREATE EXTERNAL MODEL llm_claude
FUNCTION llm_claude_func
IAM_ROLE '<IAM role arn>'
MODEL_TYPE BEDROCK
SETTINGS (
   MODEL_ID 'anthropic.claude-v2:1',
   PROMPT 'Summarize the following text:',
   SUFFIX 'Respond in an analytic tone');
```
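
With the preceding definition, the prefix and suffix wrap every input automatically. For example, the following call (the table and column are illustrative) effectively sends `Summarize the following text: <review_text> Respond in an analytic tone` to the model.

```
SELECT llm_claude_func(review_text)
FROM product_reviews;
```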

To use dynamic prompts, provide them when you call the inference function by concatenating them with the function input. The following example shows how to use dynamic prompts with an inference function:

```
SELECT llm_claude_func('Summarize the following review: ' || input_text || ' The review should have a formal tone.')
FROM some_data;
```