Fraud detection with Feature Store example notebook
Important
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see Provide permissions for tagging SageMaker AI resources.
AWS managed policies for Amazon SageMaker AI that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.
The example code on this page refers to the example notebook: Fraud Detection with Amazon SageMaker Feature Store.
Use one of the following options to clone the aws/amazon-sagemaker-examples GitHub repository:
- For Studio Classic: First, launch Studio Classic. You can open Studio Classic if Studio or Studio Classic is enabled as your default experience. To open Studio Classic, see Launch Studio Classic Using the Amazon SageMaker AI Console. Then, clone the aws/amazon-sagemaker-examples GitHub repository to Studio Classic by following the steps in Clone a Git Repository in SageMaker Studio Classic.
- For Amazon SageMaker notebook instances: First, launch a SageMaker notebook instance by following the instructions in Access Notebook Instances. Then, check if the examples are already in your notebooks by following the instructions in Access example notebooks. If not, follow the instructions in Add a Git repository to your Amazon SageMaker AI account.
Now that you have the SageMaker AI example notebooks, navigate to the amazon-sagemaker-examples/sagemaker-featurestore directory and open the Fraud Detection with Amazon SageMaker Feature Store example notebook.
Step 1: Set up your Feature Store session
To start using Feature Store, create a SageMaker AI session, a Boto3 session, and a Feature Store session. Also, set up the Amazon S3 bucket that you want to use for your features; this bucket is your offline store. The following code uses the SageMaker AI default bucket and adds a custom prefix to it.
Note
The role that you use to run the notebook must have the following managed policies attached to it: AmazonSageMakerFullAccess and AmazonSageMakerFeatureStoreAccess. For information about adding policies to your IAM role, see Adding policies to your IAM role.
import boto3
import sagemaker
from sagemaker.session import Session

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
boto_session = boto3.Session(region_name=region)
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker-featurestore'
offline_feature_store_bucket = 's3://{}/{}'.format(default_bucket, prefix)

sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)
Step 2: Load datasets and partition data into feature groups
Load your data into data frames for each of your features. You use these data frames after you set up the feature group. In the fraud detection example, you can see these steps in the following code.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import io

s3_client = boto3.client(service_name='s3', region_name=region)
fraud_detection_bucket_name = 'sagemaker-featurestore-fraud-detection'
identity_file_key = 'sampled_identity.csv'
transaction_file_key = 'sampled_transactions.csv'

identity_data_object = s3_client.get_object(Bucket=fraud_detection_bucket_name, Key=identity_file_key)
transaction_data_object = s3_client.get_object(Bucket=fraud_detection_bucket_name, Key=transaction_file_key)

identity_data = pd.read_csv(io.BytesIO(identity_data_object['Body'].read()))
transaction_data = pd.read_csv(io.BytesIO(transaction_data_object['Body'].read()))

identity_data = identity_data.round(5)
transaction_data = transaction_data.round(5)
identity_data = identity_data.fillna(0)
transaction_data = transaction_data.fillna(0)

# Feature transformations for this dataset are applied before ingestion into Feature Store.
# One-hot encode card4, card6
encoded_card_bank = pd.get_dummies(transaction_data['card4'], prefix='card_bank')
encoded_card_type = pd.get_dummies(transaction_data['card6'], prefix='card_type')
transformed_transaction_data = pd.concat([transaction_data, encoded_card_type, encoded_card_bank], axis=1)
transformed_transaction_data = transformed_transaction_data.rename(columns={"card_bank_american express": "card_bank_american_express"})
Step 3: Set up feature groups
When you set up your feature groups, you need to give each feature group a unique name and set it up with the FeatureGroup class.
from sagemaker.feature_store.feature_group import FeatureGroup

feature_group_name = "some string for a name"
feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=feature_store_session)
For example, in the fraud detection example, the two feature groups are identity and transaction. In the following code, you can see how the names are customized with a timestamp, and then each group is set up by passing in the name and the session.
import time
from time import gmtime, strftime, sleep
from sagemaker.feature_store.feature_group import FeatureGroup

identity_feature_group_name = 'identity-feature-group-' + strftime('%d-%H-%M-%S', gmtime())
transaction_feature_group_name = 'transaction-feature-group-' + strftime('%d-%H-%M-%S', gmtime())

identity_feature_group = FeatureGroup(name=identity_feature_group_name, sagemaker_session=feature_store_session)
transaction_feature_group = FeatureGroup(name=transaction_feature_group_name, sagemaker_session=feature_store_session)
Step 4: Set up record identifier and event time features
In this step, you specify a record identifier name and an event time feature name. These names map to columns in your data. For example, in the fraud detection example, the record identifier column of interest is TransactionID. An EventTime feature can be appended to your data when no timestamp is available. In the following code, you can see how these variables are set, and then EventTime is appended to both datasets.
record_identifier_name = "TransactionID"
event_time_feature_name = "EventTime"

current_time_sec = int(round(time.time()))

# Append an EventTime feature to each dataset
identity_data[event_time_feature_name] = pd.Series([current_time_sec] * len(identity_data), dtype="float64")
transformed_transaction_data[event_time_feature_name] = pd.Series([current_time_sec] * len(transformed_transaction_data), dtype="float64")
Step 5: Load feature definitions
You can now load the feature definitions by passing a data frame containing the feature data. In the following code for the fraud detection example, the identity feature and transaction feature are each loaded by using load_feature_definitions, and this function automatically detects the data type of each column of data. For developers using a schema rather than automatic detection, see the Export Feature Groups from Data Wrangler example for code that shows how to load the schema, map it, and add it as a FeatureDefinition that you can use to create the FeatureGroup. This example also covers an AWS SDK for Python (Boto3) implementation, which you can use instead of the SageMaker Python SDK.
identity_feature_group.load_feature_definitions(data_frame=identity_data); # output is suppressed
transaction_feature_group.load_feature_definitions(data_frame=transformed_transaction_data); # output is suppressed
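After loading, each FeatureGroup object holds the detected definitions in its feature_definitions attribute. The following is a minimal inspection sketch (not part of the notebook) to confirm the detected names and types:
# Print the first few automatically detected feature definitions (name and detected type)
for fd in identity_feature_group.feature_definitions[:5]:
    print(fd.feature_name, fd.feature_type)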
Step 6: Create a feature group
In this step, you use the create function to create the feature group. The following code shows the available parameters; the feature group name and feature definitions come from the FeatureGroup object itself. The online store is not created by default, so you must set enable_online_store to True if you want to enable it. The s3_uri parameter is the S3 bucket location of your offline store.
# create a FeatureGroup
# The feature group name and feature definitions are taken from the FeatureGroup object itself.
feature_group.create(
    s3_uri=offline_feature_store_bucket,
    record_identifier_name=record_identifier_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=True,
    online_store_kms_key_id=None,
    offline_store_kms_key_id=None,
    disable_glue_table_creation=False,
    data_catalog_config=None,
    description="Some info about the feature group",
    tags=[{"Key": "tag1", "Value": "value1"}, {"Key": "tag2", "Value": "value2"}]
)
The following code from the fraud detection example shows a minimal create call for each of the two feature groups being created.
identity_feature_group.create(
    s3_uri=offline_feature_store_bucket,
    record_identifier_name=record_identifier_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=True
)

transaction_feature_group.create(
    s3_uri=offline_feature_store_bucket,
    record_identifier_name=record_identifier_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=True
)
When you create a feature group, it takes time to load the data, and you need to wait until the feature group is created before you can use it. You can check the status by using the following method.
status = feature_group.describe().get("FeatureGroupStatus")
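For example, a minimal polling sketch such as the following (the helper name and sleep interval are illustrative, not part of the notebook) can block until creation completes:
import time

def wait_for_feature_group_creation(feature_group):
    # Poll the feature group status until it leaves the Creating state
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for feature group creation...")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}: {status}")
    print(f"FeatureGroup {feature_group.name} successfully created.")

wait_for_feature_group_creation(identity_feature_group)
wait_for_feature_group_creation(transaction_feature_group)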
While the feature group is being created, you receive Creating as a response. When this step has finished successfully, the response is Created. Other possible statuses are CreateFailed, Deleting, or DeleteFailed.
Step 7: Work with feature groups
Now that you've set up your feature group, you can perform any of the following tasks:
Topics
- Describe a feature group
- List feature groups
- Put records in a feature group
- Get records from a feature group
- Generate Hive DDL commands
- Build a training dataset
- Write and execute an Athena query
- Delete a feature group
Describe a feature group
You can retrieve information about your feature group with the describe function.
feature_group.describe()
List feature groups
You can list all of your feature groups with the list_feature_groups function.
sagemaker_client.list_feature_groups()
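The list_feature_groups call also accepts optional filters such as NameContains. The following brief sketch (the substring is an illustrative assumption) lists only feature groups whose names contain it:
# List only feature groups whose names contain 'transaction'
response = sagemaker_client.list_feature_groups(NameContains='transaction')
for summary in response['FeatureGroupSummaries']:
    print(summary['FeatureGroupName'], summary['FeatureGroupStatus'])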
Put records in a feature group
You can use the ingest function to load your feature data. You pass in a data frame of feature data, set the number of workers, and choose whether to wait for it to return. The following example demonstrates using the ingest function.
feature_group.ingest(
    data_frame=feature_data, max_workers=3, wait=True
)
For each feature group you have, run the ingest function on the feature data you want to load.
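In the fraud detection example, that means one call per feature group. A sketch of what those calls might look like, reusing the data frames and feature groups defined earlier on this page:
# Ingest each dataset into its corresponding feature group
identity_feature_group.ingest(data_frame=identity_data, max_workers=3, wait=True)
transaction_feature_group.ingest(data_frame=transformed_transaction_data, max_workers=3, wait=True)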
Get records from a feature group
You can use the get_record function to retrieve the data for a specific record by its record identifier. The following example uses an example identifier to retrieve the record.
record_identifier_value = str(2990130)

featurestore_runtime.get_record(
    FeatureGroupName=transaction_feature_group_name,
    RecordIdentifierValueAsString=record_identifier_value
)
An example response from the fraud detection example:
...
'Record': [{'FeatureName': 'TransactionID', 'ValueAsString': '2990130'},
  {'FeatureName': 'isFraud', 'ValueAsString': '0'},
  {'FeatureName': 'TransactionDT', 'ValueAsString': '152647'},
  {'FeatureName': 'TransactionAmt', 'ValueAsString': '75.0'},
  {'FeatureName': 'ProductCD', 'ValueAsString': 'H'},
  {'FeatureName': 'card1', 'ValueAsString': '4577'},
...
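The runtime client also provides batch_get_record for retrieving several records in one call. The following minimal sketch assumes illustrative identifier values:
# Retrieve multiple records in a single call (identifier values are illustrative)
response = featurestore_runtime.batch_get_record(
    Identifiers=[
        {
            'FeatureGroupName': transaction_feature_group_name,
            'RecordIdentifiersValueAsString': ['2990130', '2990131']
        }
    ]
)
for record in response['Records']:
    print(record['RecordIdentifierValueAsString'])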
Generate Hive DDL commands
The SageMaker Python SDK’s FeatureStore class also provides the functionality to generate Hive DDL commands. The schema of the table is generated based on the feature definitions. Columns are named after the feature names, and data types are inferred from the feature types.
print(feature_group.as_hive_ddl())
Example output:
CREATE EXTERNAL TABLE IF NOT EXISTS sagemaker_featurestore.identity-feature-group-27-19-33-00 (
  TransactionID INT
  id_01 FLOAT
  id_02 FLOAT
  id_03 FLOAT
  id_04 FLOAT
...
Build a training dataset
Feature Store automatically builds an AWS Glue data catalog when you create feature groups (you can turn this off if you want). The following describes how to create a single training dataset with feature values from both the identity and transaction feature groups created earlier in this topic, by running an Amazon Athena query that joins the data stored in the offline store for both feature groups.
To start, create an Athena query using athena_query() for both the identity and transaction feature groups. The table_name is the AWS Glue table that is autogenerated by Feature Store.
identity_query = identity_feature_group.athena_query()
transaction_query = transaction_feature_group.athena_query()

identity_table = identity_query.table_name
transaction_table = transaction_query.table_name
Write and execute an Athena query
You write your query using SQL on these feature groups, and then execute the query with the run() command, specifying the Amazon S3 bucket location where the dataset is saved.
# Athena query
query_string = 'SELECT * FROM "'+transaction_table+'" LEFT JOIN "'+identity_table+'" ON "'+transaction_table+'".transactionid = "'+identity_table+'".transactionid'

# Run the Athena query. The output is saved to Amazon S3 and loaded into a pandas dataframe.
dataset = pd.DataFrame()
identity_query.run(query_string=query_string, output_location='s3://'+default_bucket+'/query_results/')
identity_query.wait()
dataset = identity_query.as_dataframe()
From here you can train a model using this data set and then perform inference.
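As a minimal sketch of that last step (scikit-learn and the column handling below are illustrative assumptions, not part of the notebook; note that Athena lowercases column names, so the label appears as isfraud), you might split the dataset and fit a simple classifier:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 'isfraud' is the label column from the transaction data (lowercased by Athena)
label = dataset['isfraud']
features = dataset.select_dtypes(include='number').drop(
    columns=['isfraud', 'transactionid', 'eventtime'], errors='ignore')

X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Test accuracy:', model.score(X_test, y_test))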
Delete a feature group
You can delete a feature group with the delete function.
feature_group.delete()
The following code example is from the fraud detection example.
identity_feature_group.delete()
transaction_feature_group.delete()
For more information, see the Delete a feature group API.