Download, Prepare, and Upload Training Data
For this example, you use a training dataset of bank customer information that includes each customer's job, marital status, and how they were contacted during the bank's direct marketing campaign. To use the dataset for a hyperparameter tuning job, you download it, transform the data, and then upload it to an Amazon S3 bucket.
For more information about the dataset and the data transformation that the example performs, see the hpo_xgboost_direct_marketing_sagemaker_APIs notebook in the Hyperparameter Tuning section of the SageMaker Examples tab in your notebook instance.
Download and Explore the Training Dataset
To download and explore the dataset, run the following code in your notebook:
import pandas as pd

# Download and extract the bank marketing dataset from the UCI repository
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip

data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
pd.set_option('display.max_columns', 500)  # Make sure we can see all of the columns
pd.set_option('display.max_rows', 5)       # Keep the output on one page
data
Prepare and Upload Data
Before creating the hyperparameter tuning job, prepare the data and upload it to an S3 bucket where the tuning job can access it.
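The upload code in this section assumes that the bucket and prefix variables are already defined in your notebook. If they are not, a minimal sketch like the following sets them up; the prefix value here is a hypothetical example, not part of the original notebook:

import sagemaker

# Assumed setup: use the default SageMaker bucket for your session and a
# hypothetical key prefix. Replace these with your own bucket and prefix.
bucket = sagemaker.Session().default_bucket()
prefix = 'sagemaker/DEMO-hpo-xgboost-dm'  # assumed prefix for this example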
Run the following code in your notebook:
import os
import boto3
import numpy as np

# Indicator variable to capture when pdays takes a value of 999
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)

# Indicator for individuals not actively employed
data['not_working'] = np.where(np.in1d(data['job'], ['student', 'retired', 'unemployed']), 1, 0)

# Convert categorical variables to sets of indicator variables
model_data = pd.get_dummies(data)

# Drop the duration column and the external economic indicators
model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
                              'euribor3m', 'nr.employed'], axis=1)

# Shuffle the data, then split it: 70% training, 20% validation, 10% test
train_data, validation_data, test_data = np.split(
    model_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(model_data)), int(0.9 * len(model_data))])

# Write each set to CSV with the target column ('y_yes') first and no
# index or header, the format the SageMaker XGBoost algorithm expects
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('test.csv', index=False, header=False)

# Upload the training and validation sets to S3
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
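As an optional sanity check, not part of the original example, you can list the objects under the prefix to confirm that the uploads reached your bucket:

# List the objects under the prefix to verify the uploads succeeded
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])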
Next Step
Configure and Launch a Hyperparameter Tuning Job