

# Input data
<a name="sms-data-input"></a>

The input data are the data objects that you send to your workforce to be labeled. There are two ways to send data objects to Ground Truth for labeling: 
+ Send a list of data objects that require labeling using an input manifest file.
+ Send individual data objects in real time to a perpetually running, streaming labeling job. 

If you have a dataset that needs to be labeled one time, and you do not require an ongoing labeling job, create a standard labeling job using an input manifest file. 

If you want to regularly send new data objects to your labeling job after it has started, create a streaming labeling job. When you create a streaming labeling job, you can optionally use an input manifest file to specify a group of data that you want labeled immediately when the job starts. You can continuously send new data objects to a streaming labeling job as long as it is active. 

**Note**  
Streaming labeling jobs are only supported through the SageMaker API. You cannot create a streaming labeling job using the SageMaker AI console.

The following task types have special input data requirements and options:
+ For [3D point cloud](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-point-cloud.html) labeling job input data requirements, see [3D Point Cloud Input Data](sms-point-cloud-input-data.md). 
+ For [video frame](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-video-task-types.html) labeling job input data requirements, see [Video Frame Input Data](sms-video-frame-input-data-overview.md).

**Topics**
+ [Input manifest files](sms-input-data-input-manifest.md)
+ [Automate data setup for labeling jobs](sms-console-create-manifest-file.md)
+ [Supported data formats](sms-supported-data-formats.md)
+ [Ground Truth streaming labeling jobs](sms-streaming-labeling-job.md)
+ [Input Data Quotas](input-data-limits.md)
+ [Select Data for Labeling](sms-data-filtering.md)

# Input manifest files
<a name="sms-input-data-input-manifest"></a>

Each line in an input manifest file is an entry containing an object, or a reference to an object, to label. An entry can also contain labels from previous jobs and for some task types, additional information. 

Input data and the manifest file must be stored in Amazon Simple Storage Service (Amazon S3). Each has specific storage and access requirements, as follows:
+ The Amazon S3 bucket that contains the input data must be in the same AWS Region in which you are running Amazon SageMaker Ground Truth. You must give Amazon SageMaker AI access to the data stored in the Amazon S3 bucket so that it can read it. For more information about Amazon S3 buckets, see [ Working with Amazon S3 buckets](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html). 
+ The manifest file must be in the same AWS Region as the data files, but it doesn't need to be in the same location as the data files. It can be stored in any Amazon S3 bucket that is accessible to the AWS Identity and Access Management (IAM) role that you assigned to Ground Truth when you created the labeling job.

**Note**  
3D point cloud and video frame [ task types](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-task-types.html) have different input manifest requirements and attributes.   
For [3D point cloud task types](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-point-cloud.html), refer to [Input Manifest Files for 3D Point Cloud Labeling Jobs](sms-point-cloud-input-manifest.md).  
For [video frame task types](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-video-task-types.html), refer to [Create a Video Frame Input Manifest File](sms-video-manual-data-setup.md#sms-video-create-manifest).

The manifest is a UTF-8 encoded file in which each line is a complete and valid JSON object. Each line is delimited by a standard line break, `\n` or `\r\n`. Because each line must be a valid JSON object, you can't have unescaped line break characters. For more information about the data format, see [JSON Lines](http://jsonlines.org/).

Each JSON object in the manifest file can be no larger than 100,000 characters. No single attribute within an object can be larger than 20,000 characters. Attribute names can't begin with `$` (dollar sign).
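
These limits can be checked before you upload a manifest. The following sketch validates one manifest line against the character and attribute-name rules; the function and message names are illustrative, and the attribute size is approximated by its JSON-serialized length:

```python
import json

# Quotas from the limits above; names here are illustrative.
MAX_LINE_CHARS = 100_000   # a full JSON object (one manifest line)
MAX_ATTR_CHARS = 20_000    # any single attribute within an object

def validate_manifest_line(line: str) -> list:
    """Return a list of quota violations for one manifest line."""
    problems = []
    if len(line) > MAX_LINE_CHARS:
        problems.append("line exceeds 100,000 characters")
    obj = json.loads(line)  # raises ValueError if the line is not valid JSON
    for key, value in obj.items():
        if key.startswith("$"):
            problems.append(f"attribute name {key!r} begins with '$'")
        # Approximate the attribute's size by its JSON-serialized length.
        if len(json.dumps(value)) > MAX_ATTR_CHARS:
            problems.append(f"attribute {key!r} exceeds 20,000 characters")
    return problems
```

A line such as `{"source": "Lorem ipsum"}` passes, while one whose attribute name starts with `$` is reported.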

Each JSON object in the manifest file must contain one of the following keys: `source-ref` or `source`. The value of the keys are interpreted as follows:
+ `source-ref` – The source of the object is the Amazon S3 object specified in the value. Use this value when the object is a binary object, such as an image.
+ `source` – The source of the object is the value. Use this value when the object is a text value.

The following is an example of a manifest file for files stored in an Amazon S3 bucket:

```
{"source-ref": "S3 bucket location 1"}
{"source-ref": "S3 bucket location 2"}
   ...
{"source-ref": "S3 bucket location n"}
```

Use the `source-ref` key for image files for bounding box, image classification (single and multi-label), semantic segmentation, and video clips for video classification labeling jobs. 3D point cloud and video frame labeling jobs also use the `source-ref` key but these labeling jobs require additional information in the input manifest file. For more information see [3D Point Cloud Input Data](sms-point-cloud-input-data.md) and [Video Frame Input Data](sms-video-frame-input-data-overview.md).

The following is an example of a manifest file with the input data stored in the manifest:

```
{"source": "Lorem ipsum dolor sit amet"}
{"source": "consectetur adipiscing elit"}
   ...
{"source": "mollit anim id est laborum"}
```

Use the `source` key for single and multi-label text classification and named entity recognition labeling jobs. 

You can include other key-value pairs in the manifest file. These pairs are passed to the output file unchanged. This is useful when you want to pass information between your applications. For more information, see [Labeling job output data](sms-data-output.md).
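
For a one-time job whose images already sit in Amazon S3, a manifest like the examples above can be generated with the AWS SDK for Python (Boto3). This is a sketch under assumed bucket, prefix, and key names:

```python
import json

def manifest_lines(s3_uris):
    """Build JSON Lines manifest content: one source-ref entry per object."""
    return "\n".join(json.dumps({"source-ref": uri}) for uri in s3_uris)

def write_manifest(bucket, prefix, manifest_key):
    """List image objects under a prefix and upload a manifest for them.
    The bucket, prefix, and manifest key are placeholders."""
    import boto3  # AWS SDK for Python (Boto3)
    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    uris = [
        f"s3://{bucket}/{obj['Key']}"
        for page in pages
        for obj in page.get("Contents", [])
        if obj["Key"].lower().endswith((".jpg", ".jpeg", ".png"))
    ]
    s3.put_object(Bucket=bucket, Key=manifest_key,
                  Body=manifest_lines(uris).encode("utf-8"))
```

The manifest file can be written to any bucket accessible to your IAM role, as noted above; it does not need to sit beside the data files.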

# Automate data setup for labeling jobs
<a name="sms-console-create-manifest-file"></a>

You can use the automated data setup to create manifest files for your labeling jobs in the Ground Truth console using images, videos, video frames, text (.txt) files, and comma-separated value (.csv) files stored in Amazon S3. When you use automated data setup, you specify an Amazon S3 location where your input data is stored and the input data type, and Ground Truth looks for the files that match that type in the location you specify.

**Note**  
Ground Truth does not use an AWS KMS key to access your input data or write the input manifest file in the Amazon S3 location that you specify. The user or role that creates the labeling job must have permissions to access your input data objects in Amazon S3.

Before using the following procedure, ensure that your input images or files are correctly formatted:
+ Image files – Image files must comply with the size and resolution limits listed in the tables found in [Input File Size Quota](input-data-limits.md#input-file-size-limit). 
+ Text files – Text data can be stored in one or more .txt files. Each item that you want labeled must be separated by a standard line break. 
+ CSV files – Text data can be stored in one or more .csv files. Each item that you want labeled must be in a separate row.
+ Videos – Video files can be any of the following formats: .mp4, .ogg, and .webm. If you want to extract video frames from your video files for object detection or object tracking, see [Provide Video Files](sms-point-cloud-video-input-data.md#sms-point-cloud-video-frame-extraction).
+ Video frames – Video frames are images extracted from a video. All images extracted from a single video are referred to as a *sequence of video frames*. Each sequence of video frames must have unique prefix keys in Amazon S3. See [Provide Video Frames](sms-point-cloud-video-input-data.md#sms-video-provide-frames). For this data type, see [Set up Automated Video Frame Input Data](sms-video-automated-data-setup.md).

**Important**  
For video frame object detection and video frame object tracking labeling jobs, see [Set up Automated Video Frame Input Data](sms-video-automated-data-setup.md) to learn how to use the automated data setup. 

Use these instructions to automatically set up your input dataset connection with Ground Truth.

**Automatically connect your data in Amazon S3 with Ground Truth**

1. Navigate to the **Create labeling job** page in the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/). 

   This link opens the console in the US East (N. Virginia) Region (us-east-1). If your input data is in an Amazon S3 bucket in another Region, switch to that Region. To change your AWS Region, on the [navigation bar](https://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/getting-started.html#select-region), choose the name of the currently displayed Region.

1. Select **Create labeling job**.

1. Enter a **Job name**. 

1. In the section **Input data setup**, select **Automated data setup**.

1. Enter an Amazon S3 URI for **S3 location for input datasets**. 

1. Specify your **S3 location for output datasets**. This is where your output data is stored. 

1. Choose your **Data type** using the dropdown list.

1. Use the dropdown menu under **IAM Role** to select an execution role. If you select **Create a new role**, specify the Amazon S3 buckets that you want to grant this role permission to access. This role must have permission to access the S3 buckets that you specified in steps 5 and 6.

1. Select **Complete data setup**.

This creates an input manifest file in the Amazon S3 location for input datasets that you specified in step 5. If you are creating a labeling job using the SageMaker API, AWS CLI, or an AWS SDK, use the Amazon S3 URI of this input manifest file as the value of the `ManifestS3Uri` parameter. 
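
When you create the labeling job programmatically, the generated manifest URI is passed in `InputConfig`. The following sketch assembles a minimal `CreateLabelingJob` request; the job name and label attribute name are placeholders, and `HumanTaskConfig` (workteam, worker UI, and Lambda ARNs), which is also required, is omitted because its contents depend on your task type and workforce:

```python
def build_labeling_job_request(manifest_s3_uri, output_s3_path, role_arn):
    """Minimal CreateLabelingJob parameters around an input manifest file.
    HumanTaskConfig, which is also required, is omitted here because it
    depends on the task type and workforce."""
    return {
        "LabelingJobName": "example-labeling-job",  # placeholder name
        "LabelAttributeName": "label",              # placeholder attribute
        "InputConfig": {
            "DataSource": {"S3DataSource": {"ManifestS3Uri": manifest_s3_uri}}
        },
        "OutputConfig": {"S3OutputPath": output_s3_path},
        "RoleArn": role_arn,
    }

# import boto3
# request = build_labeling_job_request(
#     "s3://example-groundtruth-images/dataset.manifest",  # placeholder URI
#     "s3://example-groundtruth-images/output/",
#     "arn:aws:iam::111122223333:role/example-execution-role")
# boto3.client("sagemaker").create_labeling_job(**request, HumanTaskConfig=...)
```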

The following GIF demonstrates how to use the automated data setup for image data. This example creates a file named `dataset-YYMMDDTHHmmss.manifest` in the Amazon S3 bucket `example-groundtruth-images`, where `YYMMDDTHHmmss` indicates the year (`YY`), month (`MM`), and day (`DD`), and the time in hours (`HH`), minutes (`mm`), and seconds (`ss`), at which the input manifest file was created. 

![\[GIF showing how to use the automated data setup for image data.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sms/gifs/automated-data-setup.gif)


# Supported data formats
<a name="sms-supported-data-formats"></a>

When you manually create an input manifest file for a [built-in task type](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-task-types.html), your input data must be in one of the following supported file formats for the respective input data type. To learn about automated data setup, see [Automate data setup for labeling jobs](sms-console-create-manifest-file.md).

**Tip**  
When you use the automated data setup, additional data formats can be used to generate an input manifest file for video frame and text-based task types.


****  

| Task Types | Input Data Type | Supported Formats | Example Input Manifest Line | 
| --- | --- | --- | --- | 
|  Bounding Box, Semantic Segmentation, Image Classification (Single Label and Multi-label), Verify and Adjust Labels  |  Image  |  .jpg, .jpeg, .png  |  <pre>{"source-ref": "s3://amzn-s3-demo-bucket1/example-image.png"}</pre>  | 
|  Named Entity Recognition, Text Classification (Single and Multi-Label)  | Text | Raw text |  <pre>{"source": "Lorem ipsum dolor sit amet"}</pre>  | 
|  Video Classification  | Video clips | .mp4, .ogg, and .webm |  <pre>{"source-ref": "s3://amzn-s3-demo-bucket1/example-video.mp4"}</pre>  | 
| Video Frame Object Detection, Video Frame Object Tracking (bounding boxes, polylines, polygons or keypoint) | Video frames and video frame sequence files (for Object Tracking) |  **Video frames**: .jpg, .jpeg, .png **Sequence files**: .json  | Refer to [Create a Video Frame Input Manifest File](sms-video-manual-data-setup.md#sms-video-create-manifest). | 
|  3D Point Cloud Semantic Segmentation, 3D Point Cloud Object Detection, 3D Point Cloud Object Tracking  | Point clouds and point cloud sequence files (for Object Tracking) |  **Point clouds**: Binary pack format and ASCII. For more information see [Accepted Raw 3D Data Formats](sms-point-cloud-raw-data-types.md). **Sequence files**: .json  | Refer to [Input Manifest Files for 3D Point Cloud Labeling Jobs](sms-point-cloud-input-manifest.md). | 

# Ground Truth streaming labeling jobs
<a name="sms-streaming-labeling-job"></a>

If you want to perpetually send new data objects to Amazon SageMaker Ground Truth to be labeled, use a streaming labeling job. Streaming labeling jobs allow you to:
+ Send new dataset objects to workers in real time using a perpetually running labeling job. Workers continuously receive new data objects to label as long as the labeling job is active and new objects are being sent to it.
+ Gain visibility into the number of objects that have been queued and are waiting to be labeled. Use this information to control the flow of data objects sent to your labeling job.
+ Receive label data for individual data objects in real time as workers finish labeling them. 

Ground Truth streaming labeling jobs remain active until they are manually stopped or have been idle for more than 10 days. You can intermittently send new data objects to workers while the labeling job is active.

If you are a new user of Ground Truth streaming labeling jobs, it is recommended that you review [How it works](#sms-streaming-how-it-works). 

Use [Create a streaming labeling job](sms-streaming-create-job.md) to learn how to create a streaming labeling job.

**Note**  
Ground Truth streaming labeling jobs are only supported through the SageMaker API.

## How it works
<a name="sms-streaming-how-it-works"></a>

When you create a Ground Truth streaming labeling job, the job remains active until it is manually stopped, remains idle for more than 10 days, or is unable to access input data sources. You can intermittently send new data objects to workers while it is active. A worker can continue to receive new data objects in real time as long as the total number of tasks currently available to the worker is less than the value in [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html#sagemaker-Type-HumanTaskConfig-MaxConcurrentTaskCount](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html#sagemaker-Type-HumanTaskConfig-MaxConcurrentTaskCount). Otherwise, the data object is sent to a queue that Ground Truth creates on your behalf in [Amazon Simple Queue Service](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html) (Amazon SQS) for later processing. These tasks are sent to workers as soon as the total number of tasks currently available to a worker falls below `MaxConcurrentTaskCount`. If a data object is not sent to a worker after 14 days, it expires. You can view the number of tasks pending in the queue and adjust the number of objects you send to the labeling job. For example, you may decrease the speed at which you send objects to the labeling job if the backlog of pending objects moves above a threshold. 
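
To check the backlog described above, you can read the queue's approximate depth with the AWS SDK for Python (Boto3). The queue-name convention is documented under *Manage labeling requests with an Amazon SQS queue*; the job name below is a placeholder:

```python
def streaming_queue_name(labeling_job_name: str) -> str:
    """Ground Truth names the queue GroundTruth-<labeling job name>, lowercased."""
    return "GroundTruth-" + labeling_job_name.lower()

def pending_task_count(labeling_job_name: str) -> int:
    """Approximate number of queued data objects not yet sent to workers."""
    import boto3  # AWS SDK for Python (Boto3)
    sqs = boto3.client("sqs")
    queue_url = sqs.get_queue_url(
        QueueName=streaming_queue_name(labeling_job_name))["QueueUrl"]
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["ApproximateNumberOfMessages"])
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])
```

You might pause or slow your producers whenever `pending_task_count` rises above a threshold of your choosing.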

**Topics**
+ [How it works](#sms-streaming-how-it-works)
+ [Send data to a streaming labeling job](sms-streaming-how-it-works-send-data.md)
+ [Manage labeling requests with an Amazon SQS queue](sms-streaming-how-it-works-sqs.md)
+ [Receive output data from a streaming labeling job](sms-streaming-how-it-works-output-data.md)
+ [Duplicate message handling](sms-streaming-impotency.md)

# Send data to a streaming labeling job
<a name="sms-streaming-how-it-works-send-data"></a>

You can optionally submit input data to a streaming labeling job one time when you create the labeling job using an input manifest file. Once the labeling job has started and the state is `InProgress`, you can submit new data objects to your labeling job in real time using your Amazon SNS input topic and Amazon S3 event notifications. 

***Submit Data Objects When You Start the Labeling Job (One Time):***
+ **Use an Input Manifest File** – You can optionally specify an input manifest file Amazon S3 URI in `ManifestS3Uri` when you create the streaming labeling job. Ground Truth sends each data object in the manifest file to workers for labeling as soon as the labeling job starts. To learn more, see [Create a Manifest File (Optional)](sms-streaming-manifest.md).

  After you submit a request to create the streaming labeling job, its status will be `Initializing`. Once the labeling job is active, the state changes to `InProgress` and you can start using the real-time options to submit additional data objects for labeling. 

***Submit Data Objects in Real Time:***
+ **Send data objects using Amazon SNS messages** – You can send Ground Truth new data objects to label by sending an Amazon SNS message. You will send this message to an Amazon SNS input topic that you create and specify when you create your streaming labeling job. For more information, see [Send data objects using Amazon SNS](#sms-streaming-how-it-works-sns).
+ **Send data objects by placing them in an Amazon S3 bucket** – Each time you add a new data object to an Amazon S3 bucket, you can prompt Ground Truth to process that object for labeling. To do this, you add an event notification to the bucket so that it notifies your Amazon SNS input topic each time a new object is added to (or *created in*) that bucket. For more information, see [Send data objects using Amazon S3](#sms-streaming-how-it-works-s3). This option is not available for text-based labeling jobs such as text classification and named entity recognition. 
**Important**  
If you use the Amazon S3 configuration, do not use the same Amazon S3 location for your input data configuration and your output data. You specify the S3 prefix for your output data when you create a labeling job.

## Send data objects using Amazon SNS
<a name="sms-streaming-how-it-works-sns"></a>

You can send data objects to your streaming labeling job using Amazon Simple Notification Service (Amazon SNS). Amazon SNS is a web service that coordinates and manages the delivery of messages to and from *endpoints* (for example, an email address or AWS Lambda function). An Amazon SNS *topic* acts as a communication channel between two or more endpoints. You use Amazon SNS to send, or *publish*, new data objects to the topic specified in the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateLabelingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateLabelingJob.html) parameter `SnsTopicArn` in `InputConfig`. The format of these messages is the same as a single line from an [input manifest file](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-data-input.html). 

For example, you may send a piece of text to an active text classification labeling job by publishing it to your input topic. The message that you publish may look similar to the following:

```
{"source": "Lorem ipsum dolor sit amet"}
```

To send a new image object to an image classification labeling job, your message may look similar to the following:

```
{"source-ref": "s3://amzn-s3-demo-bucket/example-image.jpg"}
```

**Note**  
You can also include custom deduplication IDs and deduplication keys in your Amazon SNS messages. To learn more, see [Duplicate message handling](sms-streaming-impotency.md).

When Ground Truth creates your streaming labeling job, it subscribes to your Amazon SNS input topic. 
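
As a minimal sketch (the topic ARN and object URI are placeholders), publishing one new image object to the input topic with the AWS SDK for Python (Boto3) might look like the following. The message body is a single input-manifest line, as described above:

```python
import json

def manifest_line(s3_uri: str) -> str:
    """A streaming message body is a single input-manifest line."""
    return json.dumps({"source-ref": s3_uri})

def send_data_object(input_topic_arn: str, s3_uri: str):
    """Publish one data object to the streaming job's SNS input topic."""
    import boto3  # AWS SDK for Python (Boto3)
    sns = boto3.client("sns")
    return sns.publish(TopicArn=input_topic_arn, Message=manifest_line(s3_uri))
```

For a text classification job, you would publish a `{"source": ...}` line instead.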

## Send data objects using Amazon S3
<a name="sms-streaming-how-it-works-s3"></a>

You can send one or more new data objects to a streaming labeling job by placing them in an Amazon S3 bucket that is configured with an Amazon SNS event notification. You can set up an event to notify your Amazon SNS input topic anytime a new object is created in your bucket. You must specify this same Amazon SNS input topic in the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateLabelingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateLabelingJob.html) parameter `SnsTopicArn` in `InputConfig`.

Anytime you configure an Amazon S3 bucket to send notifications to Amazon SNS, Ground Truth will publish a test event, `"s3:TestEvent"`, to ensure that the topic exists and that the owner of the Amazon S3 bucket specified has permission to publish to the specified topic. It is recommended that you set up your Amazon S3 connection with Amazon SNS before starting a streaming labeling job. If you do not, this test event may register as a data object and be sent to Ground Truth for labeling. 

**Important**  
If you use the Amazon S3 configuration, do not use the same Amazon S3 location for your input data configuration and your output data. You specify the S3 prefix for your output data when you create a labeling job.  
For image-based labeling jobs, Ground Truth requires all S3 buckets to have a CORS policy attached. To learn more, see [CORS Requirement for Input Image Data](sms-cors-update.md).

Once you have configured your Amazon S3 bucket and created your labeling job, you can add objects to your bucket and Ground Truth either sends that object to workers or places it on your Amazon SQS queue. 

To learn more, see [Creating Amazon S3 based bucket event notifications based of the Amazon SNS defined in your labeling job](sms-streaming-s3-setup.md).

**Important**  
This option is not available for text-based labeling jobs such as text classification and named entity recognition.
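
The bucket-to-topic wiring can be sketched with the AWS SDK for Python (Boto3). The bucket name and topic ARN below are placeholders, and the input topic's access policy must also allow Amazon S3 to publish to it:

```python
def notification_config(input_topic_arn: str) -> dict:
    """Bucket notification configuration that publishes every object-created
    event to the streaming labeling job's Amazon SNS input topic."""
    return {
        "TopicConfigurations": [{
            "TopicArn": input_topic_arn,
            "Events": ["s3:ObjectCreated:*"],
        }]
    }

# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="amzn-s3-demo-bucket",  # placeholder input bucket
#     NotificationConfiguration=notification_config(
#         "arn:aws:sns:us-east-1:111122223333:example-input-topic"))
```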

# Manage labeling requests with an Amazon SQS queue
<a name="sms-streaming-how-it-works-sqs"></a>

When Ground Truth creates your streaming labeling job, it creates an Amazon SQS queue in the AWS account used to create the labeling job. The queue name is `GroundTruth-labeling_job_name`, where `labeling_job_name` is the name of your labeling job in lowercase letters. When you send data objects to your labeling job, Ground Truth either sends the data objects directly to workers or places the task in your queue to be processed at a later time. If a data object is not sent to a worker within 14 days, it expires and is removed from the queue. You can set up an alarm in Amazon SQS to detect when objects expire and use this mechanism to control the volume of objects that you send to your labeling job.

**Important**  
Modifying, deleting, or sending objects directly to the Amazon SQS queue associated with your streaming labeling job may lead to job failures. 
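
One way to implement the alarm mentioned above (a sketch of one possible approach, not a documented mechanism) is a CloudWatch alarm on the queue's `ApproximateAgeOfOldestMessage` metric, which approaches 14 days as an object nears expiry. The alarm name, threshold, and action topic here are illustrative:

```python
def expiry_alarm_params(labeling_job_name: str, sns_alarm_topic_arn: str) -> dict:
    """CloudWatch alarm parameters that fire when the oldest queued object
    is older than 13 days (one day before the 14-day expiry)."""
    queue_name = "GroundTruth-" + labeling_job_name.lower()
    return {
        "AlarmName": f"{queue_name}-near-expiry",  # illustrative name
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 3600,                  # evaluate hourly
        "EvaluationPeriods": 1,
        "Threshold": 13 * 24 * 3600,     # 13 days, in seconds
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_alarm_topic_arn],
    }

# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **expiry_alarm_params("my-job", "arn:aws:sns:us-east-1:111122223333:alerts"))
```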

# Receive output data from a streaming labeling job
<a name="sms-streaming-how-it-works-output-data"></a>

Your Amazon S3 output bucket is periodically updated with new output data from your streaming labeling job. Optionally, you can specify an Amazon SNS output topic. Each time a worker submits a labeled object, a notification with the output data is sent to that topic. You can subscribe an endpoint to your SNS output topic to receive notifications or trigger events when you receive output data from a labeling task. Use an Amazon SNS output topic if you want to chain to another streaming job in real time and receive an Amazon SNS notification each time a worker submits a data object.

To learn more, see [Subscribe an Endpoint to Your Amazon SNS Output Topic](sms-create-sns-input-topic.md#sms-streaming-subscribe-output-topic).

# Duplicate message handling
<a name="sms-streaming-impotency"></a>

For data objects sent in real time, Ground Truth guarantees idempotency by ensuring each unique object is only sent for labeling once, even if the input message referring to that object is received multiple times (duplicate messages). To do this, each data object sent to a streaming labeling job is assigned a *deduplication ID*, which is identified with a *deduplication key*. If you send your requests to label data objects directly through your Amazon SNS input topic using Amazon SNS messages, you can optionally choose a custom deduplication key and deduplication IDs for your objects. For more information, see [Specify a deduplication key and ID in an Amazon SNS message](sms-streaming-impotency-create.md).

If you do not provide your own deduplication key, or if you use the Amazon S3 configuration to send data objects to your labeling job, Ground Truth uses one of the following for the deduplication ID:
+ For messages sent directly to your Amazon SNS input topic, Ground Truth uses the SNS message ID. 
+ For messages that come from an Amazon S3 configuration, Ground Truth creates a deduplication ID by combining the Amazon S3 URI of the object with the [sequencer token](https://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html) in the message.

# Specify a deduplication key and ID in an Amazon SNS message
<a name="sms-streaming-impotency-create"></a>

When you send a data object to your streaming labeling job using an Amazon SNS message, you have the option to specify your deduplication key and deduplication ID in one of the following ways. In all of these scenarios, identify your deduplication key with `dataset-objectid-attribute-name`.

**Bring Your Own Deduplication Key and ID**

Create your own deduplication key and deduplication ID by configuring your Amazon SNS message as follows. Replace `byo-key` with your key and `UniqueId` with the deduplication ID for that data object.

```
{
    "source-ref":"s3://amzn-s3-demo-bucket/prefix/object1", 
    "dataset-objectid-attribute-name":"byo-key",
    "byo-key":"UniqueId" 
}
```

Your deduplication key can be up to 140 characters. Supported patterns include: `"^[$a-zA-Z0-9](-*[a-zA-Z0-9])*"`.

Your deduplication ID can be up to 1,024 characters. Supported patterns include: `^(https|s3)://([^/]+)/?(.*)$`.
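
The two patterns above can be checked before you publish a message. A small sketch using Python's `re` module (treating the quoted key pattern as a full-string requirement is an assumption; the function names are illustrative):

```python
import re

# Patterns quoted from the limits above.
KEY_PATTERN = re.compile(r"^[$a-zA-Z0-9](-*[a-zA-Z0-9])*")
ID_PATTERN = re.compile(r"^(https|s3)://([^/]+)/?(.*)$")

def valid_dedup_key(key: str) -> bool:
    """True if the key is at most 140 characters and matches the pattern."""
    return len(key) <= 140 and KEY_PATTERN.fullmatch(key) is not None

def valid_dedup_id(dedup_id: str) -> bool:
    """True if the ID is at most 1,024 characters and matches the pattern."""
    return len(dedup_id) <= 1024 and ID_PATTERN.match(dedup_id) is not None
```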

**Use an Existing Key for your Deduplication Key**

You can use an existing key in your message as the deduplication key. When you do this, the value associated with that key is used for the deduplication ID. 

For example, you can use the `source-ref` key as your deduplication key by formatting your message as follows: 

```
{
    "source-ref":"s3://amzn-s3-demo-bucket/prefix/object1",
    "dataset-objectid-attribute-name":"source-ref" 
}
```

In this example, Ground Truth uses `"s3://amzn-s3-demo-bucket/prefix/object1"` for the deduplication ID.

# Find deduplication key and ID in your output data
<a name="sms-streaming-impotency-output"></a>

You can see the deduplication key and ID in your output data. The deduplication key is identified by `dataset-objectid-attribute-name`. When you use your own custom deduplication key, your output contains something similar to the following:

```
"dataset-objectid-attribute-name": "byo-key",
"byo-key": "UniqueId",
```

When you do not specify a key, you can find the deduplication ID that Ground Truth assigned to your data object as follows. The `$label-attribute-name-object-id` parameter identifies your deduplication ID. 

```
{
    "source-ref":"s3://bucket/prefix/object1", 
    "dataset-objectid-attribute-name":"$label-attribute-name-object-id",
    "label-attribute-name":0,
    "label-attribute-name-metadata": {...},
    "$label-attribute-name-object-id":"<service-generated-key>"
}
```

For `<service-generated-key>`, if the data object came through an Amazon S3 configuration, Ground Truth adds a unique value used by the service and emits a new field keyed by `$sequencer`, which shows the Amazon S3 sequencer used. If the object was sent directly to Amazon SNS, Ground Truth uses the Amazon SNS message ID.

**Note**  
Do not use the `$` character in your label attribute name. 

# Input Data Quotas
<a name="input-data-limits"></a>

Input datasets used in semantic segmentation labeling jobs have a quota of 20,000 items. For all other labeling job types, the dataset size quota is 100,000 items. To request an increase to the quota for labeling jobs other than semantic segmentation jobs, review the procedures in [AWS Service Quotas](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) to request a quota increase.

Input image data for active and non-active learning labeling jobs must not exceed size and resolution quotas. *Active learning* refers to labeling jobs that use [automated data labeling](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html). *Non-active learning* refers to labeling jobs that don't use automated data labeling.

Additional quotas apply for label categories for all task types, and for input data and labeling category attributes for 3D point cloud and video frame task types. 

## Input File Size Quota
<a name="input-file-size-limit"></a>

Input files can't exceed the following size quotas for both active and non-active learning labeling jobs. There is no input file size quota for videos used in [video classification](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-video-classification.html) labeling jobs.


| Labeling Job Task Type | Input File Size Quota | 
| --- | --- | 
| Image classification | 40 MB | 
| Bounding box (Object detection) | 40 MB | 
| Semantic segmentation | 40 MB | 
| Bounding box (Object detection) label adjustment | 40 MB | 
| Semantic segmentation label adjustment | 40 MB | 
| Bounding box (Object detection) label verification | 40 MB | 
| Semantic segmentation label verification | 40 MB | 

## Input Image Resolution Quotas
<a name="non-active-learning-input-data-limits"></a>

Image file resolution refers to the number of pixels in an image, and determines the amount of detail an image holds. Image resolution quotas differ depending on the labeling job type and the SageMaker AI built-in algorithm used. The following table lists the resolution quotas for images used in active and non-active learning labeling jobs.


| Labeling Job Task Type | Resolution Quota - Non-Active Learning | Resolution Quota - Active Learning | 
| --- | --- | --- | 
| Image classification | 100 million pixels | 3840 x 2160 pixels (4K) | 
| Bounding box (Object detection) | 100 million pixels | 3840 x 2160 pixels (4K) | 
| Semantic segmentation | 100 million pixels | 1920 x 1080 pixels (1080p) | 
| Object detection label adjustment | 100 million pixels | 3840 x 2160 pixels (4K) | 
| Semantic segmentation label adjustment | 100 million pixels | 1920 x 1080 pixels (1080p) | 
| Object detection label verification | 100 million pixels | Not available | 
| Semantic segmentation label verification | 100 million pixels | Not available | 
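
A quick pre-flight check against the non-active learning quotas above might look like the following sketch (whether the 40 MB quota is measured in binary or decimal megabytes is an assumption; binary is used here):

```python
# Quotas from the tables above (non-active learning).
MAX_FILE_BYTES = 40 * 1024 * 1024   # 40 MB, assuming binary megabytes
MAX_PIXELS = 100_000_000            # 100 million pixels

def check_image_quotas(file_bytes: int, width: int, height: int) -> list:
    """Return a list of quota violations for one input image."""
    problems = []
    if file_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds the 40 MB size quota")
    if width * height > MAX_PIXELS:
        problems.append("image exceeds the 100 million pixel quota")
    return problems
```

For example, a 5 MB image at 1920 x 1080 passes both checks for non-active learning jobs; active learning jobs additionally cap the resolution as shown in the table.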

## Label Category Quotas
<a name="sms-label-quotas"></a>

Each labeling job task type has a quota for the number of label categories you can specify. Workers select label categories to create annotations. For example, you may specify label categories *car*, *pedestrian*, and *biker* when creating a bounding box labeling job and workers will select the *car* category before drawing bounding boxes around cars.

**Important**  
Label category names cannot exceed 256 characters.   
All label categories must be unique. You cannot specify duplicate label categories. 

The following label category limits apply to labeling jobs. Quotas for label categories depend on whether you use the SageMaker API operation `CreateLabelingJob` or the console to create a labeling job.


****  

| Labeling Job Task Type | Label Category Quota - API | Label Category Quota - Console | 
| --- | --- | --- | 
| Image classification (Multi-label) | 50 | 50 | 
| Image classification (Single label) | Unlimited | 30 | 
| Bounding box (Object detection) | 50 | 50 | 
| Label verification | Unlimited | 30 | 
| Semantic segmentation (with active learning) | 20 | 10 | 
| Semantic segmentation (without active learning) | Unlimited | 10 | 
| Named entity recognition | Unlimited | 30 | 
| Text classification (Multi-label) | 50 | 50 | 
| Text classification (Single label) | Unlimited | 30 | 
| Video classification | 30 | 30 | 
| Video frame object detection | 30 | 30 | 
| Video frame object tracking | 30 | 30 | 
| 3D point cloud object detection | 30 | 30 | 
| 3D point cloud object tracking | 30 | 30 | 
| 3D point cloud semantic segmentation | 30 | 30 | 
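
As an illustration of these rules, a pre-flight check like the following (a hypothetical helper, not an AWS API) could enforce the 256-character name limit, uniqueness, and a per-job category quota before you call `CreateLabelingJob`:

```python
def validate_label_categories(categories, quota=50):
    """Hypothetical pre-flight check; pass the quota from the table above."""
    if not categories:
        raise ValueError("at least one label category is required")
    if len(categories) > quota:
        raise ValueError(f"{len(categories)} categories exceeds the quota of {quota}")
    # All label categories must be unique.
    if len(set(categories)) != len(categories):
        raise ValueError("label categories must be unique")
    # Label category names cannot exceed 256 characters.
    for name in categories:
        if len(name) > 256:
            raise ValueError(f"label category name exceeds 256 characters: {name[:32]}...")
```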

## Generative AI Labeling Job Quotas
<a name="gen-ai-labeling-job-quotas"></a>

The following quotas apply to question-answer pairs that you provide in the labeling application.


| Quota Type | Data Quota | 
| --- | --- | 
| Question-answer pairs | Minimum is one pair. Maximum is 20 pairs. | 
| Word count of a question | Minimum is one word. Maximum is 200 words. | 
| Word count of an answer | Minimum is one word. Maximum is 200 words. | 
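
These quotas can be sketched as a simple validator (an illustrative helper, not an AWS API) that checks the pair count and the word counts of each question and answer:

```python
def validate_qa_pairs(pairs):
    """Illustrative check of the quotas above.

    pairs is a list of (question, answer) string tuples.
    """
    # Minimum is one pair; maximum is 20 pairs.
    if not 1 <= len(pairs) <= 20:
        raise ValueError("provide between 1 and 20 question-answer pairs")
    for question, answer in pairs:
        for label, text in (("question", question), ("answer", answer)):
            # Word counts must be between 1 and 200 words.
            words = len(text.split())
            if not 1 <= words <= 200:
                raise ValueError(f"{label} must be 1 to 200 words, got {words}")
```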

## 3D Point Cloud and Video Frame Labeling Job Quotas
<a name="sms-input-data-quotas-other"></a>

The following quotas apply to 3D point cloud and video frame labeling job input data.



| Labeling Job Task Type | Input Data Quota | 
| --- | --- | 
| Video frame object detection  |  2,000 video frames (images) per sequence  | 
| Video frame object detection  |  10 video frame sequences per manifest file | 
| Video frame object tracking |  2,000 video frames (images) per sequence  | 
| Video frame object tracking |  10 video frame sequences per manifest file | 
| 3D point cloud object detection |  100,000 point cloud frames per labeling job | 
| 3D point cloud object tracking |  100,000 point cloud frame sequences per labeling job | 
| 3D point cloud object tracking |  500 point cloud frames in each sequence file | 

When you create a video frame or 3D point cloud labeling job, you can add one or more *label category attributes* to each label category that you specify. Workers use these attributes to provide more information about an annotation.

Each label category attribute has a single label category attribute `name`, and a list of one or more options (values) to choose from. To learn more, see [Worker user interface (UI)](sms-point-cloud-general-information.md#sms-point-cloud-worker-task-ui) for 3D point cloud labeling jobs and [Worker user interface (UI)](sms-video-overview.md#sms-video-worker-task-ui) for video frame labeling jobs. 

The following quotas apply to the number of label category attribute names and values that you can specify for labeling jobs.



| Labeling Job Task Type | Label Category Attribute (name) Quota | Label Category Attribute Values Quota | 
| --- | --- | --- | 
| Video frame object detection  | 10 | 10 | 
| Video frame object tracking | 10 | 10 | 
| 3D point cloud object detection | 10 | 10 | 
| 3D point cloud object tracking | 10 | 10 | 
| 3D point cloud semantic segmentation | 10 | 10 | 
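
The shape of a label category with attributes can be sketched in Python as follows. The key names here are illustrative only, not the exact schema of the Ground Truth label category configuration file:

```python
# Illustrative structure only; key names are assumptions, not the actual
# Ground Truth label category configuration file schema.
MAX_ATTRIBUTES = 10        # label category attribute (name) quota
MAX_ATTRIBUTE_VALUES = 10  # label category attribute values quota

def make_label_category(name, attributes):
    """Build a label category; attributes maps attribute name -> value list."""
    if len(attributes) > MAX_ATTRIBUTES:
        raise ValueError("too many label category attributes")
    for attr_name, values in attributes.items():
        if len(values) > MAX_ATTRIBUTE_VALUES:
            raise ValueError(f"too many values for attribute {attr_name!r}")
    return {
        "label": name,
        "attributes": [{"name": a, "values": v} for a, v in attributes.items()],
    }

# Example: a "car" category where workers also record how occluded the car is.
car = make_label_category("car", {"occluded": ["no", "partly", "fully"]})
```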

# Select Data for Labeling
<a name="sms-data-filtering"></a>

You can use the Amazon SageMaker AI console to select a portion of your dataset for labeling. The data must be stored in an Amazon S3 bucket. You have three options:
+ Use the full dataset.
+ Choose a randomly selected sample of the dataset.
+ Specify a subset of the dataset using a query.

The following options are available in the **Labeling jobs** section of the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/groundtruth) after selecting **Create labeling job**. To learn how to create a labeling job in the console, see [Getting started: Create a bounding box labeling job with Ground Truth](sms-getting-started.md). To configure the dataset that you use for labeling, in the **Job overview** section, choose **Additional configuration**.

## Use the Full Dataset
<a name="sms-full-dataset"></a>

When you choose to use the **Full dataset**, you must provide a manifest file for your data objects. You can provide the path of the Amazon S3 bucket that contains the manifest file or use the SageMaker AI console to create the file. To learn how to create a manifest file using the console, see [Automate data setup for labeling jobs](sms-console-create-manifest-file.md). 
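
An input manifest file is a collection of JSON lines, one per data object, where each line typically uses a `"source-ref"` key that points to the object's Amazon S3 URI. The following sketch builds such a manifest (the bucket name and keys are placeholders):

```python
import json

# Placeholder S3 URIs; in a real manifest each line references one data object.
object_uris = [
    "s3://amzn-s3-demo-bucket/images/0001.jpg",
    "s3://amzn-s3-demo-bucket/images/0002.jpg",
]

# Each manifest line is a standalone JSON object.
manifest_lines = [json.dumps({"source-ref": uri}) for uri in object_uris]
manifest = "\n".join(manifest_lines)
print(manifest)
```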

## Choose a Random Sample
<a name="sms-random-dataset"></a>

When you want to label a random subset of your data, select **Random sample**. The dataset is stored in the Amazon S3 bucket specified in the **Input dataset location** field. 

After you have specified the percentage of data objects that you want to include in the sample, choose **Create subset**. SageMaker AI randomly picks the data objects for your labeling job. After the objects are selected, choose **Use this subset**. 

SageMaker AI creates a manifest file for the selected data objects. It also modifies the value in the **Input dataset location** field to point to the new manifest file.

## Specify a Subset
<a name="sms-select-dataset"></a>

**Amazon S3 Select**  
Amazon S3 Select is no longer available to new customers. Existing customers of Amazon S3 Select can continue to use the feature as usual. To learn more, see [How to optimize querying your data in Amazon S3](https://aws.amazon.com/blogs/storage/how-to-optimize-querying-your-data-in-amazon-s3/).

You can specify a subset of your data objects using an Amazon S3 `SELECT` query on the object file names. 

The `SELECT` statement of the SQL query is defined for you. You provide the `WHERE` clause to specify which data objects should be returned.

For more information about the Amazon S3 `SELECT` statement, see [Selecting Content from Objects](https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html).
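
Conceptually, a `WHERE` clause on object file names behaves like the following Python filter. The SQL fragment in the comment uses illustrative syntax; the console supplies the `SELECT` portion for you:

```python
# Illustrative only: mimics what a WHERE clause on object file names selects.
# Example clause (illustrative syntax; the SELECT part is defined for you):
#   WHERE key LIKE '%.jpg'
object_keys = [
    "images/0001.jpg",
    "images/0002.png",
    "images/0003.jpg",
]

# Keep only the .jpg objects, as the example clause above would.
selected = [key for key in object_keys if key.endswith(".jpg")]
print(selected)
```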

Choose **Create subset** to start the selection, and then choose **Use this subset** to use the selected data. 

SageMaker AI creates a manifest file for the selected data objects. It also updates the value in the **Input dataset location** field to point to the new manifest file.