

# Development troubleshooting
<a name="troubleshooting-development"></a>

This section contains information about diagnosing and fixing development issues with your Managed Service for Apache Flink application.

**Topics**
+ [System rollback best practices](troubleshooting-system-rollback.md)
+ [Hudi configuration best practices](troubleshooting-hudi.md)
+ [Apache Flink Flame Graphs](troubleshooting-update-flamegraphs.md)
+ [Credential provider issue with EFO connector 1.15.2](troubleshooting-credential-provider.md)
+ [Applications with unsupported Kinesis connectors](troubleshooting-unsupported-kinesis-connectors.md)
+ [Compile error: "Could not resolve dependencies for project"](troubleshooting-compile.md)
+ [Invalid choice: "kinesisanalyticsv2"](troubleshooting-cli-update.md)
+ [UpdateApplication action isn't reloading application code](troubleshooting-update.md)
+ [S3 StreamingFileSink FileNotFoundExceptions](troubleshooting-s3sink.md)
+ [FlinkKafkaConsumer issue with stop with savepoint](troubleshooting-FlinkKafkaConsumer.md)
+ [Flink 1.15 Async Sink Deadlock](troubleshooting-async-deadlock.md)
+ [Amazon Kinesis data streams source processing out of order during re-sharding](troubleshooting-kinesis-data-streams-processing-out-of-order.md)
+ [Real-time vector embedding blueprints FAQ and troubleshooting](troubleshooting-blueprints.md)

# System rollback best practices
<a name="troubleshooting-system-rollback"></a>

With automatic system rollback and operations visibility capabilities in Amazon Managed Service for Apache Flink, you can identify and resolve issues with your applications.

## System rollbacks
<a name="troubleshooting-system-rollbacks"></a>

If your application update or scaling operation fails because of a customer error, such as a code bug or a permissions issue, Amazon Managed Service for Apache Flink automatically attempts to roll back to the previous running version, provided you have opted in to this functionality. For more information, see [Enable system rollbacks for your Managed Service for Apache Flink application](how-system-rollbacks.md). If the automatic rollback fails, or if you have not opted in to it, your application is placed into the `READY` state. To update your application, complete the following steps:

1. Check the Amazon Managed Service for Apache Flink console, or use the `DescribeApplicationOperation` API, to see the error description for why the operation failed.

1. For the full error stack, use [CloudWatch logs](https://docs.aws.amazon.com/managed-flink/latest/java/logging.html).

1. Resolve the underlying issue. Common issues are insufficient permissions, incompatible code changes, and infrastructure misconfigurations.

1. Use the `UpdateApplication` API to redeploy your new application version.
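
The inspect-and-redeploy flow can be sketched with the AWS CLI as follows. The application name, operation ID, version ID, and update file are hypothetical placeholders.

```
# Inspect why the operation failed (names and IDs are placeholders).
aws kinesisanalyticsv2 describe-application-operation \
    --application-name MyFlinkApp \
    --operation-id 1a2b3c4d

# After resolving the underlying issue, redeploy the new application version.
aws kinesisanalyticsv2 update-application \
    --application-name MyFlinkApp \
    --current-application-version-id 5 \
    --application-configuration-update file://update.json
```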

## Manual rollback
<a name="troubleshooting-manual-rollback"></a>

If the application is not progressing and remains in a transient state for a long time, or if the application successfully transitioned to `RUNNING` but you see downstream issues such as processing errors in the updated Flink application, you can manually roll it back using the `RollbackApplication` API.

1. Call `RollbackApplication`. This reverts the application to the previous running version and restores the previous state.

1. Monitor the rollback operation using the `DescribeApplicationOperation` API.

1. If rollback fails, use the previous system rollback steps.
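
The steps above can be sketched with the AWS CLI as follows. The application name, version ID, and operation ID are placeholders.

```
# Revert to the previous running application version and state.
aws kinesisanalyticsv2 rollback-application \
    --application-name MyFlinkApp \
    --current-application-version-id 7

# Monitor the rollback using the operation ID returned by the call above.
aws kinesisanalyticsv2 describe-application-operation \
    --application-name MyFlinkApp \
    --operation-id 1a2b3c4d
```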

## Operations visibility
<a name="troubleshooting-operations-visibility"></a>

The `ListApplicationOperations` API shows the history of all customer and system operations on your application.

1. Get the *operationId* of the failed operation from the list.

1. Call `DescribeApplicationOperation` and check the status and *statusDescription*.

1. If an operation failed, the description points to a potential error to investigate. 
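
As a sketch with the AWS CLI, you can list operations and filter for failures. The application name is a placeholder, and the `--query` expression assumes the response field names documented for `ListApplicationOperations` (`ApplicationOperationInfoList`, `OperationStatus`, `OperationId`, `Operation`).

```
# List recent operations and show only the failed ones.
aws kinesisanalyticsv2 list-application-operations \
    --application-name MyFlinkApp \
    --query "ApplicationOperationInfoList[?OperationStatus=='FAILED'].[OperationId,Operation]" \
    --output table
```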

**Common code bugs:** Use the rollback capabilities to revert to the last working version. Resolve the bugs, and retry the update.

**Permission issues:** Use the `DescribeApplicationOperation` API to see the required permissions. Update the application permissions and retry.

**Amazon Managed Service for Apache Flink service issues:** Check the AWS Health Dashboard or open a support case.

# Hudi configuration best practices
<a name="troubleshooting-hudi"></a>

To run Hudi connectors on Managed Service for Apache Flink, we recommend the following configuration changes.

**Disable `hoodie.embed.timeline.server`**

The Hudi connector on Flink sets up an embedded timeline server on the Flink JobManager (JM) to cache metadata and improve performance when job parallelism is high. We recommend that you disable this embedded server on Managed Service for Apache Flink, because the service disables non-Flink communication between the JM and TaskManagers (TM).

If this server is enabled, Hudi writes first attempt to connect to the embedded server on the JobManager and then fall back to reading metadata from Amazon S3. This means that Hudi incurs a connection timeout that delays writes and degrades performance on Managed Service for Apache Flink.
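
For example, if you define your Hudi table with the Flink SQL connector, you can disable the option in the table's `WITH` clause. The schema and S3 path below are illustrative.

```
CREATE TABLE hudi_sink (
    record_id BIGINT,
    payload   STRING
) WITH (
    'connector' = 'hudi',
    'path' = 's3://amzn-s3-demo-bucket/hudi/my_table/',
    'hoodie.embed.timeline.server' = 'false'
);
```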

# Apache Flink Flame Graphs
<a name="troubleshooting-update-flamegraphs"></a>

Flame Graphs are enabled by default on applications in Managed Service for Apache Flink versions that support them. Keeping a Flame Graph open may affect application performance, as mentioned in the [Flink documentation](https://nightlies.apache.org/flink/flink-docs-release-1.15//docs/ops/debugging/flame_graphs/).

If you want to disable Flame Graphs for your application, create a support case to request that they be disabled for your application ARN. For more information, see the [AWS Support Center](https://console.aws.amazon.com/support/home#/).

# Credential provider issue with EFO connector 1.15.2
<a name="troubleshooting-credential-provider"></a>

There is a [known issue](https://issues.apache.org/jira/browse/FLINK-29205) with Kinesis Data Streams EFO connector versions up to and including 1.15.2, where the `FlinkKinesisConsumer` does not respect the credential provider configuration. Valid configurations are disregarded, which results in the `AUTO` credential provider being used. This can cause a problem when using cross-account access to Kinesis with the EFO connector.

To resolve this error, use EFO connector version 1.15.3 or higher.

# Applications with unsupported Kinesis connectors
<a name="troubleshooting-unsupported-kinesis-connectors"></a>

Managed Service for Apache Flink version 1.15 or later will [automatically reject applications from starting or updating](https://docs.aws.amazon.com/managed-flink/latest/java/flink-1-15-2.html) if they use unsupported Kinesis connector versions (earlier than 1.15.2) bundled into application JARs or archives (ZIP).

## Rejection error
<a name="troubleshooting-unsupported-kinesis-connectors-error"></a>

You will see the following error when submitting create or update application requests:

```
An error occurred (InvalidArgumentException) when calling the CreateApplication operation: An unsupported Kinesis connector version has been detected in the application. Please update flink-connector-kinesis to any version equal to or newer than 1.15.2.
For more information refer to connector fix: https://issues.apache.org/jira/browse/FLINK-23528
```

## Steps to remediate
<a name="troubleshooting-unsupported-kinesis-connectors-steps-to-remediate"></a>
+ Update the application’s dependency on `flink-connector-kinesis`. If you are using Maven as your project’s build tool, follow [Update a Maven dependency](#troubleshooting-unsupported-kinesis-connectors-update-maven-dependency). If you are using Gradle, follow [Update a Gradle dependency](#troubleshooting-unsupported-kinesis-connectors-update-gradle-dependency).
+ Repackage the application.
+ Upload to an Amazon S3 bucket.
+ Resubmit the create or update application request with the revised application you just uploaded to the Amazon S3 bucket.
+ If you continue to see the same error message, re-check your application dependencies. If the problem persists, create a support ticket.
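
For example, after updating the dependency, the repackage-and-upload steps look like the following. The artifact and bucket names are placeholders.

```
# Repackage the application, then upload the fixed JAR to Amazon S3.
mvn clean package
aws s3 cp target/my-app-1.0.jar s3://amzn-s3-demo-bucket/my-app-1.0.jar
```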

### Update a Maven dependency
<a name="troubleshooting-unsupported-kinesis-connectors-update-maven-dependency"></a>

1. Open the project’s `pom.xml`.

1. Find the project’s dependencies. They look like:

   ```
   <project>
   
       ...
   
       <dependencies>
   
           ...
   
           <dependency>
               <groupId>org.apache.flink</groupId>
               <artifactId>flink-connector-kinesis</artifactId>
           </dependency>
   
           ...
   
       </dependencies>
   
       ...
   
   </project>
   ```

1. Update `flink-connector-kinesis` to a version that is equal to or newer than 1.15.2. For instance:

   ```
   <project>
   
       ...
   
       <dependencies>
   
           ...
   
           <dependency>
               <groupId>org.apache.flink</groupId>
               <artifactId>flink-connector-kinesis</artifactId>
               <version>1.15.2</version>
           </dependency>
   
           ...
   
       </dependencies>
   
       ...
   
   </project>
   ```

### Update a Gradle dependency
<a name="troubleshooting-unsupported-kinesis-connectors-update-gradle-dependency"></a>

1. Open the project’s `build.gradle` (or `build.gradle.kts` for Kotlin applications). 

1. Find the project’s dependencies. They look like:

   ```
   ...
   
   dependencies {
   
       ...
   
       implementation("org.apache.flink:flink-connector-kinesis")
   
       ...
   
   }
   
   ...
   ```

1. Update `flink-connector-kinesis` to a version that is equal to or newer than 1.15.2. For instance:

   ```
   ...
   
   dependencies {
   
       ...
   
       implementation("org.apache.flink:flink-connector-kinesis:1.15.2")
   
       ...
   
   }
   
   ...
   ```
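
Before resubmitting, you can check which connector version ended up bundled in the repackaged JAR. Maven-built artifacts usually carry a `pom.properties` file with the version; the JAR path below is a placeholder.

```
# Print the bundled connector's Maven metadata, if present in the JAR.
unzip -p target/my-app-1.0.jar \
    META-INF/maven/org.apache.flink/flink-connector-kinesis/pom.properties
```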

# Compile error: "Could not resolve dependencies for project"
<a name="troubleshooting-compile"></a>

In order to compile the Managed Service for Apache Flink sample applications, you must first download and compile the Apache Flink Kinesis connector and add it to your local Maven repository. If the connector hasn't been added to your repository, a compile error similar to the following appears:

```
Could not resolve dependencies for project your project name: Failure to find org.apache.flink:flink-connector-kinesis_2.11:jar:1.8.2 in https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced
```

To resolve this error, you must download the Apache Flink source code (version 1.8.2 from [https://flink.apache.org/downloads.html](https://flink.apache.org/downloads.html)) for the connector. For instructions about how to download, compile, and install the Apache Flink source code, see [Using the Apache Flink Kinesis Streams connector with previous Apache Flink versions](earlier.md#how-creating-apps-building-kinesis).
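
The download-and-install flow is roughly as follows; see the linked instructions for the authoritative steps. The `-Pinclude-kinesis` profile is what builds the Kinesis connector, and `-DskipTests` speeds up the build.

```
# Download and unpack the Flink 1.8.2 source release.
wget https://archive.apache.org/dist/flink/flink-1.8.2/flink-1.8.2-src.tgz
tar -xvf flink-1.8.2-src.tgz
cd flink-1.8.2

# Build and install into the local Maven repository, including the Kinesis connector.
mvn clean install -Pinclude-kinesis -DskipTests
```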

# Invalid choice: "kinesisanalyticsv2"
<a name="troubleshooting-cli-update"></a>

To use v2 of the Managed Service for Apache Flink API, you need the latest version of the AWS Command Line Interface (AWS CLI).

For information about upgrading the AWS CLI, see [ Installing the AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/userguide/installing.html) in the *AWS Command Line Interface User Guide*.
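
After upgrading, you can confirm that your AWS CLI recognizes the v2 API. The upgrade command shown assumes a pip-based installation; use the installer appropriate for your platform.

```
# Upgrade (pip-based installations only), then verify.
pip install --upgrade awscli
aws --version
aws kinesisanalyticsv2 help
```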

# UpdateApplication action isn't reloading application code
<a name="troubleshooting-update"></a>

The [UpdateApplication](https://docs.aws.amazon.com/managed-flink/latest/apiv2/API_UpdateApplication.html) action will not reload application code with the same file name if no S3 object version is specified. To reload application code with the same file name, enable versioning on your S3 bucket, and specify the new object version using the `ObjectVersionUpdate` parameter. For more information about enabling object versioning in an S3 bucket, see [Enabling or Disabling Versioning](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-versioning.html).
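
A sketch of both steps with the AWS CLI follows. The bucket, application name, version IDs, and object version are placeholders; the nested update structure follows the `UpdateApplication` request shape.

```
# One-time: enable versioning on the code bucket.
aws s3api put-bucket-versioning \
    --bucket amzn-s3-demo-bucket \
    --versioning-configuration Status=Enabled

# Point the application at the new object version of the same file.
aws kinesisanalyticsv2 update-application \
    --application-name MyFlinkApp \
    --current-application-version-id 5 \
    --application-configuration-update '{
        "ApplicationCodeConfigurationUpdate": {
            "CodeContentUpdate": {
                "S3ContentLocationUpdate": {
                    "ObjectVersionUpdate": "EXAMPLE-OBJECT-VERSION-ID"
                }
            }
        }
    }'
```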

# S3 StreamingFileSink FileNotFoundExceptions
<a name="troubleshooting-s3sink"></a>

Managed Service for Apache Flink applications can run into an In-progress part file `FileNotFoundException` when starting from a snapshot, if an In-progress part file referred to by the savepoint is missing. When this failure mode occurs, the application's operator state is usually not recoverable, and the application must be restarted without a snapshot using `SKIP_RESTORE_FROM_SNAPSHOT`. See the following example stack trace:

```
java.io.FileNotFoundException: No such file or directory: s3://amzn-s3-demo-bucket/pathj/INSERT/2023/4/19/7/_part-2-1234_tmp_12345678-1234-1234-1234-123456789012
        at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2231)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:699)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
        at org.apache.flink.fs.s3hadoop.HadoopS3AccessHelper.getObject(HadoopS3AccessHelper.java:98)
        at org.apache.flink.fs.s3.common.writer.S3RecoverableMultipartUploadFactory.recoverInProgressPart(S3RecoverableMultipartUploadFactory.java:97)
...
```

Flink `StreamingFileSink` writes records to filesystems supported by the Flink [file systems abstraction](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/filesystems/overview/). Because the incoming streams can be unbounded, data is organized into part files of finite size, with new files added as data is written. The part lifecycle and rollover policy determine the timing, size, and naming of the part files.

During checkpointing and savepointing (snapshotting), all Pending part files are renamed and committed. In-progress part files, however, are not committed; they are renamed, and a reference to them is kept within the checkpoint or savepoint metadata to be used when restoring jobs. These In-progress part files eventually roll over to Pending, and are then renamed and committed by a subsequent checkpoint or savepoint.

Following are the root causes of, and mitigations for, a missing In-progress part file:
+ Stale snapshot used to start the Managed Service for Apache Flink application – only the latest system snapshot taken when an application is stopped or updated can be used to start a Managed Service for Apache Flink application with Amazon S3 StreamingFileSink. To avoid this class of failure, use the latest system snapshot.
  + This happens, for example, when you pick a snapshot created using `CreateSnapshot` instead of the system-triggered snapshot taken during stop or update. The older snapshot's savepoint keeps an out-of-date reference to an In-progress part file that has since been renamed and committed by a subsequent checkpoint or savepoint.
  + This can also happen when a system-triggered snapshot from a non-latest stop or update event is picked. An example is an application with system snapshots disabled but with `RESTORE_FROM_LATEST_SNAPSHOT` configured. Generally, Managed Service for Apache Flink applications with the Amazon S3 `StreamingFileSink` should always have system snapshots enabled and `RESTORE_FROM_LATEST_SNAPSHOT` configured.
+ In-progress part file removed – Because the In-progress part file is located in an S3 bucket, it can be removed by other components or actors that have access to the bucket.
  + This can happen when you have stopped your application for too long and the In-progress part file referred to by your application's savepoint has been removed by the [S3 bucket MultiPartUpload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpu-abort-incomplete-mpu-lifecycle-config.html) lifecycle policy. To avoid this class of failure, make sure that your S3 bucket MPU lifecycle policy covers a sufficiently long period for your use case.
  + This can also happen when the In-progress part file has been removed manually or by another one of your system's components. To avoid this class of failure, make sure that In-progress part files are not removed by other actors or components.
+ Race condition where an automated checkpoint is triggered after savepoint – This affects Managed Service for Apache Flink versions up to and including 1.13. This issue is fixed in Managed Service for Apache Flink version 1.15. Migrate your application to the latest version of Managed Service for Apache Flink to prevent recurrence. We also suggest migrating from StreamingFileSink to [FileSink](https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/connectors/datastream/filesystem/#file-sink).
  + When applications are stopped or updated, Managed Service for Apache Flink triggers a savepoint and stops the application in two steps. If an automated checkpoint triggers between the two steps, the savepoint will be unusable as its In-progress part file would be renamed and potentially committed.
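
If you migrate from `StreamingFileSink` to `FileSink` as suggested, a minimal sketch looks like the following. The S3 path is a placeholder, and the API names are from the Flink 1.15 `FileSink`.

```
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

// Given a DataStream<String> stream inside your job:
FileSink<String> sink = FileSink
        .forRowFormat(new Path("s3://amzn-s3-demo-bucket/output/"),
                new SimpleStringEncoder<String>("UTF-8"))
        // Roll part files on checkpoint so they are committed promptly.
        .withRollingPolicy(OnCheckpointRollingPolicy.build())
        .build();
stream.sinkTo(sink);
```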

# FlinkKafkaConsumer issue with stop with savepoint
<a name="troubleshooting-FlinkKafkaConsumer"></a>

When using the legacy `FlinkKafkaConsumer`, there is a possibility that your application may get stuck in `UPDATING`, `STOPPING`, or `SCALING` if you have system snapshots enabled. There is no published fix available for this [issue](https://issues.apache.org/jira/browse/FLINK-28758), so we recommend that you upgrade to the new [KafkaSource](https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/kafka/#kafka-source) to mitigate this issue.

If you are using the `FlinkKafkaConsumer` with snapshots enabled, there is a possibility that when the Flink job processes a stop-with-savepoint request, the `FlinkKafkaConsumer` can fail with a runtime error reporting a `ClosedException`. Under these conditions, the Flink application becomes stuck, manifesting as failed checkpoints.
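
A minimal sketch of the replacement `KafkaSource` follows. The broker address, topic, and group ID are placeholders.

```
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

// Given a StreamExecutionEnvironment env:
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("broker-1:9092")
        .setTopics("input-topic")
        .setGroupId("my-consumer-group")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();
env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");
```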

# Flink 1.15 Async Sink Deadlock
<a name="troubleshooting-async-deadlock"></a>

There is a [known issue](https://issues.apache.org/jira/browse/FLINK-32230) with AWS connectors for Apache Flink that implement the Async Sink interface. This affects applications using Flink 1.15 with the following connectors:
+ For Java applications:
  + KinesisStreamsSink – `org.apache.flink:flink-connector-kinesis`
  + KinesisStreamsSink – `org.apache.flink:flink-connector-aws-kinesis-streams`
  + KinesisFirehoseSink – `org.apache.flink:flink-connector-aws-kinesis-firehose`
  + DynamoDbSink – `org.apache.flink:flink-connector-dynamodb`
+ Flink SQL/TableAPI/Python applications:
  + kinesis – `org.apache.flink:flink-sql-connector-kinesis`
  + kinesis – `org.apache.flink:flink-sql-connector-aws-kinesis-streams`
  + firehose – `org.apache.flink:flink-sql-connector-aws-kinesis-firehose`
  + dynamodb – `org.apache.flink:flink-sql-connector-dynamodb`

Affected applications experience the following symptoms:
+ The Flink job is in the `RUNNING` state, but is not processing data.
+ There are no job restarts.
+ Checkpoints are timing out.

The issue is caused by a [bug](https://github.com/aws/aws-sdk-java-v2/issues/4354) in the AWS SDK that results in certain errors not being surfaced to the caller when using the async HTTP client. As a result, the sink waits indefinitely for an “in-flight request” to complete during a checkpoint flush operation.

This issue has been fixed in the AWS SDK starting from version **2.20.144**.

Following are instructions for updating the affected connectors to use the new version of the AWS SDK in your applications:

**Topics**
+ [Update Java applications](troubleshooting-async-deadlock-update-java-apps.md)
+ [Update Python applications](troubleshooting-async-deadlock-update-python-apps.md)

# Update Java applications
<a name="troubleshooting-async-deadlock-update-java-apps"></a>

Follow the procedures below to update Java applications:

## flink-connector-kinesis
<a name="troubleshooting-async-deadlock-update-java-apps-flink-connector-kinesis"></a>

If the application uses `flink-connector-kinesis`:

The Kinesis connector uses shading to package some dependencies, including the AWS SDK, into the connector JAR. To update the AWS SDK version, use the following procedure to replace these shaded classes:

------
#### [ Maven ]

1. Add the Kinesis connector and the required AWS SDK modules as project dependencies.

1. Configure `maven-shade-plugin`:

   1. Add a filter to exclude the shaded AWS SDK classes when copying the content of the Kinesis connector JAR.

   1. Add relocation rules to move the updated AWS SDK classes to the package expected by the Kinesis connector.

   **pom.xml** 

   ```
   <project>
       ...    
       <dependencies>
           ...
           <dependency>
               <groupId>org.apache.flink</groupId>
               <artifactId>flink-connector-kinesis</artifactId>
               <version>1.15.4</version>
           </dependency>
           
           <dependency>
               <groupId>software.amazon.awssdk</groupId>
               <artifactId>kinesis</artifactId>
               <version>2.20.144</version>
           </dependency>
           <dependency>
               <groupId>software.amazon.awssdk</groupId>
               <artifactId>netty-nio-client</artifactId>
               <version>2.20.144</version>
           </dependency>
           <dependency>
               <groupId>software.amazon.awssdk</groupId>
               <artifactId>sts</artifactId>
               <version>2.20.144</version>
           </dependency>
           ...
       </dependencies>
       ...
       <build>
           ...
           <plugins>
               ...
               <plugin>
                   <groupId>org.apache.maven.plugins</groupId>
                   <artifactId>maven-shade-plugin</artifactId>
                   <version>3.1.1</version>
                   <executions>
                       <execution>
                           <phase>package</phase>
                           <goals>
                               <goal>shade</goal>
                           </goals>
                           <configuration>
                               ...
                               <filters>
                                   ...
                                   <filter>
                                       <artifact>org.apache.flink:flink-connector-kinesis</artifact>
                                       <excludes>
                                           <exclude>org/apache/flink/kinesis/shaded/software/amazon/awssdk/**</exclude>
                                           <exclude>org/apache/flink/kinesis/shaded/org/reactivestreams/**</exclude>
                                           <exclude>org/apache/flink/kinesis/shaded/io/netty/**</exclude>
                                           <exclude>org/apache/flink/kinesis/shaded/com/typesafe/netty/**</exclude>
                                       </excludes>
                                   </filter>
                                   ...
                               </filters>
                               <relocations>
                                   ...
                                   <relocation>
                                       <pattern>software.amazon.awssdk</pattern>
                                       <shadedPattern>org.apache.flink.kinesis.shaded.software.amazon.awssdk</shadedPattern>
                                   </relocation>
                                   <relocation>
                                       <pattern>org.reactivestreams</pattern>
                                       <shadedPattern>org.apache.flink.kinesis.shaded.org.reactivestreams</shadedPattern>
                                   </relocation>
                                   <relocation>
                                       <pattern>io.netty</pattern>
                                       <shadedPattern>org.apache.flink.kinesis.shaded.io.netty</shadedPattern>
                                   </relocation>
                                   <relocation>
                                       <pattern>com.typesafe.netty</pattern>
                                       <shadedPattern>org.apache.flink.kinesis.shaded.com.typesafe.netty</shadedPattern>
                                   </relocation>
                                   ...
                               </relocations>
                              ...
                           </configuration>
                       </execution>
                   </executions>
               </plugin>
               ...
           </plugins>
           ... 
       </build>
   </project>
   ```

------
#### [ Gradle ]

1. Add the Kinesis connector and the required AWS SDK modules as project dependencies.

1. Adjust the `shadowJar` configuration:

   1. Exclude the shaded AWS SDK classes when copying the content of the Kinesis connector JAR.

   1. Relocate the updated AWS SDK classes to the package expected by the Kinesis connector.

   **build.gradle**

   ```
   ...
   dependencies {
       ...
       flinkShadowJar("org.apache.flink:flink-connector-kinesis:1.15.4")
       
       flinkShadowJar("software.amazon.awssdk:kinesis:2.20.144")
       flinkShadowJar("software.amazon.awssdk:sts:2.20.144")
       flinkShadowJar("software.amazon.awssdk:netty-nio-client:2.20.144")
       ...
   }
   ...
   shadowJar {
       configurations = [project.configurations.flinkShadowJar]
   
        exclude("org/apache/flink/kinesis/shaded/software/amazon/awssdk/**/*.class")
       exclude("org/apache/flink/kinesis/shaded/org/reactivestreams/**/*.class")
       exclude("org/apache/flink/kinesis/shaded/io/netty/**/*.class")
       exclude("org/apache/flink/kinesis/shaded/com/typesafe/netty/**/*.class")
       
       relocate("software.amazon.awssdk", "org.apache.flink.kinesis.shaded.software.amazon.awssdk")
       relocate("org.reactivestreams", "org.apache.flink.kinesis.shaded.org.reactivestreams")
       relocate("io.netty", "org.apache.flink.kinesis.shaded.io.netty")
       relocate("com.typesafe.netty", "org.apache.flink.kinesis.shaded.com.typesafe.netty")
   }
   ...
   ```

------

## Other affected connectors
<a name="troubleshooting-async-deadlock-update-java-apps-flink-another-connector"></a>

If the application uses another affected connector:

To update the AWS SDK version, enforce the SDK version in the project build configuration.

------
#### [ Maven ]

Add the AWS SDK bill of materials (BOM) to the dependency management section of the `pom.xml` file to enforce the SDK version for the project.

**pom.xml**

```
<project>
    ...    
    <dependencyManagement>
        <dependencies>
            ...
            <dependency>
                <groupId>software.amazon.awssdk</groupId>
                <artifactId>bom</artifactId>
                <version>2.20.144</version>
                <scope>import</scope>
                <type>pom</type>
            </dependency>
            ...
        </dependencies>
    </dependencyManagement>
    ...
</project>
```

------
#### [ Gradle ]

Add a platform dependency on the AWS SDK bill of materials (BOM) to enforce the SDK version for the project. This requires Gradle 5.0 or newer:

**build.gradle**

```
...
dependencies {
    ...
    flinkShadowJar(platform("software.amazon.awssdk:bom:2.20.144"))
    ...
}
...
```

------

# Update Python applications
<a name="troubleshooting-async-deadlock-update-python-apps"></a>

Python applications can use connectors in two different ways: packaging the connectors and other Java dependencies as part of a single uber JAR, or using the connector JAR directly. To fix applications affected by the Async Sink deadlock:
+ If the application uses an uber JAR, follow the instructions for [Update Java applications](troubleshooting-async-deadlock-update-java-apps.md).
+ If the application uses the connector JAR directly, rebuild the connector JAR from source using the following steps:

**Building connectors from source:**

Prerequisites are similar to the Flink [build requirements](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/flinkdev/building/#build-flink):
+ Java 11
+ Maven 3.2.5

## flink-sql-connector-kinesis
<a name="troubleshooting-async-deadlock-update-python-apps-flink-sql-connector-kinesis"></a>

1. Download source code for Flink 1.15.4:

   ```
   wget https://archive.apache.org/dist/flink/flink-1.15.4/flink-1.15.4-src.tgz
   ```

1. Uncompress source code:

   ```
   tar -xvf flink-1.15.4-src.tgz
   ```

1. Navigate to the Kinesis connector directory:

   ```
   cd flink-1.15.4/flink-connectors/flink-connector-kinesis/
   ```

1. Compile and install the connector JAR, specifying the required AWS SDK version. To speed up the build, use `-DskipTests` to skip test execution and `-Dfast` to skip additional source code checks:

   ```
   mvn clean install -DskipTests -Dfast -Daws.sdkv2.version=2.20.144
   ```

1. Navigate to the SQL connector directory:

   ```
   cd ../flink-sql-connector-kinesis
   ```

1. Compile and install the SQL connector JAR:

   ```
   mvn clean install -DskipTests -Dfast
   ```

1. The resulting JAR will be available at:

   ```
   target/flink-sql-connector-kinesis-1.15.4.jar
   ```

## flink-sql-connector-aws-kinesis-streams
<a name="troubleshooting-async-deadlock-update-python-apps-flink-sql-connector-aws-kinesis-streams"></a>

1. Download source code for Flink 1.15.4:

   ```
   wget https://archive.apache.org/dist/flink/flink-1.15.4/flink-1.15.4-src.tgz
   ```

1. Uncompress source code:

   ```
   tar -xvf flink-1.15.4-src.tgz
   ```

1. Navigate to the Kinesis connector directory:

   ```
   cd flink-1.15.4/flink-connectors/flink-connector-aws-kinesis-streams/
   ```

1. Compile and install the connector JAR, specifying the required AWS SDK version. To speed up the build, use `-DskipTests` to skip test execution and `-Dfast` to skip additional source code checks:

   ```
   mvn clean install -DskipTests -Dfast -Daws.sdk.version=2.20.144
   ```

1. Navigate to the SQL connector directory:

   ```
   cd ../flink-sql-connector-aws-kinesis-streams
   ```

1. Compile and install the SQL connector JAR:

   ```
   mvn clean install -DskipTests -Dfast
   ```

1. The resulting JAR will be available at:

   ```
   target/flink-sql-connector-aws-kinesis-streams-1.15.4.jar
   ```

## flink-sql-connector-aws-kinesis-firehose
<a name="troubleshooting-async-deadlock-update-python-apps-flink-sql-connector-kinesis-firehose"></a>

1. Download source code for Flink 1.15.4:

   ```
   wget https://archive.apache.org/dist/flink/flink-1.15.4/flink-1.15.4-src.tgz
   ```

1. Uncompress source code:

   ```
   tar -xvf flink-1.15.4-src.tgz
   ```

1. Navigate to the connector directory:

   ```
   cd flink-1.15.4/flink-connectors/flink-connector-aws-kinesis-firehose/
   ```

1. Compile and install the connector JAR, specifying the required AWS SDK version. To speed up the build, use `-DskipTests` to skip test execution and `-Dfast` to skip additional source code checks:

   ```
   mvn clean install -DskipTests -Dfast -Daws.sdk.version=2.20.144
   ```

1. Navigate to the SQL connector directory:

   ```
   cd ../flink-sql-connector-aws-kinesis-firehose
   ```

1. Compile and install the SQL connector JAR:

   ```
   mvn clean install -DskipTests -Dfast
   ```

1. The resulting JAR will be available at:

   ```
   target/flink-sql-connector-aws-kinesis-firehose-1.15.4.jar
   ```

## flink-sql-connector-dynamodb
<a name="troubleshooting-async-deadlock-update-python-apps-flink-sql-connector-dynamodb"></a>

1. Download source code for the flink-connector-aws 3.0.0 release:

   ```
   wget https://archive.apache.org/dist/flink/flink-connector-aws-3.0.0/flink-connector-aws-3.0.0-src.tgz
   ```

1. Uncompress source code:

   ```
   tar -xvf flink-connector-aws-3.0.0-src.tgz
   ```

1. Navigate to the connector directory:

   ```
   cd flink-connector-aws-3.0.0
   ```

1. Compile and install the connector JAR, specifying the required AWS SDK and Flink versions. To speed up the build, use `-DskipTests` to skip test execution and `-Dfast` to skip additional source code checks:

   ```
   mvn clean install -DskipTests -Dfast -Dflink.version=1.15.4 -Daws.sdk.version=2.20.144
   ```

1. The resulting jar will be available at:

   ```
   flink-sql-connector-dynamodb/target/flink-sql-connector-dynamodb-3.0.0.jar
   ```

# Amazon Kinesis data streams source processing out of order during re-sharding
<a name="troubleshooting-kinesis-data-streams-processing-out-of-order"></a>

The current FlinkKinesisConsumer implementation doesn't provide strong ordering guarantees between Kinesis shards. This may lead to out-of-order processing during re-sharding of a Kinesis stream, in particular for Flink applications that experience processing lag. Under some circumstances, for example with window operators based on event time, events might get discarded because of the resulting lateness. 

![\[Diagram showing shards and shard consumers with time progression and trim horizon.\]](http://docs.aws.amazon.com/managed-flink/latest/java/images/flink-ts.png)


This is a [known problem](https://issues.apache.org/jira/browse/FLINK-6349) in open source Flink. Until a connector fix is made available, ensure that your Flink applications are not falling behind Kinesis Data Streams during re-partitioning. By ensuring that the processing delay is tolerated by your Flink applications, you can minimize the impact of out-of-order processing and the risk of data loss. 
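
To build intuition for why lag during re-sharding risks data loss, the following sketch simulates how an event-time pipeline with a watermark discards records that arrive too far behind the highest timestamp seen so far. This is a simplified model, not Flink's actual watermark implementation; the function name and values are illustrative.

```python
def filter_late_events(events, allowed_lateness_ms):
    """Simulate event-time processing: an event whose timestamp is older than
    (max timestamp seen so far - allowed lateness) is treated as late and
    dropped, mirroring how an event-time window discards late records."""
    watermark = float("-inf")
    kept, dropped = [], []
    for ts in events:
        watermark = max(watermark, ts)
        if ts < watermark - allowed_lateness_ms:
            dropped.append(ts)
        else:
            kept.append(ts)
    return kept, dropped

# Timestamps read out of order across shards after a re-shard:
kept, dropped = filter_late_events([1000, 5000, 1500, 4900], allowed_lateness_ms=1000)
# The 1500 event arrives after the watermark has advanced to 5000, so it is dropped.
```

The larger the processing lag when a re-shard happens, the further the watermark can run ahead of records still buffered in slow shard consumers, and the more records fall outside the allowed lateness.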

# Real-time vector embedding blueprints FAQ and troubleshooting
<a name="troubleshooting-blueprints"></a>

Review the following FAQ and troubleshooting sections to troubleshoot real-time vector embedding blueprint issues. For more information about real-time vector embedding blueprints, see [Real-time vector embedding blueprints](https://docs.aws.amazon.com/msk/latest/developerguide/ai-vector-embedding-integration-learn-more.html).

For general Managed Service for Apache Flink application troubleshooting, see [https://docs.aws.amazon.com/managed-flink/latest/java/troubleshooting-runtime.html](https://docs.aws.amazon.com/managed-flink/latest/java/troubleshooting-runtime.html).

**Topics**
+ [Real-time vector embedding blueprints - FAQ](troubleshooting-blueprints-FAQ.md)
+ [Real-time vector embedding blueprints - troubleshooting](troubleshooting-blueprints-TS.md)

# Real-time vector embedding blueprints - FAQ
<a name="troubleshooting-blueprints-FAQ"></a>

Review the following FAQ about real-time vector embedding blueprints. For more information about real-time vector embedding blueprints, see [Real-time vector embedding blueprints](https://docs.aws.amazon.com/msk/latest/developerguide/ai-vector-embedding-integration-learn-more.html).

**Topics**
+ [What AWS resources does this blueprint create?](#troubleshooting-blueprints-1)
+ [What are my actions after the AWS CloudFormation stack deployment is complete?](#troubleshooting-blueprints-2)
+ [What should be the structure of the data in the source Amazon MSK topic(s)?](#troubleshooting-blueprints-3)
+ [Can I specify parts of a message to embed?](#troubleshooting-blueprints-4)
+ [Can I read data from multiple Amazon MSK topics?](#troubleshooting-blueprints-5)
+ [Can I use regex to configure Amazon MSK topic names?](#troubleshooting-blueprints-6)
+ [What is the maximum size of a message that can be read from an Amazon MSK topic?](#troubleshooting-blueprints-7)
+ [What type of OpenSearch is supported?](#troubleshooting-blueprints-8)
+ [Why do I need to use a vector search collection, vector index, and add a vector field in my OpenSearch Serverless collection?](#troubleshooting-blueprints-9)
+ [What should I set as the dimension for my vector field?](#troubleshooting-blueprints-10)
+ [What does the output look like in the configured OpenSearch index?](#troubleshooting-blueprints-11)
+ [Can I specify metadata fields to add to the document stored in the OpenSearch index?](#troubleshooting-blueprints-12)
+ [Should I expect duplicate entries in the OpenSearch index?](#troubleshooting-blueprints-13)
+ [Can I send data to multiple OpenSearch indices?](#troubleshooting-blueprints-14)
+ [Can I deploy multiple real-time vector embedding applications in a single AWS account?](#troubleshooting-blueprints-15)
+ [Can multiple real-time vector embedding applications use the same data source or sink?](#troubleshooting-blueprints-16)
+ [Does the application support cross-account connectivity?](#troubleshooting-blueprints-17)
+ [Does the application support cross-Region connectivity?](#troubleshooting-blueprints-18)
+ [Can my Amazon MSK cluster and OpenSearch collection be in different VPCs or subnets?](#troubleshooting-blueprints-19)
+ [What embedding models are supported by the application?](#troubleshooting-blueprints-20)
+ [Can I fine-tune the performance of my application based on my workload?](#troubleshooting-blueprints-21)
+ [What Amazon MSK authentication types are supported?](#troubleshooting-blueprints-22)
+ [What is `sink.os.bulkFlushIntervalMillis` and how do I set it?](#troubleshooting-blueprints-23)
+ [When I deploy my Managed Service for Apache Flink application, from what point in the Amazon MSK topic will it begin reading messages?](#troubleshooting-blueprints-24)
+ [How do I use `source.msk.starting.offset`?](#troubleshooting-blueprints-25)
+ [What chunking strategies are supported?](#troubleshooting-blueprints-26)
+ [How do I read records in my vector datastore?](#troubleshooting-blueprints-27)
+ [Where can I find new updates to the source code?](#troubleshooting-blueprints-28)
+ [Can I make a change to the AWS CloudFormation template and update the Managed Service for Apache Flink application?](#troubleshooting-blueprints-29)
+ [Will AWS monitor and maintain the application on my behalf?](#troubleshooting-blueprints-30)
+ [Does this application move my data outside my AWS account?](#troubleshooting-blueprints-31)

## What AWS resources does this blueprint create?
<a name="troubleshooting-blueprints-1"></a>

To find resources deployed in your account, navigate to AWS CloudFormation console and identify the stack name that starts with the name you provided for your Managed Service for Apache Flink application. Choose the **Resources** tab to check the resources that were created as part of the stack. The following are the key resources that the stack creates:
+ Real-time vector embedding Managed Service for Apache Flink application
+ Amazon S3 bucket for holding the source code for the real-time vector embedding application
+ CloudWatch log group and log stream for storing logs
+ Lambda functions for fetching and creating resources
+ IAM roles and policies for Lambdas, Managed Service for Apache Flink application, and accessing Amazon Bedrock and Amazon OpenSearch Service
+ Data access policy for Amazon OpenSearch Service
+ VPC endpoints for accessing Amazon Bedrock and Amazon OpenSearch Service

## What are my actions after the AWS CloudFormation stack deployment is complete?
<a name="troubleshooting-blueprints-2"></a>

After the AWS CloudFormation stack deployment is complete, access the Managed Service for Apache Flink console and find your blueprint Managed Service for Apache Flink application. Choose the **Configure** tab and confirm that all runtime properties are set up correctly. They may overflow to the next page. When you are confident of the settings, choose **Run**. The application will start ingesting messages from your topic.

To check for new releases, see [https://github.com/awslabs/real-time-vectorization-of-streaming-data/releases](https://github.com/awslabs/real-time-vectorization-of-streaming-data/releases).

## What should be the structure of the data in the source Amazon MSK topic(s)?
<a name="troubleshooting-blueprints-3"></a>

We currently support structured and unstructured source data. 
+ Unstructured data is denoted by `STRING` in `source.msk.data.type`. The data is read as is from the incoming message.
+ We currently support structured JSON data, denoted by `JSON` in `source.msk.data.type`. The data must always be in JSON format. If the application receives a malformed JSON, the application will fail. 
+ When using JSON as the source data type, make sure that every message in all source topics is valid JSON. If you subscribe to one or more topics that do not contain JSON objects with this setting, the application will fail. If one or more topics have a mix of structured and unstructured data, we recommend that you configure the source data as unstructured in the Managed Service for Apache Flink application. 
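
Before choosing `JSON` as `source.msk.data.type`, it can help to verify that a sample of topic messages all parse as JSON. A minimal sketch (the function name is illustrative):

```python
import json

def all_valid_json(messages):
    """Return True only if every message parses as JSON. If any message
    fails to parse, configure the source as STRING (unstructured) instead,
    because a single malformed JSON message will fail the application."""
    for m in messages:
        try:
            json.loads(m)
        except json.JSONDecodeError:
            return False
    return True

all_valid_json(['{"name": "John Doe"}', '{"id": 1}'])   # True: JSON is safe
all_valid_json(['{"name": "John Doe"}', 'plain text'])  # False: use STRING
```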

## Can I specify parts of a message to embed?
<a name="troubleshooting-blueprints-4"></a>
+ For unstructured input data where `source.msk.data.type` is `STRING`, the application will always embed the entire message and store the entire message in the configured OpenSearch index.
+ For structured input data where `source.msk.data.type` is `JSON`, you can configure `embed.input.config.json.fieldsToEmbed` to specify which field in the JSON object should be selected for embedding. This only works for top-level JSON fields; it does not work with nested JSON objects or with messages containing a JSON array. Use `.*` to embed the entire JSON.
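
The top-level-only field selection described above can be sketched as follows. This is an illustrative model of the behavior, not the application's actual code; the function name is an assumption.

```python
import json

def select_text_to_embed(message, fields_to_embed):
    """Mimic top-level field selection for embedding: only top-level keys
    are considered; nested JSON objects and arrays are not traversed."""
    obj = json.loads(message)
    if not fields_to_embed:          # no fields configured: embed the whole JSON
        return json.dumps(obj)
    parts = [str(obj[f]) for f in fields_to_embed if f in obj]
    return " ".join(parts)

msg = '{"name": "John Doe", "address": {"city": "Seattle"}}'
select_text_to_embed(msg, ["name"])  # only the top-level "name" value is embedded
```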

## Can I read data from multiple Amazon MSK topics?
<a name="troubleshooting-blueprints-5"></a>

Yes, you can read data from multiple Amazon MSK topics with this application. Data from all topics must be of the same type (either STRING or JSON) or it might cause the application to fail. Data from all topics is always stored in a single OpenSearch index.

## Can I use regex to configure Amazon MSK topic names?
<a name="troubleshooting-blueprints-6"></a>

`source.msk.topic.names` does not support a list of regular expressions. We support either a comma-separated list of topic names or the `.*` regex to include all topics.
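The accepted forms of `source.msk.topic.names` can be sketched with a small validator. This is an illustrative model of the rule stated above, not the application's actual parsing code:

```python
def parse_topic_names(value):
    """Accept either the literal '.*' (all topics) or a comma-separated
    list of topic names; any other wildcard pattern is rejected."""
    if value == ".*":
        return "ALL_TOPICS"
    names = [t.strip() for t in value.split(",") if t.strip()]
    if any("*" in n or "?" in n for n in names):
        raise ValueError("regex patterns other than '.*' are not supported")
    return names

parse_topic_names("orders, payments")  # a plain comma-separated list is fine
parse_topic_names(".*")                # subscribes to all topics
```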

## What is the maximum size of a message that can be read from an Amazon MSK topic?
<a name="troubleshooting-blueprints-7"></a>

The maximum size of a message that can be processed is limited by the Amazon Bedrock InvokeModel request body limit, currently set to 25,000,000 bytes. For more information, see [InvokeModel](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InvokeModel.html#API_runtime_InvokeModel_RequestBody).

## What type of OpenSearch is supported?
<a name="troubleshooting-blueprints-8"></a>

We support both OpenSearch domains and collections. If you are using an OpenSearch collection, make sure to use a vector collection and create a vector index to use for this application. This lets you use the OpenSearch vector database capabilities for querying your data. To learn more, see [Amazon OpenSearch Service’s vector database capabilities explained](https://aws.amazon.com/blogs/big-data/amazon-opensearch-services-vector-database-capabilities-explained/).

## Why do I need to use a vector search collection, vector index, and add a vector field in my OpenSearch Serverless collection?
<a name="troubleshooting-blueprints-9"></a>

The *vector search* collection type in OpenSearch Serverless provides a similarity search capability that is scalable and high performing. It streamlines building modern machine learning (ML) augmented search experiences and generative artificial intelligence (AI) applications. For more information, see [Working with vector search collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html?icmpid=docs_console_unmapped).

## What should I set as the dimension for my vector field?
<a name="troubleshooting-blueprints-10"></a>

Set the dimension of the vector field based on the embedding model that you want to use. Refer to the following table, and confirm these values from the respective documentation.


**Vector field dimensions**  

| Amazon Bedrock vector embedding model name | Output dimension support offered by the model | 
| --- | --- | 
|  Amazon Titan Text Embeddings V1  | 1,536 | 
|  Amazon Titan Text Embeddings V2  | 1,024 (default), 384, 256 | 
|  Amazon Titan Multimodal Embeddings G1  | 1,024 (default), 384, 256 | 
|  Cohere Embed English  | 1,024 | 
|  Cohere Embed Multilingual  | 1,024 | 
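
The table above can be captured as a small lookup used when building the vector field mapping. The Amazon Bedrock model IDs and the `knn_vector` mapping shape below are assumptions to confirm against the Amazon Bedrock and OpenSearch documentation; the `embedded_data` field name matches the output format described later in this FAQ.

```python
# Supported output dimensions per model (first entry is the model's default).
# Model IDs are assumed Bedrock identifiers; verify against the Bedrock docs.
MODEL_DIMENSIONS = {
    "amazon.titan-embed-text-v1": [1536],
    "amazon.titan-embed-text-v2:0": [1024, 384, 256],
    "amazon.titan-embed-image-v1": [1024, 384, 256],
    "cohere.embed-english-v3": [1024],
    "cohere.embed-multilingual-v3": [1024],
}

def vector_field_mapping(model_id, field_name="embedded_data"):
    """Build an OpenSearch knn_vector field mapping whose dimension
    matches the chosen embedding model's default output dimension."""
    dim = MODEL_DIMENSIONS[model_id][0]
    return {field_name: {"type": "knn_vector", "dimension": dim}}

vector_field_mapping("amazon.titan-embed-text-v2:0")
# {'embedded_data': {'type': 'knn_vector', 'dimension': 1024}}
```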

## What does the output look like in the configured OpenSearch index?
<a name="troubleshooting-blueprints-11"></a>

Every document in the OpenSearch index contains the following fields:
+ **original_data**: The data that was used to generate embeddings. For the STRING type, it is the entire message. For a JSON object, it is the JSON object that was used for embeddings. It could be the entire JSON in the message or specified fields in the JSON. For example, if `name` was selected to be embedded from incoming messages, the output would look as follows:

  ```
  "original_data": "{\"name\":\"John Doe\"}"
  ```
+ **embedded_data**: A vector float array of embeddings generated by Amazon Bedrock
+ **date**: The UTC timestamp at which the document was stored in OpenSearch
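
Putting the fields together, a stored document has roughly the following shape. The values below are made up for illustration, and the embedding vector is truncated; only the field names come from the description above.

```python
import datetime
import json

# Illustrative shape of one document written to the OpenSearch index.
document = {
    "original_data": json.dumps({"name": "John Doe"}),
    "embedded_data": [0.0123, -0.0456, 0.0789],  # float vector from Amazon Bedrock (truncated)
    "date": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}
```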

## Can I specify metadata fields to add to the document stored in the OpenSearch index?
<a name="troubleshooting-blueprints-12"></a>

No, currently, we do not support adding additional fields to the final document stored in the OpenSearch index.

## Should I expect duplicate entries in the OpenSearch index?
<a name="troubleshooting-blueprints-13"></a>

Depending on how you configured your application, you might see duplicate messages in the index. One common reason is an application restart. By default, the application is configured to start reading from the earliest message in the source topic. When you change the configuration, the application restarts and processes all messages in the topic again. To avoid reprocessing, refer to the documentation on how to use `source.msk.starting.offset` and correctly set the starting offset for your application.

## Can I send data to multiple OpenSearch indices?
<a name="troubleshooting-blueprints-14"></a>

No, the application supports storing data to a single OpenSearch index. To set up vectorization output to multiple indices, you must deploy separate Managed Service for Apache Flink applications.

## Can I deploy multiple real-time vector embedding applications in a single AWS account?
<a name="troubleshooting-blueprints-15"></a>

Yes, you can deploy multiple real-time vector embedding Managed Service for Apache Flink applications in a single AWS account if every application has a unique name.

## Can multiple real-time vector embedding applications use the same data source or sink?
<a name="troubleshooting-blueprints-16"></a>

Yes, you can create multiple real-time vector embedding Managed Service for Apache Flink applications that read data from the same topic(s) or store data in the same index.

## Does the application support cross-account connectivity?
<a name="troubleshooting-blueprints-17"></a>

No. For the application to run successfully, the Amazon MSK cluster and the OpenSearch collection must be in the same AWS account where you set up your Managed Service for Apache Flink application.

## Does the application support cross-Region connectivity?
<a name="troubleshooting-blueprints-18"></a>

No, the application only allows you to deploy a Managed Service for Apache Flink application with an Amazon MSK cluster and an OpenSearch collection in the same Region as the Managed Service for Apache Flink application.

## Can my Amazon MSK cluster and OpenSearch collection be in different VPCs or subnets?
<a name="troubleshooting-blueprints-19"></a>

Yes, we support an Amazon MSK cluster and OpenSearch collection in different VPCs and subnets, as long as they are in the same AWS account. See (General MSF troubleshooting) to make sure your setup is correct.

## What embedding models are supported by the application?
<a name="troubleshooting-blueprints-20"></a>

Currently, the application supports all embedding models that are supported by Amazon Bedrock. These include:
+ Amazon Titan Embeddings G1 - Text
+ Amazon Titan Text Embeddings V2
+ Amazon Titan Multimodal Embeddings G1
+ Cohere Embed English
+ Cohere Embed Multilingual

## Can I fine-tune the performance of my application based on my workload?
<a name="troubleshooting-blueprints-21"></a>

Yes. The throughput of the application depends on a number of factors, all of which you can control: 

1. **Managed Service for Apache Flink KPUs**: The application is deployed with a default parallelism of 2 and parallelism per KPU of 1, with automatic scaling turned on. However, we recommend that you configure scaling for the Managed Service for Apache Flink application according to your workloads. For more information, see [Review Managed Service for Apache Flink application resources](https://docs.aws.amazon.com/managed-flink/latest/java/how-resources.html).

1. **Amazon Bedrock**: Based on the selected Amazon Bedrock on-demand model, different quotas might apply. Review the service quotas in Amazon Bedrock to see the workload that the service can handle. For more information, see [Quotas for Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html).

1. **Amazon OpenSearch Service**: In some situations, you might notice that OpenSearch is the bottleneck in your pipeline. For scaling information, see [Sizing Amazon OpenSearch Service domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/sizing-domains.html).

## What Amazon MSK authentication types are supported?
<a name="troubleshooting-blueprints-22"></a>

We only support the IAM MSK authentication type.

## What is `sink.os.bulkFlushIntervalMillis` and how do I set it?
<a name="troubleshooting-blueprints-23"></a>

When sending data to Amazon OpenSearch Service, the bulk flush interval is the interval at which the bulk request is run, regardless of the number of actions or the size of the request. The default value is set to 1 millisecond.

While setting a low flush interval can help make sure that data is indexed in a timely manner, it can also lead to increased overhead from many small bulk requests. Consider your use case and the importance of timely indexing when choosing a flush interval.
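
The timeliness-versus-overhead tradeoff can be illustrated with a simple simulation of interval-based flushing. This is a conceptual sketch, not the sink connector's actual implementation:

```python
def flush_points(event_times_ms, flush_interval_ms):
    """Simulate interval-based bulk flushing: a flush happens whenever
    flush_interval_ms has elapsed since the previous flush, regardless
    of how many actions are buffered. Returns (flush_time, batch_size)."""
    flushes, last_flush, buffered = [], 0, 0
    for t in sorted(event_times_ms):
        buffered += 1
        if t - last_flush >= flush_interval_ms:
            flushes.append((t, buffered))
            last_flush, buffered = t, 0
    return flushes

# A shorter interval means more frequent, smaller bulk requests
# (timelier indexing, more request overhead):
flush_points([1, 2, 3, 10, 20], flush_interval_ms=10)  # [(10, 4), (20, 1)]
```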

## When I deploy my Managed Service for Apache Flink application, from what point in the Amazon MSK topic will it begin reading messages?
<a name="troubleshooting-blueprints-24"></a>

The application will start reading messages from the Amazon MSK topic at the offset specified by the `source.msk.starting.offset` configuration set in the application’s runtime configuration. If `source.msk.starting.offset` is not explicitly set, the default behavior of the application is to start reading from the earliest available message in the topic.

## How do I use `source.msk.starting.offset`?
<a name="troubleshooting-blueprints-25"></a>

Explicitly set `source.msk.starting.offset` to one of the following values, based on the desired behavior: 

+ `EARLIEST`: The default setting, which reads from the oldest offset in the partition. This is a good choice especially if:
  + You have newly created Amazon MSK topics and consumer applications.
  + You need to replay data, so you can build or reconstruct state. This is relevant when implementing the event sourcing pattern or when initializing a new service that requires a complete view of the data history. 
+ `LATEST`: The Managed Service for Apache Flink application reads messages from the end of the partition. We recommend this option if you only care about new messages being produced and don't need to process historical data. With this setting, the consumer ignores existing messages and only reads new messages published by the upstream producer.
+ `COMMITTED`: The Managed Service for Apache Flink application starts consuming messages from the committed offset of the consuming group. If the committed offset doesn't exist, the `EARLIEST` reset strategy is used. 
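
The resolution rules above, including the `COMMITTED` fallback to `EARLIEST`, can be sketched as follows. This is an illustrative model of the documented behavior, not the application's code:

```python
def resolve_starting_offset(setting, committed_offset=None):
    """Resolve where the consumer starts reading, mirroring the behavior
    described above: COMMITTED falls back to EARLIEST when no committed
    offset exists for the consumer group."""
    setting = setting.upper()
    if setting == "EARLIEST":
        return "oldest offset in partition"
    if setting == "LATEST":
        return "end of partition"
    if setting == "COMMITTED":
        if committed_offset is None:
            return "oldest offset in partition"   # EARLIEST fallback
        return f"committed offset {committed_offset}"
    raise ValueError(f"unsupported value: {setting}")

resolve_starting_offset("COMMITTED")                       # no committed offset: falls back
resolve_starting_offset("COMMITTED", committed_offset=42)  # resumes from the committed offset
```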

## What chunking strategies are supported?
<a name="troubleshooting-blueprints-26"></a>

We are using the [langchain](https://js.langchain.com/v0.1/docs/get_started/introduction/) library to chunk inputs. Chunking is only applied if the length of the input is greater than the chosen `maxSegmentSizeInChars`. We support the following five chunking types:
+ `SPLIT_BY_CHARACTER`: Will fit as many characters as it can into each chunk where each chunk length is no greater than maxSegmentSizeInChars. Doesn’t care about whitespace, so it can cut off words.
+ `SPLIT_BY_WORD`: Will find whitespace characters to chunk by. No words are cut off.
+ `SPLIT_BY_SENTENCE`: Sentence boundaries are detected using the Apache OpenNLP library with the English sentence model.
+ `SPLIT_BY_LINE`: Will find new line characters to chunk by.
+ `SPLIT_BY_PARAGRAPH`: Will find consecutive new line characters to chunk by.

The splitting strategies fall back in the preceding order, where the coarser chunking strategies like `SPLIT_BY_PARAGRAPH` fall back toward `SPLIT_BY_CHARACTER`. For example, when using `SPLIT_BY_LINE`, if a line is too long, the line will be sub-chunked by sentence, where each chunk fits in as many sentences as it can. If any sentence is still too long, it will be chunked at the word level. If a word is too long, it will be split by character.
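
The fallback cascade can be sketched as a recursive splitter, where each level hands oversized pieces to the next, finer-grained level. This is a simplified model of the behavior described above (for instance, it uses a naive regex for sentences rather than the Apache OpenNLP model):

```python
import re

# Fallback order: paragraph -> line -> sentence -> word -> character.
SPLITTERS = [
    lambda t: re.split(r"\n\s*\n", t),        # paragraphs: blank-line separated
    lambda t: t.split("\n"),                  # lines
    lambda t: re.split(r"(?<=[.!?])\s+", t),  # sentences (naive, not OpenNLP)
    lambda t: t.split(" "),                   # words
    list,                                     # characters (always fit for max >= 1)
]

def chunk(text, max_chars, level=0):
    """Split text so no chunk exceeds max_chars, falling back to the
    next finer-grained splitter for any piece that is still too long."""
    if len(text) <= max_chars:
        return [text]
    pieces = [p for p in SPLITTERS[level](text) if p]
    out = []
    for p in pieces:
        out.extend(chunk(p, max_chars, level + 1) if len(p) > max_chars else [p])
    return out

chunk("First sentence. Second one here.", max_chars=20)
# ['First sentence.', 'Second one here.']  (fell back to sentence-level splitting)
```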

## How do I read records in my vector datastore?
<a name="troubleshooting-blueprints-27"></a>

1. When `source.msk.data.type` is `STRING`:
   + **original_data**: The entire original string from the Amazon MSK message.
   + **embedded_data**: Embedding vector created from `chunk_data` if it is not empty (chunking applied), or created from `original_data` if no chunking was applied.
   + **chunk_data**: Only present when the original data was chunked. Contains the chunk of the original message that was used to create the embedding in `embedded_data`.

1. When `source.msk.data.type` is `JSON`:
   + **original_data**: The entire original JSON from the Amazon MSK message *after* JSON key filtering is applied. 
   + **embedded_data**: Embedding vector created from `chunk_data` if it is not empty (chunking applied), or created from `original_data` if no chunking was applied.
   + **chunk_key**: Only present when the original data was chunked. Contains the JSON key that the chunk is from in `original_data`. For example, it can look like `jsonKey1.nestedJsonKeyA` for nested keys or *metadata* in the example of `original_data`.
   + **chunk_data**: Only present when the original data was chunked. Contains the chunk of the original message that was used to create the embedding in `embedded_data`.


## Where can I find new updates to the source code?
<a name="troubleshooting-blueprints-28"></a>

Go to [https://github.com/awslabs/real-time-vectorization-of-streaming-data/releases](https://github.com/awslabs/real-time-vectorization-of-streaming-data/releases) to check for new releases.

## Can I make a change to the AWS CloudFormation template and update the Managed Service for Apache Flink application?
<a name="troubleshooting-blueprints-29"></a>

No, making a change to the AWS CloudFormation template does not update the Managed Service for Apache Flink application. Any change to the AWS CloudFormation template requires deploying a new stack.

## Will AWS monitor and maintain the application on my behalf?
<a name="troubleshooting-blueprints-30"></a>

No, AWS will not monitor, scale, update or patch this application on your behalf. 

## Does this application move my data outside my AWS account?
<a name="troubleshooting-blueprints-31"></a>

All data read and stored by the Managed Service for Apache Flink application stays within your AWS account and never leaves your account.

# Real-time vector embedding blueprints - troubleshooting
<a name="troubleshooting-blueprints-TS"></a>

Review the following troubleshooting topics about real-time vector embedding blueprints. For more information about real-time vector embedding blueprints, see [Real-time vector embedding blueprints](https://docs.aws.amazon.com/msk/latest/developerguide/ai-vector-embedding-integration-learn-more.html).

**Topics**
+ [My CloudFormation stack deployment has failed or rolled back. What can I do to fix it?](#troubleshooting-blueprints-deployment)
+ [I don't want my application to start reading messages from the beginning of the Amazon MSK topics. What do I do?](#troubleshooting-blueprints-beginning)
+ [How do I know if there is an issue with my Managed Service for Apache Flink application and how can I debug it?](#troubleshooting-blueprints-debug)
+ [What are the key metrics that I should be monitoring for my Managed Service for Apache Flink application?](#troubleshooting-blueprints-metrics)

## My CloudFormation stack deployment has failed or rolled back. What can I do to fix it?
<a name="troubleshooting-blueprints-deployment"></a>
+ Go to your AWS CloudFormation stack and find the reason for the stack failure. It could be related to missing permissions or AWS resource name collisions, among other causes. Fix the root cause of the deployment failure. For more information, see the [AWS CloudFormation troubleshooting guide](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/troubleshooting.html#basic-ts-guide).
+ [Optional] There can only be one VPC endpoint per service per VPC. If you deployed multiple real-time vector embedding blueprints that write to Amazon OpenSearch Service collections in the same VPC, they might share VPC endpoints. These endpoints might already be present in your account for the VPC, or the first real-time vector embedding blueprint stack creates VPC endpoints for Amazon Bedrock and Amazon OpenSearch Service that all other stacks deployed in your account then use. If a stack fails, check whether that stack created VPC endpoints for Amazon Bedrock and Amazon OpenSearch Service, and delete them if they are not used anywhere else in your account. For steps for deleting VPC endpoints, refer to the documentation on how to safely delete your application.
+ There might be other services or applications in your account using the VPC endpoint. Deleting it might disrupt network connectivity for those services, so be careful when deleting these endpoints.

## I don't want my application to start reading messages from the beginning of the Amazon MSK topics. What do I do?
<a name="troubleshooting-blueprints-beginning"></a>

You must explicitly set `source.msk.starting.offset` to one of the following values, depending on the desired behavior:
+ **Earliest offset**: The oldest offset in the partition.
+ **Latest offset**: Consumers will read messages from the end of the partition.
+ **Committed offset**: Read from the last message the consumer processed within a partition.

## How do I know if there is an issue with my Managed Service for Apache Flink application and how can I debug it?
<a name="troubleshooting-blueprints-debug"></a>

Use the [Managed Service for Apache Flink troubleshooting guide](https://docs.aws.amazon.com/managed-flink/latest/java/troubleshooting-runtime.html) to debug Managed Service for Apache Flink related issues with your application.

## What are the key metrics that I should be monitoring for my Managed Service for Apache Flink application?
<a name="troubleshooting-blueprints-metrics"></a>
+ All metrics available for a regular Managed Service for Apache Flink application can help you monitor your application. For more information, see [Metrics and dimensions in Managed Service for Apache Flink](https://docs.aws.amazon.com/managed-flink/latest/java/metrics-dimensions.html).
+ To monitor Amazon Bedrock metrics, see [Amazon CloudWatch metrics for Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/monitoring.html#runtime-cloudwatch-metrics).
+ We have added two new metrics for monitoring performance of generating embeddings. Find them under the `EmbeddingGeneration` operation name in CloudWatch. The two metrics are:
  + **BedrockTitanEmbeddingTokenCount**: Number of tokens present in a single request to Amazon Bedrock.
  + **BedrockEmbeddingGenerationLatencyMs**: Reports the time taken to send and receive a response from Amazon Bedrock for generating embeddings, in milliseconds.
+ For Amazon OpenSearch Service serverless collections, you can use metrics such as `IngestionDataRate`, `IngestionDocumentErrors` and others. For more information, see [Monitoring OpenSearch Serverless with Amazon CloudWatch](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/monitoring-cloudwatch.html).
+ For OpenSearch provisioned metrics, see [Monitoring OpenSearch cluster metrics with Amazon CloudWatch](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-cloudwatchmetrics.html).