You can use multimodal RAG to search documents such as PDFs, images, or videos (available for Amazon Nova Lite and Amazon Nova Pro). With the multimodal understanding capabilities of Amazon Nova, you can build RAG systems over mixed data that contains both text and images. You can do this either through Amazon Bedrock Knowledge Bases or by building a custom multimodal RAG system.
To create a multimodal RAG system:
- Create a database of multimodal content.
- Run inference in multimodal RAG systems for Amazon Nova:
  - Enable users to query the content.
  - Return the content back to Amazon Nova.
  - Enable Amazon Nova to respond to the original user query.
Creating a custom multimodal RAG system with Amazon Nova
To create a database of multimodal content with Amazon Nova, you can use one of two common approaches. The accuracy of either approach depends on your specific application.
Creating a vector database using multimodal embeddings.
You can create a vector database of multimodal data by using an embeddings model such as Amazon Titan Multimodal Embeddings. To do this, first parse your documents into text, tables, and images. Then pass the parsed content to the multimodal embeddings model of your choice to create the vector database. We recommend connecting each embedding to the portion of the document in its original modality so that the retriever can return search results in the original content modality.
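The following is a minimal sketch of this approach, assuming the Titan Multimodal Embeddings G1 model (amazon.titan-embed-image-v1), the us-east-1 Region, and hypothetical file names; adapt the model ID, Region, and vector store to your application.

import base64
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_image(image_path, caption=None):
    # Embed an image (optionally paired with a short caption) with Titan
    # Multimodal Embeddings G1.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    body = {"inputImage": image_b64}
    if caption:
        body["inputText"] = caption
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["embedding"]

def embed_text(text):
    # The same model also accepts text-only input.
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

# Store each embedding with a pointer back to the source content so that the
# retriever can return search results in their original modality.
index = [
    {"embedding": embed_image("slide_03.png"), "source": "slide_03.png", "modality": "image"},
    {"embedding": embed_text("Example text chunk parsed from the document..."), "source": "report.pdf#page=4", "modality": "text"},
]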
Creating a vector database using text embeddings.
To use a text embeddings model, first use Amazon Nova to convert images into text. Then create a vector database from those descriptions by using a text embeddings model such as Titan Text Embeddings V2.
For documents such as slides and infographics, you can turn each part of the document into a text description and then create a vector database with the text descriptions. To create a text description, use Amazon Nova through the Converse API with a prompt such as the following:
You are a story teller and narrator who will read an image and tell all the details of the image as a story. Your job is to scan the entire image very carefully. Please start to scan the image from top to bottom and retrieve all important parts of the image.
In creating the story, you must first pay attention to all the details and extract relevant resources. Here are some important sources:
1. Please identify all the textual information within the image. Pay attention to text headers, sections/subsections, anecdotes, and paragraphs. Especially, extract the pure-textual data not directly associated with graphs.
2. Please make sure to describe every single graph you find in the image.
3. Please include all the statistics in the graph and describe each chart in the image in detail.
4. Please do NOT add any content that is not shown in the image to the description. It is critical to keep the description truthful.
5. Please do NOT use your own domain knowledge to infer and conclude concepts in the image. You are only a narrator and you must present every single data point available in the image.
Please give me a detailed narrative of the image. While you pay attention to details, you MUST give the explanation in clear English that is understandable by a general user.
Amazon Nova will then respond with a text description of the provided image. The text descriptions can then be sent to the text embeddings model to create the vector database.
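As a sketch of this flow, the following example assumes the Amazon Nova Lite inference profile ID us.amazon.nova-lite-v1:0 (substitute the model ID available in your Region), the Titan Text Embeddings V2 model ID amazon.titan-embed-text-v2:0, and a hypothetical image file; IMAGE_DESCRIPTION_PROMPT stands for the narrator prompt shown above.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder for the narrator prompt shown above.
IMAGE_DESCRIPTION_PROMPT = "You are a story teller and narrator who will read an image..."

def describe_image(image_bytes):
    # Ask Amazon Nova for a detailed text description of the image.
    response = bedrock_runtime.converse(
        modelId="us.amazon.nova-lite-v1:0",
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                {"text": IMAGE_DESCRIPTION_PROMPT},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

def embed_description(description):
    # Embed the description with Titan Text Embeddings V2.
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": description}),
    )
    return json.loads(response["body"].read())["embedding"]

with open("infographic.jpg", "rb") as f:
    image_bytes = f.read()

description = describe_image(image_bytes)
vector = embed_description(description)
# Store the vector together with a reference to infographic.jpg so that
# retrieval can return the original image rather than the text description.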
Alternatively, for text-intensive documents such as PDFs, it might be better to separate the images from the text (whether this helps depends on your specific data and application). To do this, first parse the documents into text, tables, and images. The resulting images can then be converted to text by using a prompt like the one shown above. Finally, send the resulting text descriptions of the images, along with any other text, to a text embeddings model to create a vector database. As before, we recommend connecting each embedding to the portion of the document in its original modality so that the retriever can return search results in the original content modality.
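As one illustration of the parsing step, the following sketch uses the open-source PyMuPDF library to split a hypothetical report.pdf into text chunks and embedded images; other parsers, and additional handling for tables, are equally valid choices.

import fitz  # PyMuPDF, one example of a PDF parsing library

doc = fitz.open("report.pdf")
text_chunks, images = [], []

for page_number, page in enumerate(doc):
    # Collect the plain text of each page as a text chunk.
    text = page.get_text().strip()
    if text:
        text_chunks.append({"page": page_number, "text": text})
    # Extract each embedded image so it can be described by Amazon Nova.
    for image_info in page.get_images(full=True):
        xref = image_info[0]
        extracted = doc.extract_image(xref)
        images.append({
            "page": page_number,
            "bytes": extracted["image"],  # raw image bytes
            "format": extracted["ext"],   # for example "jpeg" or "png"
        })

# Text chunks go directly to the text embeddings model; images are first
# converted to text descriptions with Amazon Nova and then embedded.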
Running inference in RAG systems for Amazon Nova
After you set up your vector database, you can enable user queries to search the database, send the retrieved content back to Amazon Nova, and then, using the retrieved content together with the user query, have the Amazon Nova models respond to the original user query.
To query the vector database with text or multimodal user queries, follow the same design choices that you would make when performing RAG for text understanding and generation. You can either use Amazon Nova with Amazon Bedrock Knowledge Bases or build a custom RAG system with Amazon Nova and the Converse API.
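For a custom system, retrieval can be as simple as embedding the user query with the same embeddings model that built the index and ranking entries by cosine similarity. The following sketch assumes an in-memory index like the one built above, where each entry holds an embedding, a source reference, and a modality; a managed vector store would typically replace this in production.

import json

import boto3
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_query(query):
    # Use the same embeddings model that was used to build the index
    # (here, Titan Text Embeddings V2 for a text-embeddings index).
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": query}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

def search(index, query, top_k=3):
    # Rank index entries by cosine similarity to the query embedding and
    # return the top matches, including their original-modality sources.
    query_vector = embed_query(query)
    scored = []
    for item in index:
        vector = np.array(item["embedding"])
        score = float(
            np.dot(query_vector, vector)
            / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
        )
        scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]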
When the retriever returns content to the model, we recommend that you use the content in its original modality. If the original input is an image, return the image to Amazon Nova even if you converted the images to text to create text embeddings. To return images effectively, we recommend that you use the following template to format the retrieved content for the Converse API:
doc_template = """Image {idx} : """
messages = []
for item in search_results:
    messages += [
        {
            "text": doc_template.format(idx=item.idx)
        },
        {
            "image": {
                "format": "jpeg",
                # Placeholder for the image bytes returned by the retriever
                "source": {
                    "bytes": BASE64_ENCODED_IMAGE
                }
            }
        }
    ]
messages.append({"text": question})
system_prompt = """
In this session, you are provided with a list of images and a user's question. Your job is to answer the user's question using only information from the images.
When giving your answer, make sure to first quote the images (by mentioning the image title or image ID) from which you identify relevant information, followed by your reasoning steps and answer.
If the images do not contain information that can answer the question, state that you could not find an exact answer to the question.
Remember to add citations to your response using markers like %[1]%, %[2]% and %[3]% for the corresponding images."""
With the retrieved content and the user query assembled for the Converse API, you can invoke the Converse API, and Amazon Nova will either generate a response or request an additional search. Which happens depends on your instructions and on whether the retrieved content effectively answers the user query.
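The following is a minimal sketch of that final call, assuming the messages content blocks and system_prompt from the template above (with the retrieved image bytes filled in) and the Amazon Nova Lite inference profile ID us.amazon.nova-lite-v1:0; substitute the model ID available in your Region.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="us.amazon.nova-lite-v1:0",
    # Wrap the retrieved content blocks and the user question in a single user turn.
    messages=[{"role": "user", "content": messages}],
    system=[{"text": system_prompt}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])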