Understanding Amazon Chime SDK live transcription messages
The Amazon Chime SDK service shares transcription information with attendees by sending `TranscriptEvent` objects in data messages. A `TranscriptEvent` delivers a `Transcript` or a `TranscriptionStatus`.

A `Transcript` includes results with timestamped, user-attributed words and punctuation. A result may be "partial", in which case the system usually updates it in a subsequent `TranscriptEvent`. This allows you to see transcriptions quickly and apply inline updates later as necessary.

A `TranscriptionStatus` may deliver one of the `TranscriptionStatusType` events, listed in the example in the next section.

Newer versions of the Amazon Chime SDKs include additional data types and helper functions for common processing of a `TranscriptEvent`.
TranscriptEvent
The following example shows the data types that make up a `TranscriptEvent`.
```typescript
type TranscriptEvent = Transcript | TranscriptionStatus;

export class TranscriptEventConverter {
  static from(dataMessage: DataMessage): TranscriptEvent[] {
    // convert DataMessage to TranscriptEvents
    return ...
  }
}

export default class TranscriptionStatus {
  type: TranscriptionStatusType;
  eventTimeMs: number;
  transcriptionRegion: string;
  transcriptionConfiguration: string;
  message?: string;
}

enum TranscriptionStatusType {
  STARTED = 'started',
  INTERRUPTED = 'interrupted',
  RESUMED = 'resumed',
  STOPPED = 'stopped',
  FAILED = 'failed',
}

export default class Transcript {
  results: TranscriptResult[]; // at least one
}

export class TranscriptResult {
  resultId: string;
  isPartial: boolean;
  startTimeMs: number;
  endTimeMs: number;
  alternatives: TranscriptAlternative[]; // most confident first
}

export default class TranscriptAlternative {
  items: TranscriptItem[]; // in start time order
  transcript: string; // concatenated transcript items
  entities?: TranscriptEntity[];
}

export default class TranscriptItem {
  type: TranscriptItemType;
  startTimeMs: number;
  endTimeMs: number;
  attendee: Attendee;
  content: string;
  vocabularyFilterMatch?: boolean;
  confidence?: number;
  stable?: boolean;
}

enum TranscriptItemType {
  PRONUNCIATION = 'pronunciation', // content is a word
  PUNCTUATION = 'punctuation', // content is punctuation
}

export default class TranscriptEntity {
  category: string;
  confidence: number;
  content: string;
  endTimeMs: number;
  startTimeMs: number;
  type?: string;
}

// This is an existing SDK model
export default class Attendee {
  attendeeId: string;
  externalUserId: string;
}
```
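Because a `TranscriptEvent` is either a `Transcript` or a `TranscriptionStatus`, a handler typically branches on the concrete type before processing. The following is a minimal sketch, assuming these types are exported from the `amazon-chime-sdk-js` package:

```typescript
import {
  Transcript,
  TranscriptEvent,
  TranscriptionStatus,
  TranscriptionStatusType,
} from 'amazon-chime-sdk-js';

function handleTranscriptEvent(event: TranscriptEvent): void {
  if (event instanceof TranscriptionStatus) {
    // Lifecycle signal: started, interrupted, resumed, stopped, or failed.
    if (event.type === TranscriptionStatusType.FAILED) {
      console.error(`Transcription failed: ${event.message}`);
    }
  } else {
    // A Transcript carries one or more timestamped, user-attributed results.
    for (const result of event.results) {
      // The most confident alternative is first in the list.
      console.log(result.alternatives[0].transcript);
    }
  }
}
```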
Data guidelines
Keep these guidelines in mind as you work with transcription data.
- `transcription.results` may have more than one result.
- If `transcription.results[i].isPartial` is `true`, then there may be an update for the entire result. The update is likely, but not guaranteed. The update has the same `transcription.results[i].resultId`. If you want to avoid low-confidence transcriptions, you can skip partial results completely. If you want low-latency results, you can display partial results, then overwrite them completely when the update arrives, as shown in the sketch after this list.
- `transcription.results[i].alternatives` always contains at least one entry. If it contains more than one entry, the most confident entry is first in the list. In most cases, you can take the first entry in `transcription.results[i].alternatives` and ignore the others.
- `transcription.results[i].alternatives[j].items` includes an entry for each word or punctuation mark.
- `transcription.results[i].alternatives[j].items[k].content` is what was spoken.
- `transcription.results[i].alternatives[j].items[k].attendee` is the user attribution (who) of the content.
- `transcription.results[i].alternatives[j].items[k].startTimeMs` is the "when" of the content. This enables word-by-word rendering of user-attributed transcription across different users in the order that the words were spoken.
- The `transcription.results[i].alternatives[j].items[k].endTimeMs` field can generally be ignored, but is supplied for completeness of who said what when.
- `transcription.results[i].alternatives[j].items[k].vocabularyFilterMatch` is `true` if the content matched a word in the filter; otherwise, it is `false`.
- `transcription.results[i].alternatives[j].items[k].confidence` is a value between 0 and 1. It indicates the engine's confidence that the item content correctly matches the spoken word, with 0 being the lowest confidence and 1 being the highest.
- `transcription.results[i].alternatives[j].items[k].stable` indicates whether the current word will change in future partial result updates. This value can only be `true` if you enable the partial results stabilization feature by setting `EnablePartialResultsStabilization` to `true` in your request.
- `transcription.results[i].alternatives[j].entities` includes an entry for each entity that the Content Identification or Redaction features detect. The list is only populated if you enable Content Identification or Redaction. An entity can be data such as personally identifiable information or personal health information. You can use entities to highlight, or take action on, words of interest during transcription.
- `transcription.results[i].alternatives[j].entities[k].category` is the entity's category. It equals the Content Identification or Redaction type, such as "PII" or "PHI", which is provided in the request.
- `transcription.results[i].alternatives[j].entities[k].confidence` indicates how confident the engine is that the particular content is truly an entity. Note that this differs from the item-level confidence, which measures how confident the engine is in the correctness of the words themselves.
- `transcription.results[i].alternatives[j].entities[k].content` is the actual text that makes up the entity. This can span multiple items, such as an address.
- `transcription.results[i].alternatives[j].entities[k].startTimeMs` captures the time at which the entity started being spoken.
- `transcription.results[i].alternatives[j].entities[k].endTimeMs` captures the time at which the entity finished being spoken.
- `transcription.results[i].alternatives[j].entities[k].type` is only supported for the Transcribe engine and provides the sub-type of the entity. These are values such as `ADDRESS` and `CREDIT_DEBIT_NUMBER`.
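One way to apply the partial-result guidance above is to key displayed text by `resultId`, so that a later update for the same result overwrites the earlier partial text. The following is a minimal sketch, with rendering reduced to a console log:

```typescript
import { Transcript } from 'amazon-chime-sdk-js';

// Latest text for each result, keyed by resultId. A partial result and its
// later updates share a resultId, so setting the key again overwrites it.
const lines = new Map<string, string>();

function renderTranscript(transcript: Transcript): void {
  for (const result of transcript.results) {
    // Take the most confident alternative; it is first in the list.
    lines.set(result.resultId, result.alternatives[0].transcript);
  }
  // Maps preserve insertion order, so results render in arrival order.
  console.log([...lines.values()].join(' '));
}
```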
Registering event handlers for TranscriptEvents
The following examples use the Amazon Chime SDK client library for JavaScript. However, the pattern is consistent across all Amazon Chime SDKs.
The `TranscriptionController` in the `RealtimeController` and `RealtimeControllerFacade` includes specific functions for adding a handler that processes `TranscriptEvent`s:
```typescript
/**
 * Returns the [[TranscriptionController]] for this real-time controller.
 */
readonly transcriptionController?: TranscriptionController;
```
The `TranscriptionController` has two functions to manage subscribing and unsubscribing to `TranscriptEvent` callbacks:
```typescript
import TranscriptEvent from './TranscriptEvent';

export default interface TranscriptionController {
  /**
   * Subscribe a callback to handle received transcript events.
   */
  subscribeToTranscriptEvent(callback: (transcriptEvent: TranscriptEvent) => void): void;

  /**
   * Unsubscribe a callback from receiving transcript events.
   */
  unsubscribeFromTranscriptEvent(callback: (transcriptEvent: TranscriptEvent) => void): void;
}
```
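The interface is small enough to wrap if you want custom behavior. The following hypothetical decorator logs each subscription and delegates the real work to an inner controller (such as the default implementation described in the next section); it is a sketch, not part of the SDK:

```typescript
import { TranscriptEvent, TranscriptionController } from 'amazon-chime-sdk-js';

// Hypothetical decorator: logs subscriptions, delegates to a real controller.
class LoggingTranscriptionController implements TranscriptionController {
  constructor(private readonly inner: TranscriptionController) {}

  subscribeToTranscriptEvent(callback: (transcriptEvent: TranscriptEvent) => void): void {
    console.log('Subscribing a transcript event callback');
    this.inner.subscribeToTranscriptEvent(callback);
  }

  unsubscribeFromTranscriptEvent(callback: (transcriptEvent: TranscriptEvent) => void): void {
    console.log('Unsubscribing a transcript event callback');
    this.inner.unsubscribeFromTranscriptEvent(callback);
  }
}
```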
Using the optional TranscriptionController
We provide a default implementation of the `TranscriptionController` interface named `DefaultTranscriptionController`. The default implementation in `DefaultRealtimeController` and `DefaultAudioVideoFacade` returns a `DefaultTranscriptionController` object:
```typescript
get transcriptionController(): TranscriptionController {
  return this.realtimeController.transcriptionController;
}
```
`DefaultRealtimeController` also takes an optional `TranscriptionController` object in its constructor. That allows you to override the `DefaultTranscriptionController` behavior. Developer applications subscribe and unsubscribe one or more callbacks through the `TranscriptionController` object of the `AudioVideoFacade`:
```typescript
// Subscribe
this.audioVideo.transcriptionController?.subscribeToTranscriptEvent(this.transcriptEventHandler);

// Unsubscribe
this.audioVideo.transcriptionController?.unsubscribeFromTranscriptEvent(this.transcriptEventHandler);
```
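The `transcriptEventHandler` referenced above is any function that accepts a `TranscriptEvent`. The following is a minimal sketch of wiring subscription to a component's lifecycle, assuming `audioVideo` comes from an established meeting session:

```typescript
import { AudioVideoFacade, TranscriptEvent } from 'amazon-chime-sdk-js';

class TranscriptionView {
  constructor(private readonly audioVideo: AudioVideoFacade) {}

  // Keep a stable reference so the same function can be unsubscribed later.
  private readonly transcriptEventHandler = (event: TranscriptEvent): void => {
    // Branch on Transcript vs. TranscriptionStatus as shown earlier.
    console.log('Received transcript event', event);
  };

  start(): void {
    this.audioVideo.transcriptionController?.subscribeToTranscriptEvent(this.transcriptEventHandler);
  }

  stop(): void {
    this.audioVideo.transcriptionController?.unsubscribeFromTranscriptEvent(this.transcriptEventHandler);
  }
}
```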