Automatic Caption Alignment for Video Streaming


Originally developed to aid the hearing-impaired, closed captioning has become an integral part of the viewing experience. However, a slight misalignment between audio and captions, whether from editing, ad insertion, or frame rate/format conversion, can completely defeat its purpose. Instead of aiding comprehension, a caption that arrives early or late can create confusion. To ensure a high-quality experience for viewers, it’s essential that streaming video providers synchronize the caption text with the audio in a media file. This article discusses how to synchronize captions using automatic caption alignment, the challenges with this approach, and how to overcome them with machine learning.

Challenges in Automatic Caption Alignment

Automatic caption alignment synchronizes caption text with the audio in a media file using machine learning. First, a rough transcript of the audio is created using automatic speech recognition (ASR). Then, the captions are aligned with the transcribed text. This approach can identify and correct different types of alignment issues: addition, deletion, and shift. Addition refers to a situation in which an audio segment has not been captioned. Deletion is a case in which an audio segment has been edited out, but the corresponding captions were not deleted. Shift is the most common scenario and refers to captions being offset by a few seconds when scenes are added or removed. Such shifts can easily be identified by aligning captions with the transcribed text.

While the above approach works well, there are some challenges that stem from ASR-generated transcripts. Primary among these is accuracy, which can be affected by background noise, music, and variations in the tone, pitch, and speed of dialogue. In addition to transcript accuracy, other issues with automatic caption alignment include:

  1. When dialogue repeats within a short period, there is a chance of a misalignment. As an example, we’ll refer to a case where two consecutive caption utterances are the same.

[Figure: consecutive caption utterances]

If, because of background noise, the transcriber doesn’t recognize Caption Segment 1 with high confidence, its caption may end up aligned with the audio of the second segment, which in turn throws off the alignment of neighboring segments.

  2. When captions are aligned one by one, independently of the others, the results for consecutive captions sometimes overlap, and it’s difficult to judge which caption is aligned incorrectly. This happens frequently when the end of one sentence is similar to the beginning of the next.

[Figure: consecutive captions overlap]

If there is background noise and “area” gets transcribed only once, then it might be difficult to judge which caption it was detected for.

  3. Homophones (e.g., “right” and “write”) in the transcript cause mismatches that often affect alignment algorithms and their output.

  4. Disfluencies are typically not transcribed by ASR models. Hence, disfluencies that are present in the captions but not in the transcript also make alignment difficult.

  5. Captions may contain plurals where the transcript does not, or vice versa.

  6. Spelling mistakes: Captions may have a few spelling mistakes. This generally happens with names, as the same pronunciation may have different spellings. It is also a common problem when captions were generated in a live environment and hence contain many errors.

  7. Numerical tokens with multiple possible representations in speech. For example, 1995 can be read as “nineteen ninety-five,” “nineteen hundred and ninety-five,” or “one thousand, nine hundred, and ninety-five.” Matching captions to transcribed text often becomes difficult in such cases.

  8. The presence of symbols, abbreviations, and contractions (e.g., $, @, I’ve, breakin’, could’ve) in the transcript can also result in a mismatch.

Overcoming the Challenges Through Machine Learning

The above issues can be mitigated with the right use of machine learning technologies. Natural language processing (NLP) can be used to pre-process text. For example, NLP can generate multiple representations of numerical tokens so that a caption written as “1995” can be matched against any of its spoken forms in the transcript.
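As a rough illustration of generating several spoken forms for a numerical token, here is a minimal Python sketch. It is not the authors’ implementation: the function name `year_variants`, the restriction to four-digit numbers, and the normalization (lowercase, no hyphens or commas) are all assumptions made for the example.

```python
from typing import Set

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def year_variants(number: int) -> Set[str]:
    """Return normalized spoken forms of a four-digit number (hypothetical helper)."""
    if not 1000 <= number <= 9999:
        raise ValueError("this sketch only handles four-digit numbers")
    hi, lo = divmod(number, 100)          # 1995 -> 19, 95
    thousands, hundreds = divmod(hi, 10)  # 19 -> 1, 9
    variants = set()
    if lo >= 10:
        # Pairwise reading: "nineteen ninety five"
        variants.add(f"{two_digits(hi)} {two_digits(lo)}")
    # Hundreds reading: "nineteen hundred and ninety five"
    tail = f" and {two_digits(lo)}" if lo else ""
    variants.add(f"{two_digits(hi)} hundred{tail}")
    # Full reading: "one thousand nine hundred and ninety five"
    full = f"{ONES[thousands]} thousand"
    if hundreds:
        full += f" {ONES[hundreds]} hundred"
    if lo:
        full += f" and {two_digits(lo)}"
    variants.add(full)
    return variants
```

During matching, a caption token such as “1995” could then be compared against each variant, with the transcript normalized in the same way.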

Furthermore, stop words in captions (the most frequently used words in a language) can be spotted using NLP, and this information can be used while aligning the captions. For English, stop words include “is,” “am,” “are,” “the,” etc. Since these words are repeated so many times in captions, their presence or absence can be given less weight during alignment.

[Figure: captions alignment]
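One way to give stop words less importance is to weight them down when scoring a caption against a stretch of transcript. The sketch below assumes a tiny illustrative stop-word list and an arbitrary weight of 0.2; the function name `weighted_match_score` is hypothetical, and a production system would likely use a fuller list from an NLP toolkit.

```python
from typing import Iterable

# Illustrative subset only; the article names "is", "am", "are", "the".
STOP_WORDS = {"is", "am", "are", "the", "a", "an", "of", "to", "and"}


def weighted_match_score(caption_words: Iterable[str],
                         transcript_words: Iterable[str],
                         stop_weight: float = 0.2) -> float:
    """Fraction of caption weight found in the transcript (0.0 to 1.0).

    Content words count 1.0 each; stop words count `stop_weight` each.
    """
    transcript = {w.lower() for w in transcript_words}
    total = 0.0
    matched = 0.0
    for word in caption_words:
        word = word.lower()
        weight = stop_weight if word in STOP_WORDS else 1.0
        total += weight
        if word in transcript:
            matched += weight
    return matched / total if total else 0.0
```

With this weighting, a caption “the goal is scored” matched against a transcript containing only “goal scored” scores about 0.83, whereas counting every word equally would give 0.5.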

To make the transcript more accurate, a dictionary containing context-specific words and names can be provided to the ML-based transcriber. For example, if the media file is a sports commentary, the players' names and tournament name can be added to the transcriber dictionary, which will resolve issues specific to spelling mistakes for names.

For alignment, one can use dynamic programming to overcome issues related to repeated dialogue and overlaps. Sequential alignment focuses on aligning only the current caption by finding its best match, whereas alignment using dynamic programming takes a broader view: it looks for the overall best match for a group of captions. To use dynamic programming for alignment, all captions are first picked one by one. Then, for each caption, the N possible alignments are found in decreasing order of match score. From these candidates, the final match for each caption is selected so that there is no overlap and the total alignment score (a number representing confidence in finding a caption) for the whole block is the highest. The match selected for an individual caption may not be the one with the highest score, but the sum of the scores for all selected matches will be the highest. This way, an optimal alignment for the block can be ensured.
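The selection step described above can be sketched as a small dynamic program. This is a simplified illustration under stated assumptions, not the authors’ algorithm: each candidate is a hypothetical `(start_sec, end_sec, match_score)` tuple, every caption must pick exactly one candidate, and “no overlap” is taken to mean that each chosen match ends before the next one starts.

```python
from typing import List, Tuple

Candidate = Tuple[float, float, float]  # (start_sec, end_sec, match_score)


def align_block(candidates: List[List[Candidate]]) -> List[Candidate]:
    """Pick one candidate per caption, in order and without overlap,
    so that the sum of match scores is as high as possible."""
    if not candidates:
        return []
    neg_inf = float("-inf")
    # best[i][j]: highest total score for captions 0..i if caption i uses candidate j
    best = [[score for _, _, score in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, pointers = [], []
        for start, _end, score in candidates[i]:
            top, arg = neg_inf, -1
            for k, (_p_start, p_end, _p_score) in enumerate(candidates[i - 1]):
                # The previous caption must finish before this one starts.
                if p_end <= start and best[i - 1][k] > top:
                    top, arg = best[i - 1][k], k
            row.append(top + score if arg >= 0 else neg_inf)
            pointers.append(arg)
        best.append(row)
        back.append(pointers)
    last = best[-1]
    if not last or max(last) == neg_inf:
        raise ValueError("no non-overlapping alignment exists for this block")
    # Trace back from the best final candidate.
    j = max(range(len(last)), key=last.__getitem__)
    chosen = []
    for i in range(len(candidates) - 1, -1, -1):
        chosen.append(candidates[i][j])
        j = back[i][j]
    return chosen[::-1]
```

For two identical consecutive captions whose candidates are both `(20, 22, 0.9)` and `(10, 12, 0.6)`, aligning each caption on its own would place both at 20 seconds. The block-level selection instead assigns the first caption to 10 to 12 seconds and the second to 20 to 22 seconds, even though the first caption’s chosen match is not its highest-scoring one.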

To handle inaccuracies in the text, instead of searching for exact words in the transcript, one can perform fuzzy searching using the Levenshtein distance for word spotting. This helps resolve issues related to spelling mistakes and homophones, and it also helps in matching numerical tokens. If some captions remain unaligned after this step (mostly music, audio description, and noise), their times can be assigned using statistical predictions.
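A minimal sketch of Levenshtein-based word spotting follows. The distance function is the standard edit-distance recurrence; the helper `spot_word` and its normalized threshold of 0.34 are assumptions chosen for illustration and would need tuning on real data.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def spot_word(word: str, transcript_words: List[str],
              max_ratio: float = 0.34) -> Optional[int]:
    """Index of the closest transcript word, or None if nothing is close enough.

    Closeness is the edit distance divided by the longer word's length.
    """
    best_idx, best_ratio = None, None
    for i, candidate in enumerate(transcript_words):
        distance = levenshtein(word.lower(), candidate.lower())
        ratio = distance / max(len(word), len(candidate), 1)
        if ratio <= max_ratio and (best_ratio is None or ratio < best_ratio):
            best_idx, best_ratio = i, ratio
    return best_idx
```

For example, a caption spelling “Katherine” would still be spotted at the transcript word “catherine,” since the two differ by a single character.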

Statistical Predictions of Time for Unaligned Captions

If there are still captions that remain unaligned, they can be timed using the shifts observed in surrounding captions that were aligned with high confidence. To do this, one can group consecutive unaligned captions into blocks and assign each block a time by statistical prediction. Files fall into two types: those that contain a shift and those that contain a drift. A shift is a constant misalignment between audio and captions within a segment. If the misalignment is present throughout the file and keeps increasing or decreasing at a constant rate, it’s called a drift.

For files with shifts, one can calculate the mean shift of the captions surrounding an unaligned block, considering only captions that were aligned with high confidence. This mean shift is then used to assign times to the unaligned block.
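The mean-shift prediction can be sketched in a few lines of Python. The `AlignedCaption` record, the 0.8 confidence cutoff, and the function names are hypothetical and are only meant to make the idea concrete.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class AlignedCaption:
    original_start: float  # start time in the caption file, in seconds
    aligned_start: float   # start time found by alignment, in seconds
    confidence: float      # alignment confidence, 0.0 to 1.0


def mean_shift(neighbors: List[AlignedCaption], min_confidence: float = 0.8) -> float:
    """Average shift of the neighboring captions aligned with high confidence."""
    shifts = [c.aligned_start - c.original_start
              for c in neighbors if c.confidence >= min_confidence]
    if not shifts:
        raise ValueError("no high-confidence neighbors to estimate the shift from")
    return mean(shifts)


def predict_starts(unaligned_starts: List[float], shift: float) -> List[float]:
    """Apply the estimated shift to every caption in an unaligned block."""
    return [start + shift for start in unaligned_starts]
```

If the confident neighbors are shifted by 1.5, 2.0, and 2.5 seconds, an unaligned caption originally at 30.0 seconds would be placed at 32.0 seconds; a low-confidence neighbor with an outlying shift is ignored.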

In files with drift, the shift grows steadily and can become very large by the end of the file. If a chunk of unaligned captions is present, the drift rate can be calculated from the portion of the file that has been aligned confidently. Using this drift rate, a probable shift can be calculated for every unaligned caption and used to assign its time.
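One plausible way to estimate a drift rate is an ordinary least-squares line through the shifts of confidently aligned captions. The sketch below assumes the shift grows linearly with caption time, as the definition of drift suggests; the function names and the choice of least squares are assumptions, not something the article specifies.

```python
from typing import List, Tuple


def fit_drift(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Fit shift = rate * time + offset by least squares.

    `points` holds (caption_time_sec, observed_shift_sec) pairs taken
    from captions that were aligned with high confidence.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two confident captions to fit a drift")
    mean_t = sum(t for t, _ in points) / n
    mean_s = sum(s for _, s in points) / n
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    if variance == 0:
        raise ValueError("all captions share the same time; rate is undefined")
    covariance = sum((t - mean_t) * (s - mean_s) for t, s in points)
    rate = covariance / variance
    offset = mean_s - rate * mean_t
    return rate, offset


def predict_shift(caption_time: float, rate: float, offset: float) -> float:
    """Probable shift for an unaligned caption at `caption_time`."""
    return rate * caption_time + offset
```

With confident captions showing shifts of 0, 1, and 2 seconds at 0, 1,000, and 2,000 seconds, the fitted rate is 0.001 seconds of shift per second of media, so an unaligned caption at 3,000 seconds would be moved by about 3 seconds.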

Caption Alignment for Subtitles

The alignment process can be extended to align audio and subtitles, both in different languages. Consider a case where we are trying to align English audio with Spanish subtitles.

In the first pass, we can transcribe the English audio to generate English text and then do a raw comparison between the English and Spanish segments. We expect the segments to overlap for the most part, with slight deviations. If there is a particularly long segment that is present in only one of the languages, it indicates a mismatch that needs to be manually reviewed. A more advanced approach is to translate the Spanish text to English and then do a detailed comparison. But translation has drawbacks, because there can be multiple ways of saying the same thing in any language. It’s possible that “hogar” (the Spanish word for home) in the subtitles gets translated to “house” while the audio contains “home.” This will result in a mismatch. To address such issues, ML-based semantic analysis can be used. It provides a score that indicates how closely words are related in meaning.
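The first-pass comparison can be approximated by checking time intervals alone. In the sketch below, segments are hypothetical `(start_sec, end_sec)` pairs, and the five-second threshold for a “particularly long” segment is an assumption; it flags segments in one track that have no overlapping counterpart in the other.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def unmatched_segments(segments: List[Segment], others: List[Segment],
                       min_duration: float = 5.0) -> List[Segment]:
    """Long segments in `segments` that overlap nothing in `others`."""
    flagged = []
    for start, end in segments:
        overlaps = any(o_start < end and start < o_end
                       for o_start, o_end in others)
        if not overlaps and end - start >= min_duration:
            flagged.append((start, end))
    return flagged
```

Running the check in both directions (speech against subtitles, then subtitles against speech) would surface long stretches that exist in only one language and may need manual review.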


Captions are a crucial component of VOD streaming services. Not only do they allow OTT service providers to extend their reach and make streaming content accessible to millions of viewers around the globe with ease, but they have also become mandatory from a regulatory standpoint. To ensure a high-quality viewing experience—while maintaining compliance with regional regulations—it’s imperative that audio and captions are in alignment. This can be achieved efficiently and cost-effectively with an auto-alignment system that utilizes machine learning. The result is a viewing experience that meets the high expectations of today’s global audiences and drives growth.

[Editor's note: This is a contributed article from Interra Systems. Streaming Media accepts vendor bylines based solely on their value to our readers.]
