The Video Captioning Conundrum

Today, we are consuming more video than ever before:

  • In 2022, 82% of global internet traffic will come from either video streaming or video downloads, according to Cisco.
  • 84 minutes per day is the global average for video consumption, according to MediaSpot.
  • The global live streaming market is estimated to reach over $247 billion by 2027, according to PR News Wire.

With stats like that, it’s clear video is a major part of our world, from personal to professional.

Unfortunately, while most of us engage with video quite frequently in our everyday lives, from live-streamed conferences to internal meetings at work, there are several common issues that disrupt viewer comprehension of a recorded video.

First, it’s often difficult to understand a video when there are multiple presenters, or when tech glitches disrupt the recording. While many attempt to resolve this with captioners, human captioners are costly and must sink hours into captioning. And that’s what we call the captioning conundrum: video captioning is critical for accessibility and comprehension, but is time sensitive and expensive. Both turnaround time and cost increase if captions are required in multiple languages.

Why do we need video captioning?

To start, it’s important we answer the question: why do you even need video captioning? Today, the need for captioning and transcription on videos is critical for two major reasons: accessibility to video content and increasing viewer comprehension.

The Americans with Disabilities Act (ADA) was passed in 1990 in order to protect citizens with disabilities from discrimination. Many associate the ADA with landmark legal requirements like accessible parking spots, building entryways, and restrooms or water fountains. However, the ADA also requires two groups to make “auxiliary aids,” like captioning or audio descriptions, available to anyone with a disability:

  • Public entities: state and local governments, in both internal and external video communication.
  • Places of public accommodation: public or private businesses that are used by the public at large. Private clubs and religious organizations are exempt.

Adding captions to pre-recorded video conferences, presentations, and meetings is one of the best ways to ensure everyone can enjoy, learn from, and engage with great video content. As an organization using VOD for internal and external meetings or events, adding captioning is adding accessibility to your workplace. Additionally, adding captioning into your organization's VOD is critical for improving viewer comprehension.

As I mentioned before, it’s common for glitches in a stream, an issue with a presenter’s Internet, or differences in accents or communication styles (think: using lots of industry-specific adjectives at a meeting with varied stakeholders) to disrupt a video viewer’s understanding of a presentation or meeting.

Adding captions to your VODs makes it easier for viewers to understand what’s going on, especially when there are multiple speakers or presenters.

There are other key reasons to add captions beyond accessibility. Captions can improve the organic reach of your video content, both internally and externally, beyond what the standard video metadata of title, description, and tags provides. They can also increase engagement for viewers who want to consume video content in sound-sensitive environments, such as a quiet office or public transport.

The captioning conundrum

Unfortunately, many organizations are all too familiar with the three big issues involved with adding captioning to video: the high cost, the long, manual hours put in by a captioner, and the likelihood of human error, as the quality of captioners can vary.

Historically, after a video conference or meeting, companies have hired a human to listen to a recording and manually transcribe the captions. Later, the captions are added to the recorded video and offered as a written transcription.

While human captioning is very helpful, it is far from efficient or cost-effective. Human-transcribed captions are quite expensive, especially when hiring a high-quality transcriber, which is what many of us want to do. Additionally, manual transcription is a time-consuming, tedious process. This manual process can delay the release of a recorded video and render important information outdated. Then, there’s the problem of human error. Human transcribers inevitably make mistakes or typos because, well, they’re human!

So, captioning on videos is critical for user accessibility and comprehension. But, hiring a human transcriptionist to create captioning is expensive and can drag down the release of a video, thanks to the long hours the transcriptionist must put in.

What’s the solution to the captioning conundrum then?

In order to deliver high-quality, accurate captioning quickly, we must harness the power of artificial intelligence and machine learning technology to automatically create captions for videos. The demand for video in today’s world is simply too great to continue to tackle captioning any other way.

Machine Learning Captioning Feature 

Machine Learning is the key to automatically creating captioning on recorded videos, within minutes.

However, not all AI-powered captioning services are the same. In order for captioning to be the most efficient and effective, it’s critical for the captioning function to possess three key features: editable captioning, the ability to be tailored to different dialects, and a customizable library of words, so the ML can be trained to recognize key terms like industry slang, commonly used acronyms, and presenter names.

Editable captions

First, all great ML captioning programs should offer the option to have the captioning automatically added in. This makes for the fastest turnaround and ensures video content can be delivered while still relevant and timely.

However, a great ML captioning program will also allow a user to edit captions before releasing a video. Editing captioning becomes critical when videos are high priority and need to be free from any errors or typos in the captioning. Editing captioning is also helpful when sensitive information needs to be removed from a video before release.

With editable captioning, users can download the captioning, fix errors, remove sensitive information as needed, then upload and distribute the most polished, fully captioned video.
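As a rough illustration of that download, fix, and upload loop, here is a minimal Python sketch that edits a downloaded caption file in the WebVTT format. The function name `edit_captions`, the sample captions, the "Project Falcon" term, and the `[redacted]` mask are all assumptions made for illustration; they do not describe any particular captioning product.

```python
import re

def edit_captions(vtt_text, corrections, redactions, mask="[redacted]"):
    """Fix known errors and mask sensitive terms in caption text.

    Timing lines and the WEBVTT header are left untouched so the
    captions stay in sync with the video.
    """
    edited_lines = []
    for line in vtt_text.splitlines():
        if "-->" in line or line.strip() == "WEBVTT":
            edited_lines.append(line)
            continue
        # Correct known typos or misheard words (whole words, any case).
        for wrong, right in corrections.items():
            pattern = rf"\b{re.escape(wrong)}\b"
            line = re.sub(pattern, lambda m, r=right: r, line, flags=re.IGNORECASE)
        # Mask sensitive terms before the video is released.
        for term in redactions:
            pattern = rf"\b{re.escape(term)}\b"
            line = re.sub(pattern, lambda m: mask, line, flags=re.IGNORECASE)
        edited_lines.append(line)
    return "\n".join(edited_lines)

# Hypothetical downloaded captions containing one typo and one sensitive term.
sample = """WEBVTT

00:00:01.000 --> 00:00:04.000
Welcome to the quartly update.

00:00:04.000 --> 00:00:08.000
Project Falcon ships next month."""

edited = edit_captions(sample, {"quartly": "quarterly"}, ["Project Falcon"])
print(edited)
```

In practice a person would still review the edited file before uploading it, since simple word matching cannot judge context.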

Tailoring to dialects

Another great feature your ML captioning program needs? The ability to tailor the captioning to understand different dialects.

This feature is best for helping viewers gain full comprehension of a speaker, and is especially useful when speakers are speaking the same language, but have regional tweaks.

For example, think about all of the nuances and differences between American English, British English, Australian English, Indian English, Irish English, Scottish English, and Welsh English. A Brit might say “knackered” in a meeting, while an American English speaker would use the term “tired.” With a programmable ML captioning service, different dialects can be programmed into the captioning, to most accurately reflect differences in word choice.
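One simple way to picture dialect tailoring is as a setting chosen per video or per speaker. The Python sketch below maps a dialect label to a standard BCP 47 language tag that could be passed to a captioning service. The mapping `DIALECT_TAGS` and the function `caption_language` are hypothetical names, and which tags a given service accepts is an assumption that would need checking against its documentation.

```python
# Hypothetical mapping from a dialect label to a BCP 47 language tag.
DIALECT_TAGS = {
    "american": "en-US",
    "british": "en-GB",
    "australian": "en-AU",
    "indian": "en-IN",
    "irish": "en-IE",
}

def caption_language(dialect, default="en-US"):
    """Return the language tag to request for a dialect, or a default if it is not listed."""
    return DIALECT_TAGS.get(dialect.strip().lower(), default)

print(caption_language("British"))
print(caption_language("Scottish", default="en-GB"))
```

Dialects without their own entry here, such as Scottish or Welsh English, fall back to whatever default the caller chooses.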

Customizable library

Finally, one more key function to look for in your ML captioning platform is the ability to create a customizable library of vocabulary.

This is especially helpful for industry abbreviations, slang terms, or even presenter names that everyone watching a video might not understand. This customizable library function helps to improve comprehension for VOD viewers, and is particularly useful for internal meetings with a lot of stakeholders.

Additionally, the library of specially customized words will create the most accurate captions possible, which cuts down on the time needed to edit captions and ultimately release a video.
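To make the idea concrete, here is a minimal Python sketch of a custom library applied to caption text. It is only a post-processing approximation: captioning services typically use such a library during recognition itself. The library contents, the misrecognized variants, and the function name `apply_library` are assumptions made for illustration.

```python
import re

# Hypothetical library: each preferred term maps to misrecognitions
# that have shown up in earlier captions.
CUSTOM_LIBRARY = {
    "VOD": ["vee oh dee", "v o d"],
    "StreamShark": ["stream shark", "streamshark"],
}

def apply_library(text, library):
    """Replace known misrecognitions with the preferred spelling of each term."""
    for preferred, variants in library.items():
        # Try longer variants first so a short variant cannot split a longer match.
        for variant in sorted(variants, key=len, reverse=True):
            pattern = rf"\b{re.escape(variant)}\b"
            text = re.sub(pattern, lambda m, p=preferred: p, text, flags=re.IGNORECASE)
    return text

print(apply_library("Our vee oh dee library runs on stream shark.", CUSTOM_LIBRARY))
```

Keeping the library as plain data means a non-developer can add presenter names or acronyms ahead of a meeting without touching the code.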


Captioning is a core part of video today. When used correctly, it increases accessibility and comprehension for our videos. Creative techniques can help organizations produce clear, correct captions in an efficient manner.

There’s no need for a conundrum when it comes to captioning.

James Broberg is the Founder and CEO of StreamShark, an end-to-end Live and On-demand video streaming platform for enterprises. Prior to StreamShark, James was a research fellow in content delivery networks and cloud computing at the University of Melbourne.

[Editor's note: This is a contributed article from StreamShark. Streaming Media accepts vendor bylines based solely on their value to our readers.]
