ttconv Simplifies Subtitling and Captioning

Recommendation engines are strange things. Recently my better half has been retraining from interpreting and translation to work in subtitling. As a family, we always have captioning/subtitling active (we will break out the differences later), since my daughter has significant hearing loss in one ear.

And of course, I work with streaming, so I guess I shouldn’t have been surprised when David Ronca, director of video encoding at Facebook, appeared at the top of my LinkedIn newsfeed with this post:

[David Ronca's LinkedIn post]

Now, while I have worked with emerging streaming video tech for 25+ years, subtitles have always been "yet another data stream" to handle rather than "a data stream I have to understand the particulars of." Like many readers, I have heard of .srt, WebVTT, and .stl files. I have merged them with compressed video and audio, packaged them into container formats, and shipped them in quantity to audiences all over the world. I felt I was familiar with subtitling and timed text.

Then, on the same day I saw David's post, I found myself looking over my (fairly non-technical) wife's shoulder as she was being trained to use a subtitling system. I saw minute timing and positioning on an edit-decision timeline, in an application entirely focused on optimizing the subtitle data tied to particular cuts and edits of programming, and I realized I really had no idea about the details of subtitle formats.

And while most of us in the industry could talk about the history of audio and video compression and distribution formats, few of us really understand timed text: the protocols and data structures that are in the market, and why they exist.

So I reached out to the original poster of the post that David had shared, Pierre-Anthony Lemieux, to ask him for a whistle-stop tour of the nuances of timed text, and to explain more about what ttconv is and why David (and others) are excited about it.

Pierre-Anthony has been focused on the specifics of timed text and subtitling for around eight years. It has been an interesting time. These technologies obviously date back many years before the internet was even an idea in Vint Cerf and Bob Kahn's minds.

Of course, text was critical in some of the very earliest movies.

[An intertitle from the silent-film era]

Intertitles were critical in the silent movie era for obvious reasons. Stopping the film to display an intertitle, then returning to the programming, meant that localizing films for international, multi-language distribution was really simple: translate and swap out the intertitles, and the film could suddenly reach wider global audiences. For the first 20 years of movies this worked well, but in 1927 sound arrived and made intertitles largely redundant. And so the complexity of dubbing audio (or even completely refilming!) for multilingual audiences emerged.

This led to scrolling captions and the very earliest attempts to provide timed text with localization that could be versioned in post.

Through the subsequent 50 or 60 years, localization was the key driver for creating timed text—new markets and new revenue made the extra production cost worthwhile. It wasn't until 1980, when ABC, NBC, and PBS debuted closed captioning for the deaf, that accessibility rightly also became a key driver behind these technologies. See below for a list of definitions of the different types of timed text, including subtitles, captioning, and more.

Timed Text Embraces Both Subtitles and Closed Captioning

Annex B     Timed Text Definitions (Normative)

B.1           Subtitles

Subtitles are a textual representation of the audio track, usually just the dialog and usually in a language other than the audio track dialog, intended for foreign language audience.

B.2           Captions for the Hearing Impaired

Captions for the Hearing Impaired are a textual representation of the audio track, usually including all sounds, and usually in the same language as the audio track dialog, intended for hearing impaired audiences.

B.3           Text for the Visually Impaired

Text for the Visually Impaired is a textual description of visual elements of the content and usually in the same language as the audio track dialog, intended for visually impaired audiences.

B.4           Commentary

Commentary provides extra information about the associated content (e.g. Producer Commentary) usually in the same language as the audio track dialog.

B.5           Karaoke

Karaoke is a textual representation of songs’ lyrics, usually in the same language as the associated song.

B.6           Forced Narrative

Timed text related to foreign or alien language or translation of text that appears in media, such as in a sign, that is intended to be displayed if no other timed text, such as subtitles or captions, is enabled.

(from SMPTE ST 2067-2 Interoperable Master Format — Core Constraints, © SMPTE)

In 1990, the Television Decoder Circuitry Act mandated that Line 21 be reserved for captions and that caption decoders be built into TVs. But it wasn't until 1996 that regulatory forces acknowledged that hard-of-hearing audiences were inconsistently able to consume video/TV/film media and drafted policy mandating that TV broadcasting should always contain captioning. It took until 2010 to mandate that other forms of video—those pertinent to our industry—should be produced with captioning.

Lemieux highlighted that the emergence of global streaming platforms—with new, complex localization requirements—was one of the critical influences on this legislative landscape.

Traditional content windowing meant that an English film would first roll out in English-speaking countries purely with English captioning. Then, if market response was good, the extra production overhead of foreign-language subtitling or dubbing would be added on a territory-by-territory basis as the content was windowed for those countries.

But while accessibility legislation was the key driver for including timed text for much of the last century, global market forces—i.e., streaming—have once again expanded on that. Move forward to the PC/tablet/mobile phone era (too soon to say "post-COVID"?) and today many releases are captioned, ready for launch into typically 10 language territories from the outset. Netflix is reaching approximately 30 localization languages.

While by 1988 only around 200,000 caption decoders had been sold, today captioning technology is delivered scalably and cheaply, sometimes as a web app, with millions of professional subtitlers (as my wife will hopefully soon qualify to be) working on an endless sea of new content.

The Tech Behind Timed Text

So at this point in the journey, it was time to look into the tech. I asked Pierre-Anthony a (purposefully naive) fun question: "Isn't it just some time-stamps and ASCII?" Here's his response:

Yes, on one level it is … but there are now a number of legacy systems that produce timed text formatted in specific ways. We have .srt files, which evolved from DVD, and we have SCC and STL, which evolved from US and European broadcast, respectively. We have formats that are specific to karaoke. We have standards that include positioning information—so timed text can be placed near the speaker when a scene contains a group of speakers. We have standards that enable coloring—again, to further help the viewer separate speakers in the flow of dialog.
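
To make that concrete for readers who have never opened one: an SRT file really is close to "time-stamps and some ASCII"—numbered cues, a start and end time, and the text to display—as in this short, made-up fragment (basic italics tags are a common extension):

```
1
00:00:01,000 --> 00:00:03,500
Recommendation engines are strange things.

2
00:00:04,000 --> 00:00:06,200
<i>[door slams]</i>
```

Everything beyond that—precise positioning, per-speaker coloring, vertical text—is exactly what the richer formats Lemieux describes exist to carry.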

As we look at different languages, there is obviously "directionality"—some scripts read right to left, or even vertically—and there are differing approaches even within individual languages. Japanese, as well as combining vertical and horizontal text, can have optional ruby characters (furigana) added to the base kanji text, which help less experienced readers pick out pronunciations that may vary the meaning. Placing these characters is highly nuanced, but it of course needs to be expressible in timed text formatting, and it often presents very challenging rendering problems.
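
For a sense of what ruby looks like in markup, here is a simplified fragment of the ruby annotation that TTML2 defines and IMSC 1.1 adopts: a base span of kanji paired with its phonetic reading, which the renderer draws as small characters alongside the base text. (This is an illustrative snippet only; a complete document would also declare the tts namespace, timing, and region layout.)

```xml
<span tts:ruby="container">
  <span tts:ruby="base">東京</span>
  <span tts:ruby="text">とうきょう</span>
</span>
```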

Each of these formats has its own production workflow legacy, which means there are many legacy tools widely in use, and the resulting timed-text data is not suited to modern standardization. Many of these diverse formats are found extensively in large content archives and must be translated before the content can be repurposed for modern workflows. As the use of that media changes, so too do the requirements on the conformity and portability of these timed text formats.

This is where ttconv comes in. ttconv is a format converter. We have introduced it to help translate legacy formats to what has been widely adopted as the timed text lingua franca: the W3C-standardized IMSC format.
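
To get a feel for how it works in practice: ttconv is an open-source Python package built around a single internal data model, with a reader and a writer for each supported format on either side of that model. The sketch below follows the usage shown in the project's README at the time of writing—parsing an IMSC document into the model and serializing it back out as SRT—but treat it as illustrative: "captions.ttml" is a stand-in filename, and module paths should be checked against the version you install.

```python
# Illustrative sketch of ttconv's Python API, per the project README;
# verify module paths against the installed version.
import xml.etree.ElementTree as et

import ttconv.imsc.reader as imsc_reader
import ttconv.srt.writer as srt_writer

# Parse the IMSC (TTML) file into an XML tree, then into ttconv's
# format-neutral data model ("captions.ttml" is a hypothetical input).
model = imsc_reader.to_model(et.parse("captions.ttml"))

# Serialize the same model out as an SRT document.
print(srt_writer.from_model(model))
```

The same reader/writer pattern applies in the legacy-to-IMSC direction, and the package also installs a command-line front end (per the README, along the lines of `tt convert -i input.scc -o output.ttml`) for driving conversions in batch pipelines.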

With this explanation of what ttconv is and the problem it addresses in hand, I followed up with David Ronca.

I wanted to know why ttconv solves a problem for Facebook in particular (the source of the original post). "Facebook provides a great platform for watching video content, and subtitles are an important part of this platform, enabling accessibility and localization," Ronca says. "While IMSC 1.1 is the modern format for global subtitles, a significant amount of global video content has subtitles in legacy formats. We felt that the industry needed a modern, high-quality open-source tool to validate and convert these legacy subtitle formats to IMSC 1.1, and we partnered with Pierre and Sandflow to build the new tool."

I asked if he saw a pattern of peer companies adopting IMSC 1.1 widely, and whether this is a "movement" the industry should participate in more broadly. "IMSC1.1 is the only timed text model that supports subtitles for all global languages, including subtitles with complex formatting and rendering like Japanese," Ronca says. "We do believe that going forward, all new subtitle assets should be authored in IMSC1.1."

Finally, I asked both if they had advice for Streaming Media readers seeking to improve timed text and subtitling workflows end to end. 

"My personal advice to everyone in the industry is to move their tools and workflows to IMSC 1.1, and to use IMSC1.1 for all new subtitle assets that are created," Ronca says. "Further, I recommend to integrate TTV into workflows to insure that all incoming and outgoing subtitle assets are conformed to the IMSC1.1 specification. This will maximize subtitle interoperability. Finally, treat subtitles as a first-class citizen in your streaming system. Demand the same quality and QoS for subtitles as you do for audio and video."

"Treat timed text on equal footing as audio and video throughout the chain," adds Lemieux. "This means, for example, timed text being an integral part of distribution masters. Invest in timed text authoring practices and formats that allow the entire world to be reached. This means educating both upstream caption/subtitle providers and downstream platforms by, for example, creating authoring guidelines, samples, conversion and validation tools (like ttconv), etc." 

To conclude: timed text is not simply an "opt-in" for deaf people or those who want to watch foreign movies. One 2016 survey showed that 85% of all video on Facebook is watched with the sound off. So next time you are skimming a news feed, or following a sports channel in a noisy bar thanks to captioning and subtitles, stop for a moment and think about the subtle but important role that timed text plays in our day-to-day lives. It's truly an unsung hero of our industry.
