Accessibility and Localisation: How AI Can Create More Accessible Content for Larger Audiences
With key streaming services such as Disney+, Amazon, and Netflix trying to drive down production costs across the board, premium content providers have spent considerable time looking at how they can develop or license content which isn’t produced in English but can offer global appeal.
This is a rather obvious step, as films and shows in several non-English languages account for a significant portion of content consumption on Netflix. For example, in the second half of 2023, Korean accounted for 9% of total viewing; Spanish, 7%; and Japanese, 5%. Global demand for non-English content grew to 40% in 2023, nearly double the 23% recorded in 2018, according to research from Rise Studios published in February 2024.
Breakout non-English-language hits in English-speaking markets go well beyond Korean, Japanese, and Spanish titles and include Germany’s Dear Child, with 53 million views; Poland’s Forgotten Love, with 43 million views; and India’s The Railway Men, with 11 million views.
In fact, some of Netflix’s biggest hits are in a language other than English. The obvious example is, of course, Squid Game, a Korean production that is still the most-viewed TV series on the service, having racked up more than 2.2 billion hours of viewing since its launch.
Amazon Prime has had similar success with Culpa Mia, a Spanish-language film, and Medellin, a 2023 French film that immediately jumped into the top 10 list of non-English-language films. Culpa Mia became the number-one movie worldwide and featured in the top 10 most-watched movies in more than 190 countries.
While Disney+ doesn’t feature a lot of non-English-language content, it still has significant global reach, with around two-thirds of its subscribers coming from outside North America and the UK. This means it needs to translate content for roughly 100 million subscribers in more than 150 territories.
Achieving Global Reach Through AI Translation and Titling
Given the volume of content being produced, and the possibility that a limited regional release becomes a global breakout hit, effective localisation has always been essential for streaming services seeking the widest possible reach.
Historically, making content accessible in terms of multiple-language translations—including sign language—has proven expensive to do well. But with the growth of AI services, it has become apparent that this technology can extend the reach of content while enabling services to support a wider range of languages at a lower cost.
During the last 2 years, AI services have made significant leaps forward in delivering highly accurate and contextually relevant subtitles not only for on-demand but for live content as well. Many of the available tools can also modify the subtitle delivery based on the device being used, including changing the colour, size, and positioning of the subtitles, to create a better user experience.
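To make the presentation side of this concrete, the sketch below shows one common way device-aware subtitle styling can be expressed, using standard WebVTT cue settings generated from Python. The device profiles, timing values, and cue text are illustrative assumptions rather than any particular vendor’s configuration; colour is typically applied by the player via CSS ::cue rules.

```python
# Illustrative device profiles; real players and services tune these differently.
DEVICE_PROFILES = {
    # Smaller screens: larger, higher-placed cues so captions clear the player controls.
    "mobile": {"line": "80%", "size": "90%", "align": "center"},
    # Living-room screens: conventional bottom-third placement with a narrower cue box.
    "tv": {"line": "90%", "size": "70%", "align": "center"},
}

def build_vtt_cue(start: str, end: str, text: str, device: str) -> str:
    """Return a single WebVTT cue whose position and size follow the device profile."""
    p = DEVICE_PROFILES[device]
    settings = f"line:{p['line']} size:{p['size']} align:{p['align']}"
    return f"{start} --> {end} {settings}\n{text}\n"

if __name__ == "__main__":
    # <c.yellow> marks a cue class; the actual colour comes from a CSS ::cue(.yellow) rule.
    cue = build_vtt_cue("00:00:01.000", "00:00:04.000",
                        "<c.yellow>Welcome back to the programme.</c>", "mobile")
    print("WEBVTT\n\n" + cue)
```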
Several Big Tech companies have shown prototypes or released new versions of their existing platforms demonstrating new developments in this space during the last 12–18 months. For example, Meta showcased its Seamless M4T Universal Speech Translator, a prototype for real-time translation and voice cloning across multiple languages. In dynamic live conversations, the AI seamlessly translated each speaker’s words into other languages while simultaneously replicating voice style and achieving lip-syncing.
Meta’s Seamless M4T Universal Speech Translator provides dynamic, real-time text and speech translation and even translates code-switching speech that incorporates multiple languages into the same conversation.
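Meta has also released SeamlessM4T checkpoints openly, so the core speech-to-speech translation step can be sketched with the Hugging Face transformers library. This is a minimal illustration of the model family behind the demo rather than the demo itself; the input file name and the language pair (English to Spanish) are assumptions.

```python
# Minimal speech-to-speech translation sketch using the openly released
# SeamlessM4T v2 checkpoint. File names and language codes are illustrative.
import torch
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

# Load a short English clip and resample it to the 16 kHz the model expects.
waveform, sample_rate = torchaudio.load("english_clip.wav")  # hypothetical file
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(audios=waveform.squeeze(0).numpy(),
                   sampling_rate=16_000, return_tensors="pt")

# Generate Spanish speech directly from the English audio.
with torch.no_grad():
    audio_out = model.generate(**inputs, tgt_lang="spa")[0].cpu().numpy().squeeze()

torchaudio.save("spanish_clip.wav",
                torch.from_numpy(audio_out).unsqueeze(0), 16_000)
```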
In February 2023, Google introduced a new feature to Translate called Local Context. As its name suggests, it considers the local context of phrases and expressions, resulting in translations that adapt to regional dialects, slang, and cultural references.
In a similar vein, Microsoft Translator rolled out its Custom Speech Models platform, allowing users to train AI models with their own data, encompassing domain-specific terminology and industry jargon.
Microsoft Translator is a key component of the new paid Copilot+ service and hardware. The hardware, in the form of Surface tablets and laptops, includes a Copilot+ feature called Live Captions, which creates live English captions on-the-fly from any content with an audio track, with support for 40 source languages at launch.
These major developments demonstrate the general direction of travel in the AI subtitling industry, with a range of suppliers implementing similar context models to improve accuracy around local idioms and content-specific jargon. However, with these large tech firms developing their own processes and models, which will likely be integrated directly into their existing products, this trend may threaten specialist companies in the captioning, subtitling, and AI markets, such as Ai-Media, Nuvo, and others.
Live AI-powered subtitling with Ai-Media LEXI 3.0
Of course, how widely these solutions are adopted—and the ability of smaller providers to compete—will be largely determined by the accuracy and quality of the subtitles; without these, content services become something of a damp squib. This suggests a new area of innovation is underway: built-in offerings from large tech companies may dramatically reduce the need for services to translate their video content directly, since the viewing platform will do it on-the-fly, at least in the realm of PCs and mobile devices.
The question for the content platforms, in a world where device-created subtitles are of high quality, is whether creating custom subtitles and captions with greater context makes economic sense. In the near term, we may see the content platforms increasingly rely on device-based subtitles as they become more prevalent, while still producing platform-based subtitles to support dumber devices such as TVs and set-top boxes.
Real-Time Subtitling and Translation
The heightened speed and performance of these models mean that real-time subtitles can be created with almost zero lag for live content. This is a key development for both Amazon and Netflix, as they are now creating and licensing more live content than ever before. During the last 12 months, we’ve seen a significant increase in the number of services that offer real-time subtitling and captioning, such as Microsoft Azure Speech Services’ live captioning. These services are beginning to deliver real-time subtitles for live content with a very high degree of accuracy, so we are reaching the point where it doesn’t matter whether content is live or on demand: everything can have multi-language subtitles.
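To give a rough sense of how this looks in practice, the sketch below uses the Azure Speech SDK’s translation recognizer to turn a live microphone feed into rolling captions in several target languages. The key, region, target languages, and 30-second window are placeholder assumptions, not recommended production settings.

```python
# Live multi-language captioning sketch with the Azure Speech SDK
# (pip install azure-cognitiveservices-speech). Key, region, and languages are placeholders.
import time
import azure.cognitiveservices.speech as speechsdk

translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="YOUR_SPEECH_KEY", region="YOUR_REGION")
translation_config.speech_recognition_language = "en-US"
for lang in ("es", "ko", "ja"):
    translation_config.add_target_language(lang)

# Capture audio from the default microphone; a file or stream works the same way.
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config, audio_config=audio_config)

def on_recognizing(evt):
    # Partial hypotheses arrive continuously, which is what keeps caption lag low.
    print("partial:", evt.result.translations.get("es", ""))

def on_recognized(evt):
    # Final results can be written out as timed caption cues per target language.
    for lang, text in evt.result.translations.items():
        print(f"final [{lang}]: {text}")

recognizer.recognizing.connect(on_recognizing)
recognizer.recognized.connect(on_recognized)

recognizer.start_continuous_recognition()
time.sleep(30)  # caption a 30-second window for this sketch
recognizer.stop_continuous_recognition()
```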
We should also remember that subtitling makes all content more accessible to those with hearing loss, a potentially huge part of the audience in the UK alone, where more than 11 million people have moderate-to-severe hearing loss. It also makes content accessible to those who want captions for other reasons, such as viewers who don’t want sound on in the office or who are watching on the bus and can’t find their headphones.
Other Ways AI Will Revolutionise Multi-Language Content Distribution
Subtitling is potentially only the first step in how AI will revolutionise the distribution of content in multiple languages.
Streaming platforms have also been using AI to expand their audio description tools since 2022. Netflix’s All the Light We Cannot See limited series demonstrated some of the most detailed audio description of any Netflix content. The in-house AI can analyse on-screen video elements, including locations, the actors present, and their actions, and create the audio description while understanding how these various components interact on screen.
AI-powered audio descriptions from the Netflix All the Light We Cannot See trailer
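Netflix’s system is proprietary, but the general idea of describing what is on screen can be illustrated with an open image-captioning model applied to sampled frames. The sketch below is a generic illustration under assumed inputs (video path, sampling interval, model choice), not Netflix’s pipeline; a production system would go much further, tracking characters and actions across shots and fitting descriptions into gaps in the dialogue.

```python
# Generic frame-description sketch: sample a frame every few seconds and caption it
# with an open image-captioning model. All inputs below are illustrative assumptions.
import cv2  # pip install opencv-python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def describe_frames(video_path: str, every_n_seconds: int = 5):
    """Yield (timestamp, description) for one sampled frame every few seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25
    step = int(fps * every_n_seconds)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(images=image, return_tensors="pt")
            out = model.generate(**inputs, max_new_tokens=30)
            yield index / fps, processor.decode(out[0], skip_special_tokens=True)
        index += 1
    cap.release()

for timestamp, caption in describe_frames("trailer.mp4"):  # hypothetical file
    print(f"{timestamp:7.1f}s  {caption}")
```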
Beyond these captioning, subtitling, and description services, the next big innovation will be in dubbing and automated lip syncing of content into new languages.
The first step will be the continued development of live audio translation. Some services are already being used extensively in non-video environments. One such example is Spotify’s Voice Translation, which translates podcasts into additional languages while simulating the audio replacement in the voice of the original host. Voice Translation is one of several significant developments in this area within the last year.
Lip-sync technology has also made great leaps in the last 12 months, with the enhanced ability to produce a convincing lip sync for new-language audio dubs as well as to simulate the original actor’s voice in different languages, in a similar manner to the Spotify service. These synthetic voices are increasingly capable of reflecting nuances of both emotion and idiom across a range of languages.
There are still significant improvements to be made when it comes to accurately replicating an actor’s voice and style in multiple languages, as the AI technology tends to slightly normalise voices by removing strong regional accents. There is also the issue of emotional resonance: AIs cannot always give an “authentic” performance, which is why the tech has not yet seen wider adoption.
Finally, regional and cultural factors must be considered to ensure that the content and performance resonate with the audience. As we have seen with a lot of the other AI services, it will probably be simply a matter of time before many of these problems are overcome.
Legal questions persist regarding the appropriation of existing actors’ voices for multi-language dubbing, including using these voices as sources to train AI models. Resolving these rights issues is key to the continued development of the tech, as can be seen with Scarlett Johansson’s claim that OpenAI synthesised a voice similar to hers for a new release of ChatGPT after she declined its request to use her voice as a model. OpenAI denied that it was meant to be a version of her voice but did remove the option to use it.
To ensure AI technology’s continuing development and application, a legal framework and payment standards will need to be created to secure the rights to use actors’ voices in multiple languages, as well as to train AI models on them. For example, if an AI voice simulation automatically translates an actor’s lines into a new language and creates something culturally insensitive, how is the actor protected from damage to their reputation?
Licensing artists’ vocal and visual likenesses for use in AI-generated content may prove particularly challenging for content localisation efforts, as regulatory frameworks may differ across national borders and rights agreements may not extend internationally.
AI, Accessibility, and Sign Language
The advances in multi-language dubbing and subtitling have been impressive so far in increasing the accessibility of content both to audiences in different countries and to the hearing-impaired. Now, AI may offer a real opportunity to massively expand the accessibility of content through sign language interpretation.
Sign language interpretation has historically been difficult to add to a large amount of content due to its reliance on human interpreters who are willing to appear on camera within content. Added to this is the fact that sign language differs for almost every language in the world, outside of the more common American and British sign languages. These differences have created a bottleneck effect in which the lack of interpreters and their associated cost have limited the amount of content that can be interpreted.
Several companies, including Signapse AI, Kara Technologies, and SIMAX, have begun to offer virtual avatars which can be fed either audio or text and will interpret in a number of sign languages. These avatars range from expressive animated characters to photorealistic interpreters.
Kara Technologies’ AI-generated virtual sign language interpreters
Although many of these solutions continue to show real progress, by and large, they are not yet ready for widespread use and are effective in only a limited number of use cases. The key areas for improvement are the overall vocabulary the avatars can interpret and the fact that an interpreter uses not only their hands but also whole-body movement and facial expression to carry the meaning of words. Sign language interpretation is a highly complex task which, at its core, requires a very human understanding to produce accurate and easily understood interpretations.
However, having seen the rapid increase in accuracy of AI subtitling in the last year, it is clear that AI sign language interpretation is a promising area of development. But it will require much more work, resources, and training to make it a viable option for the future. It is definitely a space worth keeping a close eye on, given that an estimated 1% of the world’s population uses sign language.
Virtual Speakers and Actors
A final area to look at for AI in content localisation, one at an even earlier stage than models for sign language interpretation, is the world of the virtual presenter or actor.
3D capture of actors and the rendering of virtual doubles for VFX work have been with us for a reasonably long time. However, several virtual speaker services have recently launched that allow you to capture a person via a webcam and then render them delivering a speech to camera, using either audio they have recorded or text read by an AI simulation of their voice. This opens up a whole new world of talent replacement for content.
When it comes to localisation, rather than simply dubbing an existing actor into a new language with lip sync, you could, in theory, replace one actor’s entire performance with an actor who is more popular/more recognisable in that region. Imagine, for example, Chris Pratt appearing in Guardians of the Galaxy 3 in English-language markets, but in India, Star-Lord being played by a photorealistic avatar of Prabhas.
The potential for this kind of technology as it becomes more accurate and cost-effective is tremendous, since it creates the possibility of customising a platform’s content not only to a region but even to a specific group of users. For example, imagine fans of Tom Cruise watching him as Iron Man.
The widespread deployment of this technology is still a fair distance away, but the idea of an actor being digitised once and then used many times in different productions has massive implications. It was at the core of the recent SAG-AFTRA strike. Again, the determining factor will be not only the technology reaching maturity, but also the development of robust agreements between actors and content creators to ensure there is a fair deal for all.
The rapid evolution of AI has opened multiple opportunities for content owners and creators to cost-effectively make their content accessible across more territories and to more audiences than ever before. This accessibility will only continue to expand.
The key elements for its growth will come down to how quickly the technology develops, the willingness of the platforms to support different requirements for different audiences, and whether they can begin to develop legal frameworks which support the usage of the tech without disenfranchising talent.
The potential for cost-effective, highly customisable accessible content for every user’s language and region is huge. However, there are legal and ethical challenges to overcome to embed these benefits into the platforms long term. In an ever-evolving technological landscape, the real issue may be the ability of creators, legislators, platform owners, and talent to keep up with the speed of change.