
The State of Video and AI 2018
The machines aren't taking over; they're just helping video publishers achieve their goals more efficiently and effectively.

Video AI (artificial intelligence) has the capacity to solve a number of time-consuming, video-related problems through automation. But that doesn’t mean it works by magic, free of human oversight. To offer a sense of where video AI is in early 2018, what follows are real-life examples in which AI is helping to add structure to the unstructured world of video.

First, what is AI? The difference between an AI system and a standard algorithm is that an AI system uses data to learn, predict, and alter an outcome based on what it has learned. It all starts with access to a large source of data, whether an archive of images for image recognition or QoS playback records. A software algorithm is designed to solve a particular problem, and training data is used to enable the software to learn to recognise a match.

The Basics

In video, AI can solve a number of problems, including figuring out where specific content is, automatically generating clips, making sure both content and advertisements align more closely with viewer interests, and ensuring that video playback is optimised.

To enable video image recognition, content is scanned to create an enhanced metadata file. The contents of the metadata can vary, from image and text recognition that can identify all imagery within a clip to creating a full-text audio transcription that uses natural-language processing to identify all audio contents of a video clip. This is generally output as a JavaScript Object Notation (JSON) file that includes timecode-based metadata that can be used to search against or organise content in any supported online video platform, content management system, or media asset management system.
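As a rough sketch, a timecode-based JSON metadata file of this kind might look like the following, and searching it is a simple filter. The field names here are illustrative, not any particular vendor’s schema:

```python
import json

# Hypothetical metadata file: field names are illustrative only,
# not a specific vendor's actual schema.
metadata_json = """
{
  "segments": [
    {"start": "00:00:05:00", "end": "00:00:09:12",
     "labels": ["press conference", "podium"],
     "transcript": "thank you all for coming today"},
    {"start": "00:00:09:13", "end": "00:00:14:02",
     "labels": ["crowd", "applause"],
     "transcript": ""}
  ]
}
"""

def find_segments(metadata, term):
    """Return (start, end) timecodes of segments whose labels or
    transcript contain the search term."""
    hits = []
    for seg in metadata["segments"]:
        if term in seg["labels"] or term in seg["transcript"]:
            hits.append((seg["start"], seg["end"]))
    return hits

metadata = json.loads(metadata_json)
print(find_segments(metadata, "applause"))  # -> [('00:00:09:13', '00:00:14:02')]
```

An online video platform or media asset management system would run queries like this against the stored metadata rather than re-scanning the video itself.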

But AI needs human guidance. For example, a “confidence level” is applied to image assessment, and anything that falls below a certain score is sent to a human team for evaluation. “Accuracy is a function of many factors including, (but not limited to), resolution of the video (i.e. HD vs. SD), the noise level in the video (e.g. a lot of background noise can affect the accuracy of transcription), and the amount of motion in the video (fast-moving videos are tougher if the resolution is low),” says Milan Gada, principal program manager, video AI cloud services, Microsoft. With the right combination of factors, he says, it’s possible to achieve accuracy in the high 90 percent range.

Cloud scalability also plays a role in AI. Users are now able to access speed and processing power that wasn’t available a few years ago. “If you have 5,000 videos, each 1 hour long, and if it takes 1 hour to process each video, you could use 5,000 machines and have all your videos processed in an hour,” says Gada. Microsoft has two main AI products—Video Indexer and Azure Media Analytics. (Microsoft provided more details on its AI solutions at Streaming Media West 2017.)
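Gada’s arithmetic is straightforward parallelism, since each video can be processed independently. A back-of-the-envelope sketch:

```python
import math

def wall_clock_hours(num_videos, hours_per_video, num_machines):
    """Wall-clock time when videos are processed independently:
    per-video time times the number of batches the fleet must run."""
    return hours_per_video * math.ceil(num_videos / num_machines)

print(wall_clock_hours(5000, 1, 5000))  # -> 1 (one machine per video)
print(wall_clock_hours(5000, 1, 500))   # -> 10 (a tenth of the fleet)
```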

One of the first uses for video AI was to generate transcripts and keywords. Microsoft’s Video Indexer (shown here) can achieve accuracy in the high 90 percent range, according to Milan Gada, principal program manager, video AI for the company. 

Use Case: Entertainment Processing and Highlights

IBM Watson has more than 60 AI services, and at IBC this past autumn, the company showcased the work it did for the U.S. Open using its first video-specific product, Watson Video Enrichment, which includes services for scene detection, speech-to-text conversion, natural-language processing, visual recognition, tone analysis, and personality insight.

“The USTA (US Tennis Assoc.) provided us with several hundred tennis videos,” says David Kulczar, senior product manager, Watson Video Analytics, IBM Watson Media. This content was used to train Watson in tennis-specific intelligence, including player names, game scoring, sports terminology, and crowd sentiment analysis—understanding excitement levels, including crowd cheering, gasps, even player facial and physical expressions like fist pumps.

Watson does a full-text transcription of all audio within a piece of content, plus all image and text information, to create a detailed metadata file, complete with timecode for each piece of content. Watson also understands broader concepts like knowing tennis and basketball are sports. “We did the full training process in about a month and that was taken from 80 percent accuracy to 95 percent+ accuracy,” says Kulczar.

“A lot of the time, people tend to over-believe in [AI],” says Kulczar. “Some people think of artificial intelligence as sort of magic, but it’s not. A machine-learning-based principle is going to make mistakes. It’s a much more complex version of what you do with Pandora. Somebody is actually thumbing up or thumbing down a video to help the system get better and learn your preferences.”

During the U.S. Open, IBM wanted to automatically create clips based on the most compelling content. Watson works at about three-quarters of real time for full content assessment with normally complex images, and can work with content as low as 256Kbps, although IBM recommends 1Mbps. “If you take the images out of it and you’re just doing audio and textual it’s amazingly fast,” Kulczar says.

There were 320 hours of play coverage, and IBM’s custom solution immediately created clips at the end of each match and pushed these highlights out to social sites to drive more interest in the tournament.

IBM has an off-the-shelf product too, and this year the company will be coming out with knowledge kits for specific industries and live-streaming processing. “We provide a corpus of knowledge, a body of knowledge out-of-the-box,” Kulczar says. “Generally, if you want to increase the accuracy of the service, you want to train on a specific domain.” After the detailed metadata content has been created, it can be searched for specific instances of particular events or content. Users will also be able to do custom training for business-specific information. For example, an athletic company could train Watson to identify its specific brands or products, like running shoes. [Editor's Note 5/17/18: IBM has since come out with Watson Data Kits, which are not designed specifically for live-streaming processing.]

Use Case: Enterprise Content Management

While a tennis tournament has some fairly specific activity, Axon (formerly Taser) received a similar request for creating clips from subject matter that was more of a moving target. Axon provides public safety technology and equipment, including cameras that many police forces use to record video of daily interactions with the public. Axon was selected to be the official AI partner of the Los Angeles Police Department. “The LAPD has accumulated roughly 33 years’ worth of video data in the past year alone,” says Daniel Ladvocat Cintra, senior product manager at Axon AI Research. The department’s challenge is how to find anything valuable within video footage that is the equivalent of one single camera streaming 24/7 since 1985.

Axon needed to create a post-production tool to accelerate footage review. Axon staffers are training its AI to understand what kind of incident was recorded, so the system can distinguish between, say, a pursuit and a pedestrian stop. “20 minutes of footage might only contain 30 seconds’ worth of information that’s pertinent,” says Cintra. The company worked with the LAPD to find which parts of an incident contain valuable information, like a specific object or activity, to help build a 30-second clip of the most relevant content within a 20-minute-long video.

Another issue police agencies everywhere face is upholding privacy laws when providing video footage for a court case or a Freedom of Information request. The company heard from customers that it can take up to 8 hours to redact 1 hour of video footage, because an officer needs to go through the content and blur personally identifiable information.

Axon is using AI to help remove recognisable images from video content automatically in the post-production process. “Right now we blur skin (including tattoos) and faces,” says Cintra. Axon is able to detect where objects are on the screen and then have an officer OK the redaction. The content can then be released for use.

The police agencies Axon serves upload their video content to a secure cloud-based environment, Evidence.com, which runs on the Microsoft Azure platform. To date, 14.9 petabytes of data have been uploaded to Evidence.com, with 11.5 petabytes currently active.

The Los Angeles Police Department collects a tremendous amount of video and is using AI to determine what footage contains valuable data. 

Use Case: Asset Management

While several of the previous examples involve custom development, the AI in Cantemo’s media asset management product Iconik is available to any customer of the tool. The company’s hybrid cloud product stores high-resolution assets on customers’ own infrastructure and works as an aggregation layer in the cloud, providing a holistic view of all assets an organisation owns. “The most common way we are using AI today is to recognise assets. Many companies have thousands and thousands of hours of content and traditionally they have been using a manual workflow to tag each scene or each frame to describe what it is,” says Parham Azimi, CEO and co-founder at Cantemo. Cantemo has built an AI framework that integrates with existing machine learning systems to identify content.

Cantemo shares the proxy version of the video content with the machine learning system; the first framework the company integrated with is the Google Cloud Vision API for image analytics. What comes back are tags associated with timecodes, along with a confidence level describing the content based on image recognition. It looks something like this:

Start timecode: 00:00:12:10

End timecode: 00:00:16:15

Tags: Spacecraft (75 percent), Outer space (89 percent), Space station (72 percent), Cat (20 percent)

The timecodes define when a sequence starts and ends. The tags describe what is shown in the sequence, along with a confidence level for how correct each tag is likely to be. So the above says that the sequence is most probably a spacecraft in outer space, and probably not a cat, even though there might have been something that is similar to a cat.
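Assuming a result structured like the example above (the exact field names are illustrative, not Cantemo’s actual output format), dropping the low-confidence guesses is a simple filter-and-sort:

```python
# Hypothetical representation of the tagged sequence shown above.
sequence = {
    "start": "00:00:12:10",
    "end": "00:00:16:15",
    "tags": {"Spacecraft": 0.75, "Outer space": 0.89,
             "Space station": 0.72, "Cat": 0.20},
}

def plausible_tags(seq, threshold=0.5):
    """Return tags sorted by descending confidence, dropping
    low-confidence guesses such as the cat above."""
    return sorted((t for t, c in seq["tags"].items() if c >= threshold),
                  key=lambda t: -seq["tags"][t])

print(plausible_tags(sequence))  # -> ['Outer space', 'Spacecraft', 'Space station']
```

The 0.5 threshold is arbitrary; in practice each organisation would tune it to trade precision against recall for its own library.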

AI can be used in asset management to help identify video content and apply relevant tags and timecode. 
