For the production of an average-length theatrical or TV release, filmmakers typically shoot hundreds of hours of audio-visual material and spend months editing it. Shooting ratios (raw footage to finished film) continue to rise thanks to digital recording tools, yet professional video editors and editing assistants remain scarce.
Currently, the leading tools and services that assist film and TV editors (ScriptSync, Descript) focus mostly on transcribing what is being said and matching this dialogue with the corresponding video clip. In other words, they index media by listening to the spoken content of the footage and making it searchable, an approach known as phonetic indexing.
While this is a good first approach, it overlooks crucial information available in the filmed material, namely everything that can be seen: the type of framing, the actions taking place, and the visual characteristics of the environment. It addresses only one of the many components that make up a full audiovisual story.
The prominent Computer Vision (CV) services from GCP, Azure and AWS are all generic-tag-based: they only look for generic objects, not the specialised image features described above. As a result, most of the tags they produce are of little use in a media production or archive context.
Within our proposed AI4Media project we will create a Natural Language Media Indexing Engine (NLMIE) that can analyse images and their relation to natural-language text, along with an Application Programming Interface (API) that allows easy integration into the pipelines of content management systems (CMS), media production studios, and film and TV archives.
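To illustrate the kind of image-to-language analysis this involves, the minimal sketch below scores a single video frame against free-text descriptions of framing, action and environment using an off-the-shelf CLIP-style joint image-text embedding model (the openai/clip-vit-base-patch32 checkpoint via Hugging Face Transformers). The model choice, file name and queries are illustrative assumptions, not the NLMIE itself.

```python
# A minimal sketch of natural-language frame scoring, assuming a CLIP-style
# joint image-text embedding model. Model, frame path and queries are
# hypothetical examples, not the actual NLMIE pipeline.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical frame from a dailies folder, and film-specific queries
# (framing, action, environment) rather than generic object tags.
frame = Image.open("shot_0042_frame_0120.jpg")
queries = [
    "a close-up of an actor's face",
    "a wide establishing shot of a city street at night",
    "two characters arguing in a kitchen",
]

inputs = processor(text=queries, images=frame, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity; softmax turns it into a
# probability distribution over the candidate descriptions for this frame.
probs = outputs.logits_per_image.softmax(dim=1)
for query, prob in zip(queries, probs[0].tolist()):
    print(f"{prob:.2f}  {query}")
```

Run over every frame (or shot) of the footage, scores like these can be stored in an index so that editors and archivists can search raw material with free-text queries instead of fixed object tags.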
The team:
Piotr Winiewicz
Esbern Kaspersen
Mads Damso
Sofie Lykke Stenstrop