Scientific Papers

Find all the scientific publications produced by AI4Media partners, presenting the latest scientific findings of the project.

Filters

04/12/2024

An annotated video dataset for computing video memorability

access here

Author(s)

Alan F. Smeaton Alba Seco de Herrera; Bogdan Ionescu; Claire-Hélène Demarty; Faiyaz Doctor Graham Healy; Lorin Sweeney; Mihai Gabriel Constantin Rukiye Savran Kiziltepe;

Institution

Dublin City University; InterDigital Paris University of Essex; University Politehnica of Bucharest

Abstract

Using a collection of publicly available links to short form video clips of an average of 6 seconds duration each, 1,275 users manually annotated each video multiple times to indicate both longterm and short-term memorability of the videos. The annotations were gathered as part of an online memory game and measured a participant’s ability to recall having seen the video previously when shown a collection of videos. The recognition tasks were performed on videos seen within the previous few minutes for short-term memorability and within the previous 24 to 72 hours for long-term memorability. Data includes the reaction times for each recognition of each video. Associated with each video are text descriptions (captions) as well as a collection of image-level features applied to 3 frames extracted from each video (start, middle and end). Video-level features are also provided. The dataset was used in the Video Memorability task as part of the MediaEval benchmark in 2020.

Access

Open Access

Type of Publication

Journal article

Publisher

Data in Brief

access here

01/11/2024

Towards Wine Tasting Activity Recognition for a Digital Sommelier

access here

Author(s)

Daniel Gatica-Perez; Mario Parra

Institution

Idiap Research Institute

Abstract

In this study, we evaluated the feasibility of using zero-shot classification models for activity recognition in a Digital Sommelier. Our experiment involved preprocessing video data by extracting frames and categorization user activities related to a wine-tasting scenario. Image classification models demonstrated high accuracy, nearing 90%, in distinguishing between “engaged” and ” disengaged” stated. however, video classification models presented a lower performance in classifying user activities such as “observing wine”, “smelling wine” and “snipping wine”, with an average accuracy of around 50% due to the interdependent nature of activities. Despite these challenges, our findings highlight the potential of zero-shot classification models in enhancing virtual assistants’ ability to recognize and respond to user activities.

Access

Open Access

Type of Publication

Publication

Publisher

N/A

access here

18/10/2024

Disinformation Echo Chambers on Facebook

access here

Author(s)

Mathias-Felipe de Lima-Santos Wilson Ceron

Institution

Universidade Federal de São Paulo; University of Amsterdam;

Abstract

The information landscape has undergone significant transformations with the widespread adoption of the internet and online social networks. This has led to both positive and negative consequences. On the positive side, information can now spread quickly and reach a vast audience. Social media platforms have played a crucial role in fostering a culture of participation by motivating people to actively create and share content. However, there were also drawbacks. Social media platforms employ algorithms that restrict the diversity of content users are exposed to, leading to the reinforcement of pre-existing beliefs, commonly referred to as “echo chambers”

Access

Open Access

Type of Publication

Book section

Publisher

Mapping Lies in the Global Media Spherre - Routlege

access here

16/10/2024

Planckian Jitter: Countering the Color- Crippling Effects of Color Jitter on Self- Supervised Training

access here

Author(s)

Alex Gomez-Villa Bartłomiej Twardowski Joost van de Weijer Marco Buzzelli Simone Zini

Institution

Autonomous University of Barcelona University of Florence; University of Milano- Bicocca

Abstract

Several recent works on self-supervised learning are trained by mapping different augmentations of the same image to the same feature representation. The data augmentations used are of crucial importance to the quality of learned feature representations. In this paper, we analyze how the color jitter traditionally used in data augmentation negatively impacts the quality of the color features in learned feature representations. To address this problem, we propose a more realistic, physics-based color data augmentation – which we call Planckian Jitter – that creates realistic variations in chromaticity and produces a model robust to illumination changes that can be commonly observed in real life, while maintaining the ability to discriminate image content based on color information. Experiments confirm that such a representation is complementary to the representations learned with the currently-used color jitter augmentation and that a simple concatenation leads to significant performance gains on a wide range of downstream datasets.
In addition, we present a color sensitivity analysis that documents the impact of different training methods on model neurons and shows that the performance of the learned features is robust with respect to illuminant variations. Official code available at: https://github.com/TheZino/PlanckianJitter

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

10/10/2024

How reliable are posterior class probabilities in automatic music classification?

access here

Author(s)

Hanna Lukashevich Jakob Abeßer Joachim Bös Sascha Grollmisch Sebastian Stober

Institution

Fraunhofer IDMT; Otto-von-Guericke University Magdeburg

Abstract

Music classification algorithms use signal processing and machine learning approaches to extract and enrich metadata for audio recordings in music archives. Common tasks include music genre classification, where each song is assigned a single label (such as Rock, Pop, or Jazz), and musical instrument classification. Since music metadata can be ambiguous, classification algorithms cannot always achieve fully accurate predictions. Therefore, our focus extends beyond the correctly estimated class labels to include realistic confidence values for each potential genre or instrument label. In practice, many state-of-the-art classification algorithms based on deep neural networks exhibit overconfident predictions, complicating the interpretation of the final output values. In this work, we examine whether the issue of overconfident predictions and, consequently, non-representative confidence values is also relevant to music genre classification and musical instrument classification.
Moreover, we describe techniques to mitigate this behavior and assess the impact of deep ensembles and temperature scaling in generating more realistic confidence outputs, which can be directly employed in real-world music tagging applications.

Access

Open Access

Type of Publication

Conference paper Publication

Publisher

Audio Mostly Conference

access here

09/10/2024

Rhythmic and Psycholinguistic Features for Authorship Tasks in the Spanish Parliament: Evaluation and Analysis

access here

Author(s)

Alejandro Moreo; Berta Chulvi Paolo Rosso Silvia Corbara

Institution

ISTI-CNR; Scuola Normale Superiore Universitat Politècnica de Valènvia

Abstract

Among the many tasks of the authorship field, Authorship Identification aims at uncovering the author of a document, while Author Profiling focuses on the analysis of personal characteristics of the author(s), such as gender, age, etc. Methods devised for such tasks typically focus on the style of the writing, and are expected not to make inferences grounded on the topics that certain authors tend to write about. In this paper, we present a series of experiments evaluating the use of topic- agnostic feature sets for Authorship Identification and Author Profiling tasks in Spanish political language. In particular, we propose to employ features based on rhythmic and psycholinguistic patterns, obtained via different approaches of text masking that we use to actively mask the underlying topic. We feed these feature sets to a SVM learner, and show that they lead to results that are comparable to those obtained by a BETO transformer, when the latter is trained on the original text, i.e., potentially learning from topical information. Moreover, we further investigate the results for the different authors, showing that variations in performance are partially explainable.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

08/10/2024

FLODCAST: Flow and depth forecasting via multimodal recurrent architectures

access here

Author(s)

Alberto Del Bimbo; Andrea Ciamarra Federico Becattini; Lorenzo Seidenari;

Institution

University of Florence;

Abstract

Forecasting motion and spatial positions of objects is of fundamental importance, especially in safety-critical settings such as autonomous driving. In this work, we address the issue by forecasting two different modalities that carry complementary information, namely optical flow and depth. To this end we propose FLODCAST a flow and depth forecasting model that leverages a multitask recurrent architecture, trained to jointly forecast both modalities at once. We stress the importance of training using flows and depth maps together, demonstrating that both tasks improve when the model is informed of the other modality. We train the proposed model to also perform predictions for several timesteps in the future. This provides better supervision and leads to more precise predictions, retaining the capability of the model to yield outputs autoregressively for any future time horizon. We test our model on the challenging Cityscapes dataset, obtaining state of the art results for both flow and depth forecasting. Thanks to the high quality of the generated flows, we also report benefits on the downstream task of segmentation forecasting, injecting our predictions in a flow-based mask-warping framework.

Access

Open Access

Type of Publication

Journal article

Publisher

Pattern Recognition Letters

access here

22/09/2024

InDistill: Information flow-preserving knowledge distillation for model compression

access here

Author(s)

Christos Koutlis; Giorgios Kordopatis-Zilos; Ioannis Kompatsiaris; Ioannis Sarridis Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas

Abstract

In this paper, we introduce InDistill, a method that serves as a warmup stage for enhancing Knowledge Distillation (KD) effectiveness. InDistill focuses on transferring critical information flow paths from a heavyweight teacher to a lightweight student. This is achieved through a curriculum learning-based training scheme that considers the distillation difficulty of each layer and the critical learning periods when the information flow paths are established. This procedure can lead to a student model that is better prepared to learn from the teacher. To ensure the applicability of InDistill across a wide range of teacher-student pairs, we also incorporate a pruning operation when there is a discrepancy in the width of the teacher and student layers. This pruning operation reduces the width of the teacher’s intermediate layers to match those of the student, allowing direct distillation without the need for an encoding stage. The proposed method is extensively evaluated using various pairs of teacher-student architectures on CIFAR-10, CIFAR-100, and ImageNet datasets showcasing that preserving the information flow paths consistently increases the performance of the baseline KD approaches on both classification and retrieval settings.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

28/08/2024

Domain Expertise Assessment for Multi-DNN Agent Systems

access here

Author(s)

Christos Papaioannidis Ioanna Valsamara; Ioannis Pitas;

Institution

Aristotle University of Thessaloniki;

Abstract

Recently, multi-agent systems that facilitate knowledge sharing among Deep Neural Network (DNN) agents, have gained increasing attention. This paper explores the dynamics of multi-agent systems that support Teacher-Student DNN interactions, where knowledge is distilled from Teachers to Students. Within such systems, selecting the most compatible Teacher for a given task is far from trivial and can lead to low-quality decisions. Hence, the need arises for accurate domain knowledge evaluation. In that context, we propose including an OOD detection module in each DNN agent to enable effective agent expertise evaluation and precise identification of suitable Teachers. This setup allows Student agents to distill knowledge from the most knowledgeable Teachers within a specific domain, ensuring optimal system performance. To effectively utilize OOD detection in this context, we address key challenges such as determining the minimum data cardinality required to ensure optimal performance and reliable inferences of the OOD detectors.

Access

Open Access

Type of Publication

Paper Publication Research article

Publisher

N/A

access here

28/08/2024

Efficient Data Utilization in Deep Neural Networks for Enhanced Inference Reliability

access here

Author(s)

Christos Papaioannidis Ioanna Valsamara; Ioannis Pitas;

Institution

Aristotle University of Thessaloniki;

Abstract

In today’s data-driven world, the exponential growth of data across various sectors presents unique opportunities and challenges. In this paper, we propose a novel method tailored to enhance the efficiency of Deep Neural Networks (DNNs) in managing these vast data amounts. The primary challenge addressed is the ability of DNNs to provide inferences on the minimal amount of data without sacrificing their quality, a significant concern given the vast scales involved in big data analytics. Our approach emphasizes DNN inference efficiency and reliability, enabling DNNs to deliver accurate inferences while substantially reducing computational complexity. This study explores the increasingly attractive deployment of DNNs for complex tasks, focusing on determining the minimal amount of data necessary to ensure optimal network performance and reliable inference outputs, improving the applicability of DNNs across various big data environments.

Access

Open Access

Type of Publication

Paper Publication Research article

Publisher

N/A

access here

28/08/2024

Proof of Quality Inference (PoQI): An AI Consensus Protocol for Decentralized DNN Inference Frameworks

access here

Author(s)

Dimitrios Papaioannou; Ioannis Pitas; Vasileios Mygdalis

Institution

Aristotle University of Thessaloniki; University of Antwerp

Abstract

In the realm of machine learning systems, achieving consensus among networking nodes is a fundamental yet challenging task. This paper presents Proof of Quality Inference (PoQI), a novel consensus protocol designed to integrate deep learning inference under the basic format of the Practical Byzantine Fault Tolerant (P-BFT) algorithm. PoQI is applied to Deep Neural Networks (DNNs) to infer the quality and authenticity of produced estimations by evaluating the trustworthiness of the DNN node’s decisions. In this manner, PoQI enables DNN inference nodes to reach a consensus on a common DNN inference history in a fully decentralized fashion, rather than relying on a centralized inference decision-making process. Through PBFT adoption, our method ensures byzantine fault tolerance, permitting DNN nodes to reach an agreement on inference validity swiftly and efficiently. We demonstrate the efficacy of PoQI through theoretical analysis and empirical evaluations, highlighting its potential to forge trust among unreliable DNN nodes.

Access

Open Access

Type of Publication

Paper Preprint Publication

Publisher

N/A

access here

28/08/2024

Photoconsistent and trajectory guided novel-view synthesis tool for uav cinematography based on autoregressive transformers

access here

Author(s)

Antidio Viguria Francisco Pérez-Grau Ioannis Pitas; Marco Montes-Grova Vasileios Mygdalis

Institution

Aristotle University of Thessaloniki;

Abstract

Novel view synthesis is the task of generating new images that render an object or scene from a different viewpoint than the one given. It aims to create new views of a specific subject starting from a number of pictures taken from known points of view. The novel view synthesis problem can be approached in two different ways: as a problem of interpolation of images between two known images or extrapolation of images from one or a subset of images. During this work, the problem of the extrapolation will be addressed, taking advantage of the fact that it is possible to pre-calculate the trajectories that we want the camera that takes the images to execute, from a series of known shot-types. Based on that and on the Autoregressive Transformers, it is presented an end-to-end tool for novel-view synthesis from previously unvisited points of view for aerial cinematography robots.

Access

Open Access

Type of Publication

Paper Preprint Publication

Publisher

N/A

access here

28/08/2024

Lightweight Human Gesture Recognition Using Multimodal Features

access here

Author(s)

Anestis Christidis Christos Papaioannidis Ioannis Mademlis; Ioannis Pitas;

Institution

Aristotle University of Thessaloniki;

Abstract

Human gesture recognition is a very important tool in human-computer or human-robot interaction. In many cases, such algorithms may need to be executed on systems with limited computational capabilities, due to size or weight constraints, introducing restrictions that can impede gesture recognition performance. This paper proposes a gesture recognition method that is based on a very simple and lightweight Deep Neural Network (DNN) architecture, suitable for embedded execution. In order to achieve increased accuracy without a large computational/memory overhead, the proposed method utilizes as input both full 2D human body skeletons and image patches extracted from regions of interest (e.g., around human arms) in each video frame. These two input types are processed in parallel by separate modules and the corresponding features are fused before being exploited for gesture recognition. Reliance on 2D skeleton sequences allows the utilization of a lightweight DNN architecture, while the image patches convey rich semantic information that enhances gesture recognition performance. This approach is unlike existing similar methods, which only exploit skeleton sequences. Experimental evaluation indeed shows increased recognition accuracy, indicating that the proposed method offers a reliable solution for human gesture recognition on embedded systems.

Access

Open Access

Type of Publication

Paper Preprint Publication

Publisher

N/A

access here

28/08/2024

Scalable bio-inspired training of Deep Neural Networks with FastHebb

access here

Author(s)

Claudio Gennaro; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;

Institution

ISTI-CNR;

Abstract

Recent work on sample efficient training of Deep Neural Networks (DNNs) proposed a semi-supervised methodology based on biologically inspired Hebbian learning, combined with traditional backprop-based training. Promising results were achieved on various computer vision benchmarks, in scenarios of scarce labeled data availability. However, current Hebbian learning solutions can hardly address large-scale scenarios due to their demanding computational cost. In order to tackle this limitation, in this contribution, we investigate a novel solution, named FastHebb (FH), based on the reformulation of Hebbian learning rules in terms of matrix multiplications, which can be executed more efficiently on GPU. Starting from Soft-Winner-Takes-All (SWTA) and Hebbian Principal Component Analysis (HPCA) learning rules, we formulate their improved FH versions: SWTA-FH and HPCA-FH. We experimentally show that the proposed approach accelerates training speed up to 70 times, allowing us to gracefully scale Hebbian learning experiments on large datasets and network architectures such as ImageNet and VGG.

Access

Open Access

Type of Publication

Journal article

Publisher

access here

27/08/2024

In the Wild Video Violence Detection: An Unsupervised Domain Adaptation Approach

access here

Author(s)

Carlos Santiago; Claudio Gennaro; Fabrizio Falchi; Giuseppe Amato; Luca Ciampi;

Institution

Institute of Information Science and Technologies Instituto Superior Técnico;

Abstract

This work addresses the challenge of video violence detection in data-scarce scenarios, focusing on bridging the domain gap that often hinders the performance of deep learning models when applied to unseen domains. We present a novel unsupervised domain adaptation (UDA) scheme designed to effectively mitigate this gap by combining supervised learning in the train (source) domain with unlabeled test (target) data. We employ single-image classification and multiple instance learning (MIL) to select frames with the highest classification scores, and, upon this, we exploit UDA techniques to adapt the model to unlabeled target domains. We perform an extensive experimental evaluation, using general-context data as the source domain and target domain datasets collected in specific environments, such as violent/non-violent actions in hockey matches and public transport. The results demonstrate that our UDA pipeline substantially enhances model performances, improving their generalization capabilities in novel scenarios without requiring additional labeled data.

Access

Open Access

Type of Publication

Journal article

Publisher

SN Computer Science

access here

16/08/2024

Deep Variational Learning for 360° Adaptive Streaming

access here

Author(s)

Alberto Del Bimbo; Federico Becattini; Francesco Marchetti; Lorenzo Seidenari; Lucile Sassatelli; Quentin Guimard

Institution

Institut Universitaire de France; Université Côte d'Azur; University of Florence;

Abstract

Prediction of head movements in immersive media is key to designing efficient streaming systems able to focus the bandwidth budget on visible areas of the content. However, most of the numerous proposals made to predict user head motion in 360° images and videos do not explicitly consider a prominent characteristic of the head motion data: its intrinsic uncertainty. In this article, we present an approach to generate multiple plausible futures of head motion in 360° videos, given a common past trajectory. To our knowledge, this is the first work that considers the problem of multiple head motion prediction for 360° video streaming. We introduce our discrete variational multiple sequence (DVMS) learning framework, which builds on deep latent variable models. We design a training procedure to obtain a flexible, lightweight stochastic prediction model compatible with sequence-to-sequence neural architectures. Experimental results on 4 different datasets show that our method DVMS outperforms competitors adapted from the self-driving domain by up to 41% on prediction horizons up to 5 sec., at lower computational and memory costs. To understand how the learned features account for the motion uncertainty, we analyze the structure of the learned latent space and connect it with the physical properties of the trajectories. We also introduce a method to estimate the likelihood of each generated trajectory, enabling the integration of DVMS in a streaming system. We hence deploy an extensive evaluation of the interest of our DVMS proposal for a streaming system. To do so, we first introduce a new Python-based 360° streaming simulator that we make available to the community. On real-world user, video, and networking data, we show that predicting multiple trajectories yields higher fairness between the traces, the gains for 20 to 30% of the users reaching up to 10% in visual quality for the best number K of trajectories to generate.

Access

Open Access

Type of Publication

Journal article

Publisher

ACM Transactions on Multimedia Computing, Communications, and Applications

access here

20/07/2024

Induced Permutations for Approximate Metric Search

access here

Author(s)

Claudio Gennaro; Giuseppe Amato; Lucia Vadicamo;

Institution

ISTI-CNR;

Abstract

Permutation-based Indexing (PBI) approaches have been proven to be particularly effective for conducting large-scale approximate metric searching. These methods rely on the idea of transforming the original metric objects into permutation representations, which can be efficiently indexed using data structures such as inverted files.
The standard conceptualization of permutation associated with a metric object involves only the use of object distances and their relative orders from a set of anchors called pivots. In this paper, we generalized this definition in order to enlarge the class of permutation representations that can be used by PBI approaches. In particular, we introduced the concept of permutation induced by a space transformation and a sorting function, and we investigated which properties these transformations should possess to produce permutations that are effective for metric search. Furthermore, as a practical outcome, we defined a new type of permutation representation that is calculated using distances from pairs of pivots. This proposed technique allowed us to produce longer permutations than traditional ones for the same number of object pivot distance calculations. The advantage lies in the fact that when longer permutations are employed, the use of inverted files built on permutation prefixes leads to greater efficiency in the search phase.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

15/07/2024

Efficient Unsupervised Visual Representation Learning with Explicit Cluster Balancing

access here

Author(s)

Georgios Tzimiropoulos; Ioannis Maniadis Metaxas; Ioannis Patras;

Institution

Queen Mary University of London;

Abstract

Self-supervised learning has recently emerged as the preeminent pretraining paradigm across and between modalities, with remarkable results. In the image domain specifically, group (or cluster) discrimination has been one of the most successful methods. However, such frameworks need to guard against heavily imbalanced cluster assignments to prevent collapse to trivial solutions. Existing works typically solve this by reweighing cluster assignments to promote balance, or with offline operations (e.g. regular re-clustering) that prevent collapse. However, the former typically requires large batch sizes, which leads to increased resource requirements, and the latter introduces scalability issues with regard to large datasets. In this work, we propose ExCB, a framework that tackles this problem with a novel cluster balancing method. ExCB estimates the relative size of the clusters across batches and balances them by adjusting cluster assignments, proportionately to their relative size and in an online manner. Thereby, it overcomes previous methods’ dependence on large batch sizes and is fully online, and therefore scalable to any dataset. We conduct extensive experiments to evaluate our approach and demonstrate that ExCB: a) achieves state-of-the-art results with significantly reduced resource requirements compared to previous works, b) is fully online, and therefore scalable to large datasets, and c) is stable and effective even with very small batch sizes.

Access

Open Access

Type of Publication

Conference paper

Publisher

European Conference on Computer Vision

access here

11/07/2024

NoiseBox: Towards More Efficient and Effective Learning with Noisy Labels

access here

Author(s)

Chen Feng; Georgios Tzimiropoulos; Ioannis Patras;

Institution

Queen Mary University of London;

Abstract

Despite the large progress in supervised learning with neural networks, there are significant challenges in obtaining high-quality, large-scale and accurately labelled datasets. In such contexts, how to learn in the presence of noisy labels has received more and more attention. Addressing this relatively intricate problem to attain competitive results predominantly involves designing mechanisms that select samples that are expected to have reliable annotations. However, these methods typically involve multiple off-the-shelf techniques, resulting in intricate structures. Furthermore, they frequently make implicit or explicit assumptions about the noise modes/ratios within the dataset. Such assumptions can compromise model robustness and limit its performance under varying noise conditions. Unlike these methods, in this work, we propose an efficient and effective framework with minimal hyperparameters that achieves SOTA results in various benchmarks. Specifically, we design an efficient and concise training framework consisting of a subset expansion module responsible for exploring non-selected samples and a model training module to further reduce the impact of noise, called NoiseBox . Moreover, diverging from common sample selection methods based on the “small loss” mechanism, we introduce a novel sample selection method based on the neighbouring relationships and label consistency in the feature space. Without bells and whistles, such as model co-training, self-supervised pre-training and semi-supervised learning, and with robustness concerning the settings of its few hyper-parameters, our method significantly surpasses previous methods on both CIFAR10/CIFAR100 with synthetic noise and real-world noisy datasets such as Red Mini-ImageNet, WebVision, Clothing1M and ANIMAL-10N.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

10/07/2024

Detecting Images Generated by Diffusers

access here

Author(s)

Andrea Esuli; Claudio Gennaro; Davide Alessandro Coccomini; Fabrizio Falchi; Giuseppe Amato;

Institution

ISTI-CNR;

Abstract

n recent years, the field of artificial intelligence has witnessed a remarkable surge in the generation of synthetic images, driven by advancements in deep learning techniques. These synthetic images, often created through complex algorithms, closely mimic real photographs, blurring the lines between reality and artificiality. This proliferation of synthetic visuals presents a pressing challenge: how to accurately and reliably distinguish between genuine and generated images. This paper, in particular, explores the task of detecting images generated by text-to-image diffusion models, highlighting the challenges and peculiarities of this field. To evaluate this, we consider images generated from captions in the MSCOCO and Wikimedia datasets using two state-of-the-art models: Stable Diffusion and GLIDE. Our experiments show that it is possible to detect the generated images using simple Multi-Layer Perceptrons (MLPs), starting from features extracted by CLIP or RoBERTa, or using traditional Convolutional Neural Networks (CNNs). These latter models achieve remarkable performances in particular when pretrained on large datasets. We also observe that models trained on images generated by Stable Diffusion can occasionally detect images generated by GLIDE, but only on the MSCOCO dataset. However, the reverse is not true. Lastly, we find that incorporating the associated textual information with the images in some cases can lead to a better generalization capability, especially if textual features are closely related to visual ones. We also discovered that the type of subject depicted in the image can significantly impact performance. This work provides insights into the feasibility of detecting generated images and has implications for security and privacy concerns in real-world applications. The code to reproduce our results is available at: https://github.com/davide-coccomini/Detecting-Images-Generated-by-Diffusers.

Access

Open Access

Type of Publication

Publication

Publisher

N/A

access here

09/07/2024

Virtual worlds, real risks: exploring user safety in the metaverse under the Digital Services Act

access here

Author(s)

Jean De Meyere Noémie Krack

Institution

KU Leuven

Abstract

Recently, the British police launched its first investigation into a case of virtual “rape” in the metaverse. This paper delves into the complex considerations that user safety and content moderation could pose through the prism of the recently adopted Digital Services Act (DSA). We first explore the current state of platform operating metaverses. Metaverses are similar to current online platforms yet are differentiated by the use of XR technologies. Despite the low number of users on such platforms, specific issues related to the metaverse, such as the rise of disinformation or virtual sex crimes, have already been reported. This paper considers the following research questions: What legal challenges do specific metaverse platforms present in terms of user safety, and how does the DSA address these challenges? Attention will be brought to the impact of relevant obligations for user safety in metaverses. We continue our analysis by addressing the lack of risk assessment obligations for platform operating metaverses, as they currently do not meet the threshold to be bound by these obligations under the DSA. We conclude with recommendations for policymakers on how to tackle the challenges posed by increased risks in the metaverse.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Congress towards a responsible development of Metaverse

access here

04/07/2024

Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection

access here

Author(s)

Christos Koutlis; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas

Abstract

The recently developed and publicly available synthetic image generation methods and services make it possible to create extremely realistic imagery on demand, raising great risks for the integrity and safety of online information. State-of-the-art Synthetic Image Detection (SID) research has led to strong evidence on the advantages of feature extraction from foundation models. However, such extracted features mostly encapsulate high-level visual semantics instead of fine-grained details, which are more important for the SID task. On the contrary, shallow layers encode low-level visual information. In this work, we leverage the image representations extracted by intermediate Transformer blocks of CLIP’s image-encoder via a lightweight network that maps them to a learnable forgery-aware vector space capable of generalizing exceptionally well. We also employ a trainable module to incorporate the importance of each Transformer block to the final prediction. Our method is compared against the state-of-the-art by evaluating it on 20 test datasets and exhibits an average +10.6% absolute performance improvement. Notably, the best performing models require just a single epoch for training (~8 minutes). Code available at https://github.com/mever-team/rine.

Access

Open Access

Type of Publication

Conference paper

Publisher

European Conference on Computer Vision

access here

23/06/2024

Towards Optimal Trade-offs in Knowledge Distillation for CNNs and Vision Transformers at the Edge

access here

Author(s)

Ioannis Kompatsiaris; John Violos Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas

Abstract

This paper discusses four facets of the Knowledge Distillation (KD) process for Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) architectures, particularly when executed on edge devices with constrained processing capabilities. First, we conduct a comparative analysis of the KD process between CNNs and ViT architectures, aiming to elucidate the feasibility and efficacy of employing different architectural configurations for the teacher and student, while assessing their performance and efficiency. Second, we explore the impact of varying the size of the student model on accuracy and inference speed, while maintaining a constant KD duration. Third, we examine the effects of employing higher resolution images on the accuracy, memory footprint and computational workload. Last, we examine the performance improvements obtained by fine-tuning the student model after KD to specific downstream tasks. Through empirical evaluations and analyses, this research provides AI practitioners with insights into optimal strategies for maximizing the effectiveness of the KD process on edge devices.

Access

Open Access

Type of Publication

Conference paper

Publisher

Signal Processing and Communication

access here

19/06/2024

Faster than real-time detection of shot boundaries, sampling structure and dynamic keyframes in video

access here

Author(s)

Hannes Fassold;

Institution

Joanneum Research;

Abstract

The detection of shot boundaries (hardcuts and short dissolves), sampling structure (progressive / interlaced / pulldown) and dynamic keyframes in a video are fundamental video analysis tasks which have to be done before any further high-level analysis tasks. We present a novel algorithm which does all these analysis tasks in an unified way, by utilizing a combination of inter-frame and intra-frame measures derived from the motion field and normalized cross correlation. The algorithm runs four times faster than real-time due to sparse and selective calculation of these measures.

Access

Open Access

Type of Publication

Publication

Publisher

Conference on Imaging, Signal Processing and Communication

access here

18/06/2024

Will VISIONE Remain Competitive in Lifelog Image Search?

access here

Author(s)

Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;

Institution

ISTI-CNR;

Abstract

VISIONE is a versatile video retrieval system supporting diverse search functionalities, including free-text, similarity, and temporal searches. Its recent success in securing first place in the 2024 Video Browser Showdown (VBS) highlights its effectiveness.
Originally designed for analyzing, indexing, and searching diverse video content, VISIONE can also be adapted to images from lifelog cameras thanks to its reliance on frame-based representations and retrieval mechanisms.

In this paper, we present an overview of VISIONE’s core characteristics and the adjustments made to accommodate lifelog images. These adjustments primarily focus on enhancing result visualization within the GUI, such as grouping images by date or hour to align with lifelog dataset imagery. It’s important to note that while the GUI has been updated, the core search engine and visual content analysis components remain unchanged from the version presented at VBS 2024. Specifically, metadata such as local time, GPS coordinates, and concepts associated with images are not indexed or utilized in the system. Instead, the system relies solely on the visual content of the images, with date and time information extracted from their filenames, which are utilized exclusively within the GUI for visualization purposes.

Our objective is to evaluate the system’s performance within the Lifelog Search Challenge, emphasizing reliance on visual content analysis without additional metadata.

Access

Open Access

Type of Publication

Conference paper

Publisher

access here

11/06/2024

Visual Objectification in Films: Towards a New AI Task for Video Interpretation

access here

Author(s)

Bergman Clement Frédéric Precioso Julie Tores Léa Andolfi Lucile Sassatelli; Magali Guaresi Sarah Lecossais Thierry Devars Victor Ecrement Virginie Julliard Wu Hui-Yin

Institution

Inria; Institut Universitaire de France; Sorbonne Université Université Côte d'Azur; Université Sorbonne Paris Nord

Abstract

In film gender studies the concept of “male gaze” refers to the way the characters are portrayed on-screen as objects of desire rather than subjects. In this article we introduce a novel video-interpretation task to detect character objectification in films. The purpose is to reveal and quantify the usage of complex temporal patterns operated in cinema to produce the cognitive perception of objectification. We introduce the ObyGaze12 dataset made of 1914 movie clips densely annotated by experts for objectification concepts identified in film studies and psychology. We evaluate recent vision models show the feasibility of the task and where the challenges remain with concept bottleneck models. Our new dataset and code are made available to the community.

Access

Open Access

Type of Publication

Conference paper Publication

Publisher

IEEE Conference on Computer Vision and Pattern Recognition

access here

10/06/2024

Reassembling digital archives—strategies for counter-archiving

access here

Author(s)

Tobias Blanke;

Institution

University of Amsterdam;

Abstract

Archives have long been a key concern of academic debates about truth, memory, recording and power and are important sites for social sciences and humanities research. This has been the case for traditional archives, but these debates have accelerated with the digital transformation of archives. The proliferation of digital tools and the fast-growing increase in digital materials have created very large digitised and born-digital archives. This article investigates how new digital archives continue existing archival practices while at the same time discontinuing them. We present novel methodologies and tools for changing memory and power relations in digital archives through new ways of reassembling marginalised, non-canonical entities in digital archives. Reassembling digital archives can take advantage of the materiality and the algorithmic processuality of digital collections and reshape them to inscribe lost voices and previously ignored differences. Digital archives are not fixed and are changed with new research and political questions and are only identified through new questions. The article presents six distinct techniques and strategies to reassemble digital archives and renders these according to three different types of new digital archives. We consider both the extension of archives towards evidence that is otherwise thrown away as well as the provision of new intensive, non-discriminatory viewpoints on existing collections.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

10/06/2024

[ Front Matter ] MAD ’24: Proceedings of the 3rd ACM International Workshop on Multimedia AI against Disinformation

access here

Author(s)

Adrian Popescu; Bogdan Ionescu; Cristian Stanciu; Giorgios Kordopatis-Zilos; Luca Cuccovillo; Roberto Cardelli; Symeon Papadopoulos

Institution

CEA; CERTH - Center for Research and Technology Hellas Czech Technical University in Prague; Fraunhofer IDMT; Mercatorum University; University Politehnica of Bucharest

Abstract

Front matter of the proceedings of the 3nd ACM International Workshop on Multimedia AI against Disinformation, held in Phuket (Thailand) on June 10th, 2024.

The full proceedings are available online at https://dl.acm.org/doi/proceedings/10.1145/3643491.

Access

Open Access

Type of Publication

Book section

Publisher

ACM Association for Computing Machinery

access here

07/06/2024

MAD ’24 Workshop: Multimedia AI against Disinformation

access here

Author(s)

Adrian Popescu; Bogdan Ionescu; Cristian Stanciu; Giorgios Kordopatis-Zilos; Luca Cuccovillo; Roberto Cardelli; Symeon Papadopoulos

Institution

CEA; CERTH - Center for Research and Technology Hellas Czech Technical University in Prague; Fraunhofer IDMT; Mercatorum University; University Politehnica of Bucharest

Abstract

Synthetic media generation and manipulation have seen rapid advancements in recent years, making it increasingly easy to create multimedia content that is indistinguishable to the human observer. Moreover, generated content can be used maliciously by individuals and organizations in order to spread disinformation, posing a significant threat to society and democracy. Hence, there is an urgent need for AI tools geared towards facilitating a timely and effective media verification process. The MAD’24 workshop seeks to bring together people with diverse backgrounds who are dedicated to combating disinformation in multimedia through the means of AI, by fostering an environment for exploring innovative ideas and sharing experiences. The research areas of interest encompass the identification of manipulated or generated content, along with the investigation of the dissemination of disinformation and its societal repercussions. Recognizing the significance of multimedia, the workshop emphasizes the joint analysis of various modalities within content, as verification can be improved by aggregating multiple forms of content.

Access

Open Access

Type of Publication

Conference paper

Publisher

ACM on Multimedia Retrieval

access here

05/06/2024

Towards Quantitative Evaluation of Explainable AI Methods for Deepfake Detection

access here

Author(s)

Evlampios Apostolidis; Konstantinos Tsigos Spyridon Baxevanakis; Symeon Papadopoulos Vasileios Mezaris

Institution

CERTH - Center for Research and Technology Hellas

Abstract

In this paper we propose a new framework for evaluating the performance of explanation methods on the decisions of a deepfake detector. This framework assesses the ability of an explanation method to spot the regions of a fake image with the biggest influence on the decision of the deepfake detector, by examining the extent to which these regions can be modified through a set of adversarial attacks, in order to flip the detector’s prediction or reduce its initial prediction; we anticipate a larger drop in deepfake detection accuracy and prediction, for methods that spot these regions more accurately. Based on this framework, we conduct a comparative study using a state-of-the-art model for deepfake detection that has been trained on the FaceForensics++ dataset, and five explanation methods from the literature. The findings of our quantitative and qualitative evaluations document the advanced performance of the LIME explanation method against the other compared ones, and indicate this method as the most appropriate for explaining the decisions of the utilized deepfake detector.

Access

Open Access

Type of Publication

Conference paper

Publisher

ACM on Multimedia Retrieval

access here

04/06/2024

A Human-Annotated Video Dataset for Training and Evaluation of 360-Degree Video Summarization Methods

access here

Author(s)

Evlampios Apostolidis; Ioannis Kontostathis; Vasileios Mezaris

Institution

CERTH - Center for Research and Technology Hellas

Abstract

In this paper we introduce a new dataset for 360-degree video summarization: the transformation of 360-degree video content to concise 2D-video summaries that can be consumed via traditional devices, such as TV sets and smartphones. The dataset includes ground-truth human-generated summaries, that can be used for training and objectively evaluating 360-degree video summarization methods. Using this dataset, we train and assess two state-of-the-art summarization methods that were originally proposed for 2D-video summarization, to serve as a baseline for future comparisons with summarization methods that are specifically tailored to 360-degree video. Finally, we present an interactive tool that was developed to facilitate the data annotation process and can assist other annotation activities that rely on video fragment selection.

Access

Open Access

Type of Publication

Conference paper

Publisher

access here

03/06/2024

MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection

access here

Author(s)

Davide Alessandro Coccomini; Fabrizio Falchi; Giorgios Kordopatis-Zilos; Giuseppe Amato; Roberto Caldelli Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas Czech Technical University in Prague; ISTI-CNR;

Abstract

In this paper, we present MINTIME, a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals
and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., averaging or max operations, or only focus on the largest face in the video, our proposed method learns to accurately detect spatio-temporal inconsistencies across multiple identities in a video through a Spatio-Temporal Transformer combined with a Convolutional Neural Network backbone. This is achieved through an Identityaware Attention mechanism that applies a masking operation on the face sequence to process each identity independently, which enables effective video-level aggregation. Furthermore, our system incorporates two novel embedding schemes: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of the face sequences of each identity, and (ii) the Size Embedding, which captures the relative sizes of the faces to the video frames. MINTIME achieves state-of-the-art performance on the ForgeryNet dataset, with a remarkable improvement of up to 14% AUC in videos containing multiple people. Moreover, it demonstrates very robust generalization capabilities in crossforgery and cross-dataset settings. The code is publicly available at: https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

27/05/2024

Evaluating Performance and Trends in Interactive Video Retrieval: Insights From the 12th VBS Competition

access here

Author(s)

Cathal Gurrin; Fabio Carrara; Florian Spiess; Jakub Lokoč; Klaus Schoeffmann; Ladislav Peška; Loris Sauter; Luca Rossetto; Lucia Vadicamo; Minh-Triet Tran Nico Hezel Nicola Messina; Rahel Arnold; Sebastian Lubos Stefanos Vrochidis; Thao-Nhu Nguyen; Werner Bailer; Xingham Li Zhixin Ma;

Institution

CERTH - Center for Research and Technology Hellas Charles University; Dublin City University; HTW Berlin; Institute of Information Science and Technologies ISTI-CNR; Joanneum Research; University of Basel; University of Klagenfurt University of Zurich; Vietnam National University Wuhan University

Abstract

This paper conducts a thorough examination of the 12th Video Browser Showdown (VBS) competition, a well-established international benchmarking campaign for interactive video search systems.
The annual VBS competition has witnessed a steep rise in the popularity of multimodal embedding-based approaches in interactive video retrieval. Most of the thirteen systems participating in VBS 2023 utilized a CLIP-based cross-modal search model, allowing the specification of free-form text queries to search visual content. This shared emphasis on joint embedding models contributed to balanced performance across various teams. However, the distinguishing factors of the top-performing teams included the adept combination of multiple models and search modes, along with the capabilities of interactive interfaces to facilitate and refine the search process.
Our work provides an overview of the state-of-the-art approaches employed by the participating systems and conducts a thorough analysis of their search logs, which record user interactions and results of their queries for each task. Our comprehensive examination of the VBS competition offers assessments of the effectiveness of the retrieval models, browsing efficiency, and user query patterns. Additionally, it provides valuable insights into the evolving landscape of interactive video retrieval and its future challenges.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Access

access here

25/05/2024

Universal Local Attractors on Graphs

access here

Author(s)

Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas

Abstract

Being able to express broad families of equivariant or invariant attributed graph functions is a popular measuring stick of whether graph neural networks should be employed in practical applications. However, it is equally important to find deep local minima of losses (i.e., produce outputs with much smaller loss values compared to other minima), even when architectures cannot express global minima. In this work we introduce the architectural property of attracting optimization trajectories to local minima as a means of achieving smaller loss values. We take first steps in satisfying this property for losses defined over attributed undirected unweighted graphs with an architecture called universal local attractor (ULA). This refines each dimension of end-to-end-trained node feature embeddings based on graph structure to track the optimization trajectories of losses satisfying some mild conditions. The refined dimensions are then linearly pooled to create predictions. We experiment on 11 tasks, from node classification to clique detection, on which ULA is comparable with or outperforms popular alternatives of similar or greater theoretical expressive power.

Access

Open Access

Type of Publication

Publication

Publisher

N/A

access here

01/05/2024

Elastic feature consolidation for cold start examplar-free incremental learning

access here

Author(s)

Albin Soutif-Cormerais Andrew Bagdanov Joost van de Weijer Simone Magistri Tommaso Trinci

Institution

Computer Vision Center University of Florence;

Abstract

Exemplar-Free Class Incremental Learning (EFCIL) aims to learn from a sequence of tasks without having access to previous task data. In this paper, we consider the challenging Cold Start scenario in which insufficient data is available in the first task to learn a high-quality backbone. This is especially challenging for EFCIL since it requires high plasticity, which results in feature drift which is difficult to compensate for in the exemplar-free setting. To address this problem, we propose a simple and effective approach that consolidates feature representations by regularizing drift in directions highly relevant to previous tasks and employs prototypes to reduce task-recency bias. Our method, called Elastic Feature Consolidation (EFC), exploits a tractable second-order approximation of feature drift based on an Empirical Feature Matrix (EFM). The EFM induces a pseudo-metric in feature space which we use to regularize feature drift in important directions and to update Gaussian prototypes used in a novel asymmetric cross entropy loss which effectively balances prototype rehearsal with data from new tasks. Experimental results on CIFAR-100, Tiny-ImageNet, ImageNet-Subset and ImageNet-1K demonstrate that Elastic Feature Consolidation is better able to learn new tasks by maintaining model plasticity and significantly outperform the state-of-the-art.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

29/04/2024

MAP-Elites with Transverse Assessment for Multimodal Problems in Creative Domains

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Marvin Zammit;

Institution

University of Malta

Abstract

The recent advances in language-based generative models have paved the way for the orchestration of multiple generators of different artefact types (text, image, audio, etc.) into one system. Presently, many open-source pre-trained models combine text with other modalities, thus enabling shared vector embeddings to be compared across different generators. Within this context we propose a novel approach to handle multimodal creative tasks using Quality Diversity evolution. Our contribution is a variation of the MAP-Elites algorithm, MAP-Elites with Transverse Assessment (MEliTA), which is tailored for multimodal creative tasks and leverages deep learned models that assess coherence across modalities. MEliTA decouples the artefacts’ modalities and promotes cross-pollination between elites. As a test bed for this algorithm, we generate text descriptions and cover images for a hypothetical video game and assign each artefact a unique modality-specific behavioural characteristic. Results indicate that MEliTA can improve text-to-image mappings within the solution space, compared to a baseline MAP-Elites algorithm that strictly treats each image-text pair as one solution. Our approach represents a significant step forward in multimodal bottom-up orchestration and lays the groundwork for more complex systems coordinating multimodal creative agents in the future.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

24/04/2024

Porting Large Language Models to Mobile Devices for Question Answering

access here

Author(s)

Hannes Fassold;

Institution

Joanneum Research;

Abstract

Deploying Large Language Models (LLMs) on mobile devices makes all the capabilities of natural language processing available on the device. An important use case of LLMs is question answering, which can provide accurate and contextually relevant answers to a wide array of user queries. We describe how we managed to port state of the art LLMs to mobile devices, enabling them to operate natively on the device. We employ the llama.cpp framework, a flexible and self-contained C++ framework for LLM inference. We selected a 6-bit quantized version of the Orca-Mini-3B model with 3 billion parameters and present the correct prompt format for this model. Experimental results show that LLM inference runs in interactive speed on a Galaxy S21 smartphone and that the model delivers high-quality answers to user queries related to questions from different subjects like politics, geography or history.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

12/04/2024

nSimplex Zen: a novel dimensionality reduction for euclidean and Hilbert spaces

access here

Author(s)

Connor Richard Lucia Vadicamo;

Institution

ISTI-CNR; University of St. Andrews

Abstract

Dimensionality reduction techniques map values from a high dimensional space to one with a lower dimension. The result is a space which requires less physical memory and has a faster distance calculation. These techniques are widely used where required properties of the reduced-dimension space give an acceptable accuracy with respect to the original space. Many such transforms have been described. They have been classified in two main groups: linear and topological. Linear methods such as Principal Component Analysis (PCA) and Random Projection (RP) define matrix-based transforms into a lower dimension of Euclidean space. Topological methods such as Multidimensional Scaling (MDS) attempt to preserve higher-level aspects such as the nearest-neighbour relation, and some may be applied to non-Euclidean spaces. Here, we introduce nSimplex Zen, a novel topological method of reducing dimensionality. Like MDS, it relies only upon pairwise distances measured in the original space. The use of distances, rather than coordinates, allows the technique to be applied to both Euclidean and other Hilbert spaces, including those governed by Cosine, Jensen–Shannon and Quadratic Form distances. We show that in almost all cases, due to geometric properties of high-dimensional spaces, our new technique gives better properties than others, especially with reduction to very low dimensions.

Access

Open Access

Type of Publication

Journal article

Publisher

ACM Transactions on Knowledge Discovery from Data

access here

09/04/2024

AIMultimediaLab at MediaEval 2023: Studying the Generalization of Media Memorability Methods.

access here

Author(s)

Bogdan Ionescu; Mihai Gabriel Constantin

Institution

University Politehnica of Bucharest

Abstract

Video memorability is one of the vital aspects of subjective multimedia perception and, as such, is closely and thoroughly studied in the computer vision literature. This paper presents the methods proposed by AIMultimediaLab for the generalization subtask of the 2023 edition of the Predicting Video Memorability task. We explore several methods for augmenting the training process for a video Vision Transformer network, aiming to increase the number of hard-to-predict samples in the training set in order to increase the robustness of the targeted AI model. Starting from our previous works, we analyze several visual features that define “hard-to-predict” samples, and based on these features, we augment the training data of our models to target those specific videos that pose problems for memorability prediction.

Access

Open Access

Type of Publication

Conference paper

Publisher

MediaEval

access here

26/03/2024

Interactive multimodal video search: an extended post-evaluation for the VBS 2022 competition

access here

Author(s)

Claudio Vairo; Fabio Carrara; Jakub Lokoč; Kai Uwe Barthel; Klaus Schoeffmann; Konstantin Schall; Ladislav Peška; Lucia Vadicamo; Werner Bailer;

Institution

HTW Berlin; ISTI-CNR; Joanneum Research; University of Klagenfurt

Abstract

CLIP-based text-to-image retrieval has proven to be very effective at the interactive video retrieval competition Video Browser Showdown 2022, where all three top-scoring teams had implemented a variant of a CLIP model in their system. Since the performance of these three systems was quite close, this post-evaluation was designed to get better insights on the differences of the systems and compare the CLIP-based text-query retrieval engines by introducing slight modifications to the original competition settings. An extended analysis of the overall results and the retrieval performance of all systems’ functionalities shows that a strong text retrieval model certainly helps, but has to be coupled with extensive browsing capabilities and other query-modalities to consistently solve known-item-search tasks in a large scale video database.

Access

Open Access

Type of Publication

Journal article

Publisher

International Journal of Multimedia Information Retrieval

access here

18/03/2024

Binary quantification and dataset shift: an experimental investigation

access here

Author(s)

Alejandro Moreo; Fabrizio Sebastiani; Pablo González

Institution

ISTI-CNR; University of Oviedo

Abstract

Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data, and is of special interest when the labelled data on which the predictor has been trained and the unlabelled data are not IID, i.e., suffer from dataset shift. To date, quantification methods have mostly been tested only on a special case of dataset shift, i.e., prior probability shift; the relationship between quantification and other types of dataset shift remains, by and large, unexplored. In this work we carry out an experimental analysis of how current quantification algorithms behave under different types of dataset shift, in order to identify limitations of current approaches and hopefully pave the way for the development of more broadly applicable methods. We do this by proposing a fine-grained taxonomy of types of dataset shift, by establishing protocols for the generation of datasets affected by these types of shift, and by testing existing quantification methods on the datasets thus generated. One finding that results from this investigation is that many existing quantification methods that had been found robust to prior probability shift are not necessarily robust to other types of dataset shift. A second finding is that no existing quantification method seems to be robust enough to dealing with all the types of dataset shift we simulate in our experiments. The code needed to reproduce all our experiments is publicly available at https:// github. com/pglez82/quant_ datasetshift

Access

Open Access

Type of Publication

Journal article

Publisher

Data Mining and Knowledge Discovery

access here

13/03/2024

One-Shot Neural Face Reenactment via Finding Directions in GAN’s Latent Space

access here

Author(s)

Christos Tzelepis; Georgios Tzimiropoulos; Ioannis Patras; Stella Bounareli; Vasileios Argyriou;

Institution

Kingston University London; Queen Mary University of London; University of Nottingham

Abstract

In this paper, we present our framework for neural face/head reenactment whose goal is to transfer the 3D head orientation and expression of a target face to a source face. Previous methods focus on learning embedding networks for identity and head pose/expression disentanglement which proves to be a rather hard task, degrading the quality of the generated images. We take a different approach, bypassing the training of such networks, by using (fine-tuned) pre-trained GANs which have been shown capable of producing high-quality facial images. Because GANs are characterized by weak controllability, the core of our approach is a method to discover which directions in latent GAN space are responsible for controlling head pose and expression variations. We present a simple pipeline to learn such directions with the aid of a 3D shape model which, by construction, inherently captures disentangled directions for head pose, identity, and expression. Moreover, we show that by embedding real images in the GAN latent space, our method can be successfully used for the reenactment of real-world faces. Our method features several favorable properties including using a single source image (one-shot) and enabling cross-person reenactment. Extensive qualitative and quantitative results show that our approach typically produces reenacted faces of notably higher quality than those produced by state-of-the-art methods for the standard benchmarks of VoxCeleb1 & 2.

Access

Open Access

Type of Publication

Journal article

Publisher

access here

04/03/2024

Self-Supervised Facial Representation Learning with Facial Region Awareness

access here

Author(s)

Ioannis Patras; Zheng Gao;

Institution

Queen Mary University of London;

Abstract

Self-supervised pre-training has been proved to be effective in learning transferable representations that benefit various visual tasks. This paper asks this question: can self-supervised pre-training learn general facial representations for various facial analysis tasks? Recent efforts toward this goal are limited to treating each face image as a whole, i.e., learning consistent facial representations at the image-level, which overlooks the consistency of local facial representations (i.e., facial regions like eyes, nose, etc). In this work, we make a first attempt to propose a novel self-supervised facial representation learning framework to learn consistent global and local facial representations, Facial Region Awareness (FRA). Specifically, we explicitly enforce the consistency of facial regions by matching the local facial representations across views, which are extracted with learned heatmaps highlighting the facial regions. Inspired by the mask prediction in supervised semantic segmentation, we obtain the heatmaps via cosine similarity between the per-pixel projection of feature maps and facial mask embeddings computed from learnable positional embeddings, which leverage the attention mechanism to globally look up the facial image for facial regions. To learn such heatmaps, we formulate the learning of facial mask embeddings as a deep clustering problem by assigning the pixel features from the feature maps to them. The transfer learning results on facial classification and regression tasks show that our FRA outperforms previous pre-trained models and more importantly, using ResNet as the unified backbone for various tasks, our FRA achieves comparable or even better performance compared with SOTA methods in facial analysis tasks.

Access

Open Access

Type of Publication

Conference paper

Publisher

IEEE Conference on Computer Vision and Pattern Recognition

access here

15/02/2024

AI and Media

access here

Author(s)

Aleksandra Kuczerawy Lidia Dutklewicz; Noémie Krack Peggy Valcke

Institution

KU Leuven

Abstract

This chapter discusses how AI technologies permeate the media sector. It sketches opportunities and benefits of the use of AI in media content gathering and production, in media content distribution, in fact-checking and content moderation. The chapter then zooms in on ethical and legal risks raised by AI-driven media applications: lack of data availability, poor data quality and bias in training datasets, lack of transparency, risks for the right to freedom of expression, threats to media freedom and pluralism online, and threats to media independence. Finally, the
chapter introduces the relevant elements of the EU legal framework which aim to mitigate these risks, such as the Digital Services Act, the European Media Freedom Act proposal and the AI Act proposal.

Access

Open Access

Type of Publication

Book section

Publisher

Cambridge handbook on the law, ethics and policy of Artificial Intelligence

access here

31/01/2024

Filter-Pruning of Lightweight Face Detectors Using a Geometric Median Criterion

access here

Author(s)

Konstantinos Gkrispanis; Nikolaos Gkalelis; Vasileios Mezaris

Institution

CERTH - Center for Research and Technology Hellas

Abstract

Face detectors are becoming a crucial component of many applications, including surveillance, that often have to run on edge devices with limited processing power and memory. Therefore, there’s a pressing demand for compact face detection models that can function efficiently across resource-constrained devices. Over recent years, network pruning techniques have attracted a lot of attention from researchers. These methods haven’t been well examined in the context of face detectors, despite their expanding popularity. In this paper, we implement filter pruning on two already small and compact face detectors, named EXTD (Extremely Tiny Face Detector) and EResFD (Efficient ResNet Face Detector). The main pruning algorithm that we utilize is Filter Pruning via Geometric Median (FPGM), combined with the Soft Filter Pruning (SFP) iterative procedure. We also apply L1 Norm pruning, as a baseline to compare with the proposed approach. The experimental evaluation on the WIDER FACE dataset indicates that the proposed approach has the potential to further reduce the model size of already lightweight face detectors, with limited accuracy loss, or even with small accuracy gain for low pruning rates.

Access

Open Access

Type of Publication

Conference paper

Publisher

Winter Conference on Applications of Computer Vision

access here

31/01/2024

Facilitating the Production of Well-tailored Video Summaries for Sharing on Social Media

access here

Author(s)

Evlampios Apostolidis; Konstantinos Apostolidis; Vasileios Mezaris

Institution

CERTH - Center for Research and Technology Hellas

Abstract

This paper presents a web-based tool that facilitates the production of tailored summaries for online sharing on social media. Through an interactive user interface, it supports a “one-click” video summarization process. Based on the integrated AI models for video summarization and aspect ratio transformation, it facilitates the generation of multiple summaries of a full-length video according to the needs of target platforms with regard to the video’s length and aspect ratio.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Multimedia Modeling

access here

30/01/2024

An Integrated System for Spatio-Temporal Summarization of 360-degrees Videos

access here

Author(s)

Evlampios Apostolidis; Ioannis Kontostathis; Vasileios Mezaris

Institution

CERTH - Center for Research and Technology Hellas

Abstract

In this work, we present an integrated system for spatiotemporal summarization of 360-degrees videos. The video summary production mainly involves the detection of salient events and their synopsis into a concise summary. The analysis relies on state-of-the-art methods for saliency detection in 360-degrees video (ATSal and SST-Sal) and video summarization (CA-SUM). It also contains a mechanism that classifies a 360-degrees video based on the use of static or moving camera during recording and decides which saliency detection method will be used, as well as a 2D video production component that is responsible to create a conventional 2D video containing the salient events in the 360-degrees video. Quantitative evaluations using two datasets for 360-degrees video saliency detection (VR-EyeTracking, Sports-360) show the accuracy and positive impact of the developed decision mechanism, and justify our choice to use two different methods for detecting the salient events. A qualitative analysis using content from these datasets, gives further insights about the functionality of the decision mechanism, shows the pros and cons of each used saliency detection method and demonstrates the advanced performance of the trained summarization method against a more conventional approach.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Multimedia Modeling

access here

29/01/2024

Few-Shot Object Detection as a Service: Facilitating Training and Deployment for Domain Experts

access here

Author(s)

Bogdan Ionescu; Hannes Fassold; Mihai Dogariu Werner Bailer;

Institution

Joanneum Research; University Politehnica of Bucharest

Abstract

We propose a service-based approach for training few-shot object detectors and running inference with these models. This eliminates the need to write code or execute scripts, thus enabling domain experts to train their own detectors. The training service implements an efficient ensemble learning method in order to obtain more robust models without parameter search. The entire pipeline is deployed as a single container and can be controlled from a web user interface.

Access

Open Access

Type of Publication

Conference paper

Publisher

Multimedia Modeling Conference

access here

29/01/2024

VISIONE 5.0: Enhanced User Interface and AI Models for VBS2024

access here

Author(s)

Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;

Institution

ISTI-CNR;

Abstract

In this paper, we introduce the fifth release of VISIONE, an advanced video retrieval system offering diverse search functionalities. The user can search for a target video using textual prompts, drawing objects and colors appearing in the target scenes in a canvas, or images as query examples to search for video keyframes with similar content.
Compared to the previous version of our system, which was runner-up at VBS 2023, the forthcoming release, set to participate in VBS 2024, showcases a refined user interface that enhances its usability and updated AI models for more effective video content analysis.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

22/01/2024

SMEMO: Social Memory for Trajectory Forecasting

access here

Author(s)

Alberto Del Bimbo; Federico Becattini; Francesco Marchetti; Lorenzo Seidenari;

Institution

Università degli Studi di Firenze; University of Florence;

Abstract

Effective modeling of human interactions is of utmost importance when forecasting behaviors such as future trajectories. Each individual, with its motion, influences surrounding agents since everyone obeys to social non-written rules such as collision avoidance or group following. In this paper we model such interactions, which constantly evolve through time, by looking at the problem from an algorithmic point of view, i.e., as a data manipulation task. We present a neural network based on an end-to-end trainable working memory, which acts as an external storage where information about each agent can be continuously written, updated and recalled. We show that our method is capable of learning explainable cause-effect relationships between motions of different agents, obtaining state-of-the-art results on multiple trajectory forecasting datasets.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transaction on Pattern Analysis and Machine Intelligence

access here

09/01/2024

Cascaded Transformer-based Networks for Wikipedia Large-scale Image-Caption Matching

access here

Author(s)

Andrea Esuli; Davide Alessandro Coccomini; Fabrizio Falchi; Nicola Messina;

Institution

Institute of Information Science and Technologies

Abstract

With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage.
In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by the recent Transformer networks able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while maintaining bounded the overall computational complexity at inference time.
With respect to other approaches in the challenge leaderboard, we can achieve remarkable improvements over the previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources.

Access

Open Access

Type of Publication

Journal article

Publisher

Multimedia Tools and Applications

access here

03/01/2024

An Analysis of Initial Training Strategies for Exemplar-Free Class-Incremental Learning

access here

Author(s)

Adrian Popescu; Bertrand Delezoide Céline Hudelot; David Picard; Eva Feillet; Grégoire Petit; Michael Soumm

Institution

CEA; Université Gustave Eiffel; Université Paris-Saclay;

Abstract

Class-Incremental Learning (CIL) aims to build classification models from data streams. At each step of the CIL process, new classes must be integrated into the model. Due to catastrophic forgetting, CIL is particularly challenging when examples from past classes cannot be stored, the case on which we focus here. To date, most approaches are based exclusively on the target dataset of the CIL process. However, the use of models pre-trained in a self-supervised way on large amounts of data has recently gained momentum.
The initial model of the CIL process may only use the first batch of the target dataset, or also use pre-trained weights obtained on an auxiliary dataset. The choice between these two initial learning strategies can significantly influence the performance of the incremental learning model, but has not yet been studied in depth. Performance is also influenced by the choice of the CIL algorithm, the neural architecture, the nature of the target task, the distribution of classes in the stream and the number of examples available for learning.
We conduct a comprehensive experimental study to assess the roles of these factors. We present a statistical analysis framework that quantifies the relative contribution of each
factor to incremental performance. Our main finding is that the initial training strategy is the dominant factor influencing the average incremental accuracy, but that the choice of
CIL algorithm is more important in preventing forgetting.
Based on this analysis, we propose practical recommendations for choosing the right initial training strategy for a given incremental learning use case. These recommendations are intended to facilitate the practical deployment of incremental learning.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

03/01/2024

Joint-Based Action Progress Prediction

access here

Author(s)

Alberto Del Bimbo; Davide Pucci Federico Becattini;

Institution

Università degli Studi di Firenze; University of Florence;

Abstract

Action understanding is a fundamental computer vision branch for several applications, ranging from surveillance to robotics. Most works deal with localizing and recognizing the action in both time and space, without providing a characterization of its evolution. Recent works have addressed the prediction of action progress, which is an estimate of how far the action has advanced as it is performed. In this paper, we propose to predict action progress using a different modality compared to previous methods: body joints. Human body joints carry very precise information about human poses, which we believe are a much more lightweight and effective way of characterizing actions and therefore their execution. Estimating action progress can in fact be determined based on the understanding of how key poses follow each other during the development of an activity. We show how an action progress prediction model can exploit body joints and integrate it with modules providing keypoint and action information in order to be run directly from raw pixels. The proposed method is experimentally validated on the Penn Action Dataset.

Access

Open Access

Type of Publication

Journal article

Publisher

MDPI

access here

02/01/2024

Who’s in My Archive? An End-to-End Framework for Automatic Annotation of TV Personalities

access here

Author(s)

Alberto Messina; Angelo Bruccoleri; Fulvio Negro; Maurizio Montagnuolo; Roberto Iacoviello;

Institution

RAI;

Abstract

Knowledge about the presence of people in a video is a valuable source of information in many applications, such as video annotation, retrieval and summarisation. The contribution of this paper goes in the direction of demonstrating how AI-based face processing technologies can be profitably used to perform video annotation of television content. To validate our vision, we developed the Face Management Framework (FMF), which implements an end-to-end pipeline for face analysis and content annotation based on few-shot or zero-shot face embedding extraction models. The results of the test campaign of the system show that the key performance indicators that we defined were exceeded by a wide margin, demonstrating how media workflows could greatly benefit from the tool and the efficiency improvements it brings.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Big Data

access here

01/01/2024

Deepfake detection by exploiting surface anomalies: the SurFake approach

access here

Author(s)

Alberto Del Bimbo; Andrea Ciamarra Federico Becattini; Lorenzo Seidenari; Roberto Cardelli;

Institution

Mercatorum University; University of Florence;

Abstract

The ever-increasing use of synthetically generated content in different sectors of our everyday life, one for all media information, poses a strong need for deepfake detection tools in order to avoid the proliferation of altered messages. The process to identify manipulated content, in particular images and videos, is basically performed by looking for the presence of some inconsistencies and/or anomalies specifically due to the fake generation process. Different techniques exist in the scientific literature that exploit diverse ad-hoc features in order to highlight possible modifications. In this paper, we propose to investigate how deepfake creation can impact on the characteristics that the whole scene had at the time of the acquisition. In particular, when an image (video) is captured the overall geometry of the scene (e.g. surfaces) and the acquisition process (e.g. illumination) determine a univocal environment that is directly represented by the image pixel values; all these intrinsic relations are possibly changed by the deepfake generation process. By resorting to the analysis of the characteristics of the surfaces depicted in the image it is possible to obtain a descriptor usable to train a CNN for deepfake detection: we refer to such an approach as SurFake. Experimental results carried out on the FF + + dataset for different kinds of deep-fake forgeries and diverse deep learning models confirm that such a feature can be adopted to discriminate between pristine and altered images; furthermore, experiments witness that it can also be combined with visual data to provide a certain improvement in terms of detection accuracy.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

01/01/2024

Addressing Limitations of State-Aware Imitation Learning for Autonomous Driving

access here

Author(s)

Alberto Del Bimbo; Federico Becattini; Lorenzo Seidenari; Luca Cultrera; Pietro Pala

Institution

Università degli Studi di Firenze; University of Florence;

Abstract

Conditional Imitation learning is a common and effective approach to train autonomous driving agents. However, two issues limit the full potential of this approach: (i) the inertia problem, a special case of causal confusion where the agent mistakenly correlates low speed with no acceleration, and (ii) low correlation between offline and online performance due to the accumulation of small errors that brings the agent in a previously unseen state. Both issues are critical for state-aware models, yet informing the driving agent of its internal state as well as the state of the environment is of crucial importance. In this article we propose a multi-task learning agent based on a multi-stage vision transformer with state token propagation. We feed the state of the vehicle along with the representation of the environment as a special token of the transformer and propagate it throughout the network. This allows us to tackle the aforementioned issues from different angles: guiding the driving policy with learned stop/go information, performing data augmentation directly on the state of the vehicle and visually explaining the model’s decisions. We report a drastic decrease in inertia and a high correlation between offline and online metrics.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Intelligent Vehicles

access here

30/12/2023

VISIONE for newbies: an easier-to-use video retrieval system

access here

Author(s)

Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;

Institution

ISTI-CNR;

Abstract

This paper presents a revised version of the VISIONE video retrieval system, which offers a wide range of search functionalities, including free text search, spatial color and object search, visual and semantic similarity search, and temporal search. The system is designed to ensure scalability using advanced indexing techniques and effectiveness using cutting-edge Artificial Intelligence technology for visual content analysis. VISIONE was the runner-up in the 2023 Video Browser Showdown competition, demonstrating its comprehensive video retrieval capabilities. In this paper, we detail the improvements made to the search and browsing interface to enhance its usability for non-expert users.
A demonstration video of our system with the restyled interface, showcasing its capabilities on over 2,300 hours of diverse video content, is available online at {\url{https://youtu.be/srD3TCUkMSg}}.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Content-based Multimedia Indexing

access here

30/12/2023

Assessing the difficulty of predicting media memorability

access here

Author(s)

Andrei Cosmin Jitaru Bogdan Ionescu; Mihai Dogariu Mihai Gabriel Constantin

Institution

University Politehnica of Bucharest

Abstract

Memorability is a critical aspect of human cognition that has been studied extensively in various fields, including psychology, education, and computer vision. The ability to remember information and experiences over time is essential for learning, decision-making, and creating lasting impressions. While the number of computer vision works that attempt to predict the memorability score of videos has recently seen a significant boost, thanks to several benchmarking tasks and datasets, some questions related to the performance of automated systems on certain types of videos are still largely unexplored. Given this, we are interested in discerning what makes a video sample easy or hard to classify or predict from a memorability standpoint. In this paper, we use a large set of runs, created and submitted by the participants to the MediaEval Predicting Video Memorability task, and, using their results and a set of visual, object, and annotator-based features and analyses, we attempt to find and define common traits that make the memorability scores of videos hard or easy to predict.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Content-based Multimedia Indexing

access here

16/12/2023

FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs

access here

Author(s)

Ambrish Rawat; Anisa Halimi; Nathalie Baracaldo; Swanand Kadhe;

Institution

IBM Research;

Abstract

Training large language models (LLMs) is a costly endeavour in terms of time and computational resources. The large amount of training data used during the unsupervised pre-training phase makes it difficult to verify all data and, unfortunately, undesirable data may be ingested during training. Re-training from scratch is impractical and has led to the creation of the unlearning discipline where models are modified to “unlearn” undesirable information without retraining. However, any modification can alter the behaviour of LLMs, especially on key dimensions such as fairness. This is the first work that examines this interplay between unlearning and fairness for LLMs. In particular, we focus on a popular unlearning framework known as SISA [Bourtoule et al., 2021], which creates an ensemble of models trained on disjoint shards. We evaluate the performance-fairness trade-off for SISA, and empirically demsontrate that SISA can indeed reduce fairness in LLMs. To remedy this, we propose post-processing bias mitigation techniques for ensemble models produced by SISA. We adapt the post-processing fairness improvement technique from [Hardt et al., 2016] to design three methods that can handle model ensembles, and prove that one of the methods is an optimal fair predictor for ensemble of models. Through experimental results, we demonstrate the efficacy of our post-processing framework called FairSISA.

Access

Open Access

Type of Publication

Conference paper

Publisher

Socially Responsible Language Modelling Research

access here

13/12/2023

Argument-based Detection and Classification of Fallacies in Political Debates

access here

Author(s)

Elena Cabrio Mariana Chaves Pierpaolo Goffredo Serena Villata

Institution

CNRS Inria; Université Côte d'Azur;

Abstract

Fallacies are arguments that employ faulty reasoning. Given their persuasive and seemingly valid nature, fallacious arguments are often used in political debates. Employing these misleading arguments in politics can have detrimental consequences for society, since they can lead to inaccurate conclusions and invalid inferences from the public opinion and the policymakers. Automatically detecting and classifying fallacious arguments represents therefore a crucial challenge to limit the spread of misleading or manipulative claims and promote a more informed and healthier political discourse. Our contribution to address this challenging task is twofold. First, we extend the ElecDeb60To16 dataset of U.S. presidential debates annotated with fallacious arguments, by incorporating the most recent Trump-Biden presidential debate. We include updated token level annotations, incorporating argumentative components (i.e., claims and premises), the relations between these components (i.e., support and attack), and six categories of fallacious arguments (i.e., Ad Hominem, Appeal to Authority, Appeal to Emotion, False Cause, Slippery Slope, and Slogans). Second, we perform the twofold task of fallacious argument detection and classification by defining neural network architectures based on Transformers models, combining text, argumentative features, and engineered features. Our results show the advantages of complementing transformer-generated text representations with non-textual features.

Access

Open Access

Type of Publication

Conference paper

Publisher

Association for Computational Linguistics Empirical Methods in Natural Language Processing

access here

04/12/2023

Advancing Audio Phylogeny: A Neural Network Approach for Transformation Detection

access here

Author(s)

Luca Cuccovillo; Milica Gerhardt; Patrick Aichroth;

Institution

Fraunhofer IDMT;

Abstract

In this study we propose a novel approach to audio phylogeny, i.e. the detection of relationships and transformations within a set of near-duplicate audio items, by leveraging a deep neural network for efficiency and extensibility. Unlike existing methods, our approach detects transformations between nodes in one step, and the transformation set can be expanded by retraining the neural network without excessive computational costs. We evaluated our method against the state of the art using a self-created and publicly released dataset, observing a superior performance in reconstructing phylogenetic trees and heightened transformation detection accuracy. Moreover, the ability to detect a wide range of transformations and to extend the transformation set make the approach suitable for various applications.

Access

Open Access

Type of Publication

Conference paper

Publisher

IEEE International Workshop of Information Forensics and Security

access here

28/11/2023

Towards Evaluating Image Recommendations in Digital News and Media Ecosystem

access here

Author(s)

Christina Katsini; George E. Raptis; Vasilis Theodorou;

Institution

Human Opsis

Abstract

The News and Media landscape has undergone significant transformations in recent years, driven by the rise of new technologies and the widespread use of social media. This evolution introduces unique challenges for professionals working within this environment (e.g., journalists, content creators, and news authors), with a major one being the efficient sourcing of images that complement article content. In response to this challenge, we developed VIREO, a tool that recommends images based on textual content. In this paper, we make a step towards the practical effectiveness of VIREO’s core models in recommending images for real-world articles, with a specific focus on image recommendation efficiency. Our results indicate that VIREO offers a promising solution for professionals seeking to meet the evolving demands of the News and Media landscape while maintaining content quality and engagement.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Computer and Applications

access here

25/11/2023

A deep learning-based pipeline for whitefly pest abundance estimation on chromotropic sticky traps

access here

Author(s)

Angelo Canale; Fabrizio Falchi; Giovanni Benelli; Giuseppe Amato; Luca Ciampi; Luca Incrocci; Stefano Chessa; Valeria Zeni;

Institution

ISTI-CNR; University of Pisa

Abstract

Integrated Pest Management (IPM) is an essential approach used in smart agriculture to manage pest populations and sustainably optimize crop production. One of the cornerstones underlying IPM solutions is pest monitoring, a practice often performed by farm owners by using chromotropic sticky traps placed on insect hot spots to gauge pest population densities. In this paper, we propose a modular model-agnostic deep learning-based counting pipeline for estimating the number of insects present in pictures of chromotropic sticky traps, thus reducing the need for manual trap inspections and minimizing human effort. Additionally, our solution generates a set of raw positions of the counted insects and confidence scores expressing their reliability, allowing practitioners to filter out unreliable predictions. We train and assess our technique by exploiting PST – Pest Sticky Traps, a new collection of dot-annotated images we created on purpose and we publicly release, suitable for counting whiteflies. Experimental evaluation shows that our proposed counting strategy can be a valuable Artificial Intelligence-based tool to help farm owners to control pest outbreaks and prevent crop damages effectively. Specifically, our solution achieves an average counting error of approximately 9% compared to human capabilities requiring a matter of seconds, a large improvement respecting the time-intensive process of manual human inspections, which often take hours or even days.

Access

Open Access

Type of Publication

Journal article

Publisher

Ecological Informatics

access here

24/11/2023

Semantic Generative Augmentations for Few-Shot Counting

access here

Author(s)

Hervé Le Borgne; Michel Crucianu; Nicolas Audebert; Perla Doubinsky

Institution

CEA; Conservatoire National des Arts et Métiers;

Abstract

With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performances. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires to generate images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent and performing few-shot counting models on FSC147 and CARPK.

Access

Open Access

Type of Publication

Conference paper

Publisher

Winter Conference on Applications of Computer Vision

access here

23/11/2023

An Explainable Knowledge Graph-Based News Recommendation System

access here

Author(s)

Patrick Aichroth; Thomas Köllmer Zühal Kurt

Institution

Atilim University Fraunhofer IDMT;

Abstract

The paper outlines an explainable knowledge graph-based recommendation system that aims to provide personalized news recommendations and tries to explain why an item is recommended to a particular user. The system leverages a knowledge graph (KG) that models the relationships between items and users’ preferences, as well as external knowledge sources such as item features and user profiles. The main objectives of this study are to train a recommendation model that can predict whether a user will click on a news article or not, and then obtain the explainable recommendations for the same purpose. This is achieved with three steps: Firstly, KG of the MIND dataset are generated based on the history and, the clicked information of the users, the category and subcategory of the news. Then, the path reasoning approaches are utilized to reach explainable paths of recommended news/items. Thirdly, the proposed KG-based model is evaluated using MIND News data sets. Experiments have been conducted using the MIND-demo and MINDsmall datasets, which are the open-source English news datasets for public research scope. Experimental results indicate that the proposed approach performs better in terms of recommendation explainability, making it a promising basis for developing transparent and interpretable recommendation systems.

Access

Open Access

Type of Publication

Conference paper Publication

Publisher

Conference on Knowledge Discovery

access here

16/11/2023

Experimenting Task-specific LLMs

access here

Author(s)

Alberto Messina; Stefano Scotta;

Institution

RAI;

Abstract

In this work, we present an example of how a relatively small Large Language Model (LLM) fine-tuned to perform a simple and well defined task (assigning titles to news articles) could perform similarly or even better than huge LLMs which are created to respond to any question. This approach of specializing smaller LLMsonsimplertasksisalsointeresting because it goes in the direction of makingthis technology more sustainable and available to a higher number of entities that usually could not use these expensive models, both for economic and data policy reasons. We also present a couple of examples of how can be evaluated the performances of LLMs when the task is specified as in the example that we present in this work.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference of the Italian Association for Artificial Intelligence

access here

13/11/2023

ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models

access here

Author(s)

Albert Gatt Andrea Pedrotti; Anette Frank Aykut Erdem Emre Can Acikgoz Erkut Erdem Iacer Calixto Ilker Kesen Leticia Parcalabescu Michele Cafagna Mustafa Dogan

Institution

Hacettepe University Heidelberg University Institute of Information Science and Technologies Koç University University of Amsterdam; University of Malta University of Pisa Utrecht University;

Abstract

With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs’ grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.

Access

Open Access

Type of Publication

Conference paper

Publisher

access here

11/11/2023

DiffDefense: Defending Against Adversarial Attacks via Diffusion Models

access here

Author(s)

Alberto Del Bimbo; Hondamunige Prasanna Silva Lorenzo Seidenari;

Institution

University of Florence;

Abstract

This paper presents a novel reconstruction method that leverages Diffusion Models to protect machine learning classifiers against adversarial attacks, all without requiring any modifications to the classifiers themselves. The susceptibility of machine learning models to minor input perturbations renders them vulnerable to adversarial attacks. While diffusion-based methods are typically disregarded for adversarial defense due to their slow reverse process, this paper demonstrates that our proposed method offers robustness against adversarial threats while preserving clean accuracy, speed, and plug-and-play compatibility. Code at: https://github.com/HondamunigePrasannaSilva/DiffDefence.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

10/11/2023

Visual Identification of Volumetric Structures from 3D Medical CT Data Using Density and Visibility-based Quality Criteria

access here

Author(s)

Marius Gavrilescu;

Institution

Technical University of Iasi

Abstract

The identification of important structures from volume data is a challenging problem in information visualization due to the complexity and amount of detail found in volume data sets. In particular, medical imaging devices generate scans which contain a significant amount of important anatomical structures, some of which are hidden, occluded or otherwise difficult to highlight. Conventional density and gradient-based classification methods fail to uncover such structures, thereby creating the necessity for more elaborate visualization methods and the involvement of multiple visual criteria in order to generate quality representations of the volume data. We propose a volume visualization approach which extends the conventional rendering pipeline by incorporating visibility-based quality criteria into the color and opacity mapping process. Our method consists in using two stacked transfer functions which handle visual mappings: one based on the density domain of the data set, and the other on a custom metric which quantifies the visibility of volumetric structures. We show that this arrangement allows the generation of improved representations of meaningful hidden structures from medical CT data, while constituting a reliable means of identifying volumetric details not representable using traditional approaches.

Access

Open Access

Type of Publication

Conference paper

Publisher

E-Health and Bioengineering Conference 2023

access here

29/10/2023

A Study on the Use of Attention for Explaining Video Summarization

access here

Author(s)

Evlampios Apostolidis; Ioannis Patras; Vasileios Mezaris

Institution

CERTH - Center for Research and Technology Hellas Queen Mary University of London;

Abstract

In this paper we present our study on the use of attention for explaining video summarization. We build on a recent work that formulates the task, called XAI-SUM, and we extend it by: a) taking into account two additional network architectures and b) introducing two novel explanation signals that relate to the entropy and diversity of attention weights. In total, we examine the effectiveness of seven types of explanation, using three state-of-the-art attention-based network architectures (CA-SUM, VASNet, SUM-GDA) and two datasets (SumMe, TVSum) for video summarization. The conducted evaluations show that the inherent attention weights are more suitable for explaining network architectures which integrate mechanisms for estimating attentive diversity (SUM-GDA) and uniqueness (CA-SUM). The explanation of simpler architectures (VASNet) can benefit from taking into account estimates about the strength of the input vectors, while another option is to consider the entropy of attention weights.

Access

Open Access

Type of Publication

Conference paper

Publisher

ACM Multimedia

access here

29/10/2023

4DSR-GCN: 4D Video Point Cloud Upsampling using Graph Convolutional Networks

access here

Author(s)

Alberto Del Bimbo; Lorenzo Berlincioni; Marco Bertini; Stefano Berretti

Institution

Università degli Studi di Firenze; University of Florence;

Abstract

Time-varying sequences of 3D point clouds, or 4D point clouds, are now being acquired at an increasing pace in several applications (personal avatar representation, LiDAR in autonomous or assisted driving). In many cases, such volume of data is transmitted, thus requiring that proper compression tools are applied to either reduce the resolution or the bandwidth. In this paper, we propose a new solution for upscaling and restoration of time-varying 3D video point clouds after they have been heavily compressed. Our model consists of a specifically designed Graph Convolutional Network that combines Dynamic Edge Convolution and Graph Attention Networks for feature aggregation in a Generative Adversarial setting. We present a different way to sample dense point clouds with the intent to make these modules work in synergy to provide each node with enough features about its neighbourhood in order to later on generate new vertices. Compared to other solutions in the literature that address the same task, our proposed model is capable of obtaining comparable results in terms of quality of the reconstruction, while using a substantially lower number of parameters (\simeq 300KB), making our solution deployable in edge computing devices.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

27/10/2023

An Open Dataset of Synthetic Speech

access here

Author(s)

Artem Yaroshchuk; Christoforos Papastergiopoulos Dimitrios Tzovaras; Konstantinos Votis; Luca Cuccovillo; Patrick Aichroth;

Institution

CERTH - Center for Research and Technology Hellas Fraunhofer IDMT;

Abstract

This paper introduces a multilingual, multispeaker dataset composed of synthetic and natural speech, designed to foster research and benchmarking in synthetic speech detection. The dataset encompasses 18,993 audio utterances synthesized from text, alongside with their corresponding natural equivalents, representing approximately 17 hours of synthetic audio data. The dataset features synthetic speech generated by 156 voices spanning three languages, namely, English, German, and Spanish, with a balanced gender representation. It targets state-of-the-art synthesis methods, and has been released with a license allowing seamless extension and redistribution by the research community.

Access

Open Access

Type of Publication

Conference paper

Publisher

IEEE International Workshop of Information Forensics and Security

access here

27/10/2023

Vec2Doc: transforming dense vectors into sparse representations for efficient information retrieval

access here

Author(s)

Claudio Gennaro; Fabio Carrara; Giuseppe Amato; Lucia Vadicamo;

Institution

ISTI-CNR;

Abstract

The rapid development of deep learning and artificial intelligence has transformed our approach to solving scientific problems across various domains, including computer vision, natural language processing, and automatic content generation. Information retrieval (IR) has also experienced significant advancements, with natural language understanding and multimodal content analysis enabling accurate information retrieval. However, the widespread adoption of neural networks has also influenced the focus of IR problem-solving, which nowadays predominantly relies on evaluating the similarity of dense vectors derived from the latent spaces of deep neural networks. Nevertheless, the challenges of conducting similarity searches on large-scale databases with billions of vectors persist. Traditional IR approaches use inverted indices and vector space models, which work well with sparse vectors. In this paper, we propose Vec2Doc, a novel method that converts dense vectors into sparse integer vectors, allowing for the use of inverted indices. Preliminary experimental evaluation shows a promising solution for large-scale vector-based IR problems.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Similarity Search and Applications

access here

25/10/2023

Explainable Video Summarization for Advancing Media Content Production

access here

Author(s)

Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris

Institution

CERTH - Center for Research and Technology Hellas Queen Mary University of London;

Abstract

This chapter focuses on explainable video summarization, a technology that could significantly advance the content production workflow of Media organizations. It starts by presenting the current state of the art in the fields of deep-learning-based video summarization and explainable video analysis and understanding. Following, it focuses on video summarization methods that rely on the use of attention mechanisms and reports on previous works that investigated the use of attention for explaining the outcomes of deep neural networks. Subsequently, it briefly describes a state-of-the-art attention-based architecture for unsupervised video summarization and discusses a recent work that examines the use of various attention-based signals for explaining the outcomes of video summarization. Finally, it provides recommendations about future research directions.

Access

Open Access

Type of Publication

Book section

Publisher

Encyclopedia of Information Science and Technology

access here

20/10/2023

ProGAP: Progressive Graph Neural Networks with Differential Privacy Guarantees

access here

Author(s)

Daniel Gatica-Perez; Sina Sajadmanesh

Institution

Idiap Research Institute

Abstract

Graph Neural Networks (GNNs) have become a popular tool for learning on graphs, but their widespread use raised privacy concerns as graph data can contain personal or sensitive information. Differentially private GNN models have been recently proposed to preserve privacy while still allowing for effective learning over graph-structured datasets. however, achieving an ideal balance between accuracy and privacy in GNNs remains challenging due to the intrinsic structural connectivity of graphs. in this paper, we propose a new differentially private GNN called ProGAP that uses a progressive training scheme to improve such accuracy-privacy trade-offs. Combined with the aggregation perturbation technique to ensure differential privacy, ProGAP splits a GNN into a sequence of overlapping submodels that are trained progressively, expanding from the first submodel to the complete model. Specifically, each submodel is trained over the privately aggregated node embeddings learned and cached by the previous submodels, leading to an increased expressive power compared to previous approaches while limiting the incurred privacy costs. We formally prove that ProGAP ensures edge-level and node-level privacy guarantees for both training and inference stages, and evaluate its performance on benchmark graph datasets. Experimental results demonstrate that ProGAP can achieve up to 5-10% higher accuracy than existing state-of-the-art differentially private GNNs. Our code is available at https://github.com/sisaman/ProGAP.

Access

Open Access

Type of Publication

Publication

Publisher

N/A

access here

18/10/2023

Regularization-Based Methods for Ordinal Quantification

access here

Author(s)

Alejandro Moreo; Fabrizio Sebastiani; Martin Senz; Mirko Bunse;

Institution

ISTI-CNR; University of Dortmund;

Abstract

Quantification, i.e., the task of training predictors of the class prevalence values in sets of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multiclass problems in which the classes are not ordered. Here, we study the ordinal case, i.e., the case in which a total order is defined on the set of n > 2 classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each others’ developments. Third, we propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments. The key to this gain in performance is that our regularization prevents ordinally implausible estimates, assuming that ordinal distributions tend to be smooth in practice. We informally verify this assumption for several real-world applications.

Access

Open Access

Type of Publication

Publication

Publisher

Data Mining and Knowledge Discovery

access here

13/10/2023

Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

access here

Author(s)

Fabrizio Falchi; Jan Sedmidubsky; Nicola Messina; Tomás Rebok;

Institution

ISTI-CNR; Masaryk University;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

13/10/2023

Representation of Coherent Structures from Volume Data Using Quality-oriented Features and Genetic Optimization

access here

Author(s)

Florin Leon; Marius Gavrilescu; Sabina-Adriana Floria;

Institution

Technical University of Iasi

Abstract

Representing relevant information from volume data sets is a problem often faced in visualization. Generating meaningful images from highly-complex volume data sets is a challenging, tedious task requiring specialized knowledge of the distribution and properties of the data. Traditionally, this task has been carried out manually via specialized user interfaces. We propose a volume visualization pipeline which facilitates the automatic generation of high-quality images from volume data sets. Our method involves a direct volume renderer which generates images from volume data based on visual mappings provided by a transfer function. Central to our approach is a quality-focused descriptor which exploits the properties of the distribution of gradient orientations of an alpha-bounded surface within the volume. This feature is useful for determining transfer functions that result in the rendering of corresponding images depicting various details from the volume. We show that by using this feature as an optimization objective, the generation of high quality images can be automated. Using simple genetic algorithms, we can automatically generate sets of images illustrating coherent, easily-distinguishable and high-quality surfaces of relevant structures from volume data.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on System Theory

access here

12/10/2023

Improving Query and Assessment Quality in Text-Based Video Retrieval Evaluation

access here

Author(s)

Anastasios Gkagkas; Davide Alessandro Coccomini; Gylfi Þór Guðmundsson; Jakub Lokoč; Jiaxin Wu; Nick Pantelidis; Nicola Messina; Rahel Arnold; Silvan Heller; Vera Benz; Werner Bailer;

Institution

CERTH - Center for Research and Technology Hellas Charles University; City University Hong Kong; ISTI-CNR; Joanneum Research; Reykjavik University; University of Basel;

Abstract

Different task interpretations are a highly undesired element in interactive video retrieval evaluations. When a participating team focuses partially on a wrong goal, the evaluation results might become partially misleading. In this paper, we propose a process for refining known-item and open-set type queries, and preparing the assessors that judge the correctness of submissions to open-set queries. Our findings from recent years reveal that a proper methodology can lead to objective query quality improvements and subjective participant satisfaction with query clarity.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Multimedia Retrieval

access here

09/10/2023

Predicting Player Engagement in Tom Clancy’s The Division 2: A Multimodal Approach via Pixels and Gamepad Actions

access here

Author(s)

David Renaudie; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas; Matthew Barthet Mike Thomsen;

Institution

Massive Entertainment Ubisoft University of Malta

Abstract

This paper introduces a large scale multimodal corpus collected for the purpose of analysing and predicting player engagement in commercial-standard games. The corpus is solicited from 25 players of the action role-playing game Tom Clancy’s The Division 2, who annotated their level of engagement using a time-continuous annotation tool. The cleaned and processed corpus presented in this paper consists of nearly 20 hours of annotated gameplay videos accompanied by logged gamepad actions. We report preliminary results on predicting long-term player engagement based on in-game footage and game controller actions using Convolutional Neural Network architectures. Results obtained suggest we can predict the player engagement with up to accuracy on average ( at best) when we fuse information from the game footage and the player’s controller input. Our findings validate the hypothesis that long-term (i.e. 1 hour of play) engagement can be predicted efficiently solely from pixels and gamepad actions.

Access

Open Access

Type of Publication

Paper

Publisher

Conference on Multimodal Interaction

access here

08/10/2023

Selecting a Diverse Set of Aesthetically-pleasing and Representative Video Thumbnails using Reinforcement Learning

access here

Author(s)

Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris

Institution

CERTH - Center for Research and Technology Hellas Queen Mary University of London;

Abstract

This paper presents a new reinforcement-based method for video thumbnail selection (called RL-DiVTS), that relies on estimates of the aesthetic quality, representativeness and visual diversity of a small set of selected frames, made with the help of tailored reward functions. The proposed method integrates a novel diversity-aware Frame Picking mechanism that performs a sequential frame selection and applies a reweighting process to demote frames that are visually-similar to the already selected ones. Experiments on two benchmark datasets (OVP and YouTube), using the top-3 matching evaluation protocol, show the competitiveness of RL-DiVTS against other SoA video thumbnail selection and summarization approaches from the literature.

Access

Open Access

Type of Publication

Paper

Publisher

IEEE International Conference on Image Processing

access here

02/10/2023

Leveraging Visual Attention for out-of-distribution Detection

access here

Author(s)

Alberto Del Bimbo; Lorenzo Seidenari; Luca Cultrera;

Institution

University of Florence;

Abstract

Out-of-Distribution (OOD) detection is a crucial challenge in computer vision, especially when deploying machine learning models in the real world. In this paper, we propose a novel OOD detection method leveraging Visual Attention Heatmaps from a Vision Transformer (ViT) classifier. Our approach involves training a Convolutional Autoencoder to reconstruct attention heatmaps produced by a ViT classifier, enabling accurate image reconstruction and effective OOD detection. Moreover, our method does not require additional labels during training, ensuring efficiency and ease of implementation. We validate our approach on a standard OOD benchmark using CIFAR10 and CIFAR100. To test OOD in a real-world setting we also collected a novel dataset: WildCapture. Our new dataset comprises more than 60k wild animal shots, from 15 different wildlife species, taken via phototraps in varying lighting conditions. The dataset is fully annotated with animal bounding boxes and species.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

30/09/2023

A Report on Modern Evolutionary Methods for Quality-driven Optimization in Information Visualization, Image Processing and Design Problems

access here

Author(s)

Cristian-Nicolae Butincu; Florin Leon; Lavinia-Eugenia Ferariu; Marius Gavrilescu;

Institution

Technical University of Iasi

Abstract

This report describes our research and documentation efforts in searching and analyzing the related literature for existing applications of evolutionary algorithms for quality-oriented optimization. We present our findings in terms of multiple relevant results from the related state of the art. We mainly divide the results into two broad categories: classic single- and multi-objective optimization, and quality-diversity (QD) methods. While we mostly focus on evolutionary optimization applied in visualization and image-processing, we also present some results from other fields which we considered relevant. This report was originally submitted as documentation for the deliverables of the VolEvol project.

Access

Open Access

Type of Publication

Report

Publisher

N/A

access here

30/09/2023

The Emotions of the Crowd: Learning Image Sentiment from Tweets via Cross-modal Distillation

access here

Author(s)

Fabio Carrara; Fabrizio Falchi; Maurizio Tesconi;

Institution

ISTI-CNR; University of Pisa

Abstract

Trends and opinion mining in social media increasingly focus on novel interactions involving visual media, like images and short videos, in addition to text.

In this work, we tackle the problem of visual sentiment analysis of social media images — specifically, the prediction of image sentiment polarity. While previous work relied on manually labeled training sets, we propose an automated approach for building sentiment polarity classifiers based on a cross-modal distillation paradigm; starting from scraped multimodal (text + images) data, we train a student model on the visual modality based on the outputs of a textual teacher model that analyses the sentiment of the corresponding textual modality.

We applied our method to randomly collected images crawled from Twitter over three months and produced, after automatic cleaning, a weakly-labeled dataset of ∼1.5 million images. Despite exploiting noisy labeled samples, our training pipeline produces classifiers showing strong generalization capabilities and outperforming the current state of the art on five manually labeled benchmarks for image sentiment polarity prediction.

Access

Open Access

Type of Publication

Publication

Publisher

ECAI - European Conference on Artificial Intelligence

access here

28/09/2023

Eliciting and Annotating Emotion in Virtual Spaces

access here

Author(s)

Ali Najm; Antonios Liapis; Despina Michael-Grigoriou; Emmanouil Xylakis; Georgios N. Yannakakis;

Institution

Cyprus University of Technology; University of Malta

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

22/09/2023

Knowledge Distillation-driven Communication Framework for Neural Networks: Enabling Efficient Student-Teacher Interactions

access here

Author(s)

Ioanna Valsamara; Ioannis Pitas;

Institution

Aristotle University of Thessaloniki;

Abstract

Access

Open Access

Type of Publication

Preprint

Publisher

N/A

access here

22/09/2023

Deep Reinforcement Learning with semi-expert distillation for autonomous UAV cinematography

access here

Author(s)

Andreas Sochopoulos; Evangelos Charalampakis; Ioannis Mademlis; Ioannis Pitas; Sotirios Papadopoulos

Institution

Aristotle University of Thessaloniki;

Abstract

Access

Open Access

Type of Publication

Preprint

Publisher

N/A

access here

22/09/2023

Towards Human Society-inspired Decentralized DNN Inference

access here

Author(s)

Dimitrios Papaioannou; Ioannis Pitas; Vasileios Mygdalis

Institution

Aristotle University of Thessaloniki;

Abstract

Access

Open Access

Type of Publication

Preprint

Publisher

N/A

access here

22/09/2023

Political Tweet Sentiment Analysis For Public Opinion Polling

access here

Author(s)

Anestis Kaimakamadis; Ioannis Pitas;

Institution

Aristotle University of Thessaloniki;

Abstract

Access

Open Access

Type of Publication

Preprint

Publisher

N/A

access here

22/09/2023

Quantifying the knowledge in Deep Neural Networks: an overview

access here

Author(s)

Ioanna Valsamara; Ioannis Mademlis; Ioannis Pitas;

Institution

Aristotle University of Thessaloniki;

Abstract

Access

Open Access

Type of Publication

Preprint

Publisher

N/A

access here

22/09/2023

GreekPolitics: Sentiment Analysis on Greek Politically Charged Tweets

access here

Author(s)

Emmanouil Krasanakis; Ioanna Valsamara; Ioannis Mademlis; Ioannis Pitas;

Institution

Aristotle University of Thessaloniki;

Abstract

Access

Open Access

Type of Publication

Preprint

Publisher

N/A

access here

20/09/2023

Wasserstein Loss for Semantic Editing in the Latent Space of GANs

access here

Author(s)

Hervé Le Borgne; Michel Crucianu; Nicolas Audebert; Perla Doubinsky

Institution

CNAM; Université Paris-Saclay;

Abstract

The latent space of GANs contains rich semantics reflecting the training data. Different methods propose to learn edits in latent space corresponding to semantic attributes, thus allowing to modify generated images. Most supervised methods rely on the guidance of classifiers to produce such edits. However, classifiers can lead to out-of-distribution regions and be fooled by adversarial samples. We propose an alternative formulation based on the Wasserstein loss that avoids such problems, while maintaining performance on-par with classifier-based approaches. We demonstrate the effectiveness of our method on two datasets (digits and faces) using StyleGAN2.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

18/09/2023

Proceedings of the 3rd International Workshop on Learning to Quantify (LQ 2023)

access here

Author(s)

Alejandro Moreo; Fabrizio Sebastiani; Mirko Bunse; Pablo González

Institution

Consiglio Nazionale delle Ricerche; University of Oviedo

Abstract

Access

Open Access

Type of Publication

Book

Publisher

N/A

access here

16/09/2023

Automatic Generation of Images from Hurricane Volume Data Through Evolutionary Optimization

access here

Author(s)

Florin Leon; Marius Gavrilescu;

Institution

Technical University of Iasi

Abstract

The study of hurricanes through information visualization and visual analysis is useful for tracking and understanding the behavior and impact of such hazardous natural phenomena. Images obtained from data commonly acquired through meteorological radar provide scientists with a visual representation of the storm’s characteristics, such as its location, size, and intensity. Such information is useful for forecasting, decision making in disaster management and environmental and human health risk assessment. Visual representations of such phenomena can help emergency responders and policymakers make informed decisions about evacuations, disaster response, and resource allocation. In this context, we propose an automated means of generating representations from complex 3D datasets obtained from meteorological radar scans of regions affected by hurricanes, illustrating the geometry and spatial features of such phenomena.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Environmental Engineering and Management

access here

14/09/2023

Temporal Normalization in Attentive Key-frame Extraction for Deep Neural Video Summarization

access here

Author(s)

Ioannis Mademlis; Ioannis Pitas; Michail Kaseris

Institution

Aristotle University of Thessaloniki;

Abstract

Access

Open Access

Type of Publication

Preprint

Publisher

N/A

access here

13/09/2023

DDSP-Piano: A Neural Sound Synthesizer Informed by Instrument Knowledge

access here

Author(s)

Axel Roebel; Lenny Renault; Rémi Mignot;

Institution

Sorbonne Université

Abstract

Access

Open Access

Type of Publication

Journal article

Publisher

Journal of the Audio Engineering Society

access here

06/09/2023

A survey of manifold learning and its applications for multimedia

access here

Author(s)

Hannes Fassold;

Institution

Joanneum Research;

Abstract

Manifold learning is an emerging research domain of machine learning. In this work, we give an introduction into manifold learning and how it is employed for important application fields in multimedia.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Video and Signal Processing

access here

05/09/2023

MC-GTA: A Synthetic Benchmark for Multi-Camera Vehicle Tracking

access here

Author(s)

Claudio Gennaro; Fabrizio Falchi; Gaetano Emanuele Valenti; Giuseppe Amato; Luca Ciampi; Nicola Messina;

Institution

ISTI-CNR; University of Pisa

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Image Analysis and Processing

access here

05/09/2023

An Optimized Pipeline for Image-Based Localization in Museums from Egocentric Images

access here

Author(s)

Antonino Furnari; Claudio Gennaro; Fabrizio Falchi; Giovanni Maria Farinella; Nicola Messina;

Institution

ISTI-CNR; University of Catania;

Abstract

Access

Open Access

Type of Publication

Journal article

Publisher

Conference on Image Analysis and Processing

access here

04/09/2023

Expressive Piano Performance Rendering from Unpaired Data

access here

Author(s)

Axel Roebel; Lenny Renault; Rémi Mignot;

Institution

Sorbonne Université

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Digital Audio Effects

access here

01/09/2023

ISF-GAN: An Implicit Style Function for High Resolution Image-to-Image Translation

access here

Author(s)

Bruno Lepri; Linchao Bao; Marco de Nadai; Nicu Sebe; Yahui Liu Yajing Chen;

Institution

FBK; Tencent AI Lab University of Trento;

Abstract

Access

Closed Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Multimedia

access here

31/08/2023

Criteria for the Objective Assessment of Quality and Diversity in Volume Visualization

access here

Author(s)

Marius Gavrilescu;

Institution

Technical University of Iasi

Abstract

Objective quality assessment in volume visualization is a crucial process aimed at quantifying the quality of rendered volumetric images or animations using measurable metrics and algorithms. This approach is essential to ensure that the visualizations accurately represent the underlying data and meet specific quality standards. The assessment of quality in computer graphics, visualization and image processing is a complex task, particularly due to the number of scenarios, use cases and problems encountered in the aforementioned fields, and also due to the subjective nature of quality. To this extent, we search for methods, algorithms and metrics that can be used by an optimizer to search for rendering parameters such that the resulting images adhere to our formulations on what constitutes quality. At the same time, similar metrics can be exploited such that the space of possible parameters can be more thoroughly explored, resulting in populations of images exhibiting diverse content. This document presents our findings in terms of approaches that constitute good candidates for quality and diversity criteria, to be used as objectives and/or for defining feature spaces when automatically generating images from volume data. This report was originally submitted as documentation for the deliverables of the VolEvol project.

Access

Open Access

Type of Publication

Report

Publisher

N/A

access here

30/08/2023

Knowing Your Annotator: Rapidly Testing the Reliability of Affect Annotation

access here

Author(s)

Antonios Liapis; Chintan Triverdi; Emmanouil Xylakis; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas; Matthew Barthet

Institution

University of Malta

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Affective Computing and Intelligent Interaction Workshops and Demos

access here

28/08/2023

Towards Enhancing the Media Industry through AI-Driven Image Recommendations

access here

Author(s)

Christina Katsini; George E. Raptis; Vasilis Theodorou;

Institution

Human Opsis

Abstract

In a fast-changing media ecosystem, professionals and enterprises in the News and Media industry face new challenges that they should address to maximize their productivity and improve their services. The rise of alternative news sources, such as social media, the leading news source, especially for young people, has led to emerging requirements in the News and Media industry. A core requirement is publishing articles as fast as possible on various platforms, combining visual and textual content. Accompanying news with images raises the readers’ interest, improves engagement, and recall. Therefore, the News and Media industry professionals must adapt their publication strategies to meet this requirement and the media consumers’ expectations. However, the selection of the appropriate images is a time-consuming and manual task. Towards this direction, we propose VIREO, which addresses this challenge by providing professionals (e.g., journalists) with an integrated digital solution that automatically recommends a collection of images that could accompany an article. VIREO implements text and image analysis and matching processes leveraging AI techniques in real time to achieve this. VIREO aims to benefit both professionals (e.g., journalists) by suggesting appealing images that accompany the textual content of their articles and create breath-taking stories and the media consumers (e.g., readers) by delivering an enhanced reading experience, engagement, and recall.

Access

Open Access

Type of Publication

Conference paper

Publisher

Human-Computer Interaction

access here

25/08/2023

Matching Pairs: Attributing Fine-Tuned Models to their Pre-Trained Large Language Models

access here

Author(s)

Ambrish Rawat; Gabriele Picco; Giulio Zizzo; Myles Foley; Taesung Lee; Yufang Hou;

Institution

IBM Research; Imperial College London;

Abstract

The wide applicability and adaptability of large language models (LLM) has enabled their rapid adoption. While the pre-trained models can perform many tasks, such models are often fine-tuned to improve their performance. However, this leads to issues over violation of model licenses, model theft, and copyright infringement. Moreover, recent advances show that generative technology is capable of producing harmful content which exacerbates the problems of accountability within model supply chains. Thus, we need a method to investigate how a model was trained or a piece of text was generated and what their source pre-trained model was. In this paper we take a first step to addressing this open problem by tracing back the origin of a given fine-tuned LLM to its corresponding pre-trained base model. We consider different knowledge levels and attribution strategies, and find that we are able to trace back to the original base model with an AUC of 0.804.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

25/08/2023

Prompting Visual-Language Models for Dynamic Facial Expression Recognition

access here

Author(s)

Ioannis Patras; Zengqun Zhao;

Institution

Queen Mary University of London;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

24/08/2023

Interactive Video Retrieval in the Age of Effective Joint Embedding Deep Models: Lessons from the 11th VBS

access here

Author(s)

Aaron Duane; Cathal Gurrin; Florian Spiess; Jakub Lokoč; Klaus Schoeffmann; Konstantin Schall; Ladislav Peška; Loris Sauter; Luca Rossetto; Lucia Vadicamo; Nicola Messina; Omar Shahbaz Khan; Stefanos Vrochidis; Stelios Andreadis; Thao-Nhu Nguyen; Werner Bailer; Zhixin Ma;

Institution

CERTH - Center for Research and Technology Hellas Charles University; Dublin City University; HTW Berlin; ISTI-CNR; IT University of Copenhagen; Joanneum Research; Klagenfurt University; Singapore Management University; University of Basel; University of Copenhagen; University of Zurich;

Abstract

This paper presents the findings of the eleventh Video Browser Showdown competition, where sixteen teams competed in known-item and ad-hoc search tasks. Many of the teams utilized state-of-the-art video retrieval approaches that demonstrated high effectiveness in challenging search scenarios. In the paper, a broad survey of all utilized approaches is presented in connection
with an analysis of the performance of participating teams. Specifically, both high-level performance indicators are presented with overall statistics as well as an in-depth analysis of the performance of selected tools implementing result set logging. The analysis reveals evidence that the CLIP model represents a versatile tool for cross-modal video retrieval when combined with interactive search capabilities. Furthermore, the analysis investigates the effect of different users and text query properties on the performance in search tasks. Last but not least, lessons learned from search task preparation are presented, and a new direction for adhoc search based tasks at Video Browser Showdown is introduced.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

23/08/2023

Exploiting One-class classification optimization objectives for increasing adversarial robustness

access here

Author(s)

Ioannis Pitas; Vasileios Mygdalis

Institution

Aristotle University of Thessaloniki;

Abstract

This work examines the problem of increasing the robustness of deep neural network-based image classification systems to adversarial attacks, without changing the neural architecture or employ adversarial examples in the learning process. We attribute their famous lack of robustness to the geometric properties of the deep neural network embedding space, derived from standard optimization options, which allow minor changes in the intermediate activation values to trigger dramatic changes to the decision values in the final layer. To counteract this effect, we explore optimization criteria that supervise the distribution of the intermediate embedding spaces, in a class-specific basis, by introducing and leveraging one-class classification objectives. The proposed learning procedure compares favorably to recently proposed training schemes for adversarial robustness in black-box adversarial attack settings.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

23/08/2023

Properties of learning Multiplicative Universal Adversarial Perturbations in image data

access here

Author(s)

Alexandros Zamichos; Ioannis Pitas; Vasileios Mygdalis

Institution

Aristotle University of Thessaloniki;

Abstract

Adversarial attacks in image classification are optimization problems that estimate the minimum perturbation required for a single input image, so the neural network misclassifies it. Universal adversarial perturbations are adversarial attacks that target a whole dataset, estimated by e.g., accumulating the perturbations for each image using standard adversarial attacks. This work treats the universal adversarial perturbation as a problem of transformation estimation. As such, we propose to learn an iterative transformation that maps “clean” images to a “perturbed” domain, by exploiting adversarial attacks. Our experiments show that the proposed formulation leads to easy generation of the adversarial perturbation, while it introduces less noise in the perturbed images, when compared to the state-of-the-art. Finally, this formulation allows us to explore additional properties, notably reversibility of the transformation and attainability of the transformation by using dataset samples.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

23/08/2023

AdvRevGan: On Reversible Universal Adversarial Attacks for privacy protection applications

access here

Author(s)

Ioannis Pitas; Stefania Altini; Vasileios Mygdalis

Institution

Aristotle University of Thessaloniki;

Abstract

Different adversarial attack methods have been proposed in the literature, mainly focusing on attack efficiency and visual quality, e.g., similarity with the non-adversarial examples. These properties enable the use of adversarial attacks for privacy protection against automated classification systems, while maintaining utility for human users. In this paradigm, when privacy restrictions are lifted, access to the original data should be restored, for all stakeholders. This paper addresses exactly this problem. Existing adversarial attack methods cannot reconstruct the original data from the adversarial ones, leading to significant storage overhead for all privacy applications. To solve this issue, we propose AdvRevGAN, a novel Neural Network architecture that generates reversible adversarial examples. We evaluate our approach in classification problems, where we examine the case where adversarial attacks are constructed by a neural network, while the original images are reconstructed using the reverse transformation from the adversarial examples. We show that adversarial attacks using this approach maintain and even increase their efficiency, while the classification accuracy of the model in the reconstructed data can almost totally be restored.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

23/08/2023

Real-time object geopositioning from monocular target detection/tracking for aerial cinematography

access here

Author(s)

Daniel Aláez; Ioannis Pitas; Jesús Villadangos; Vasileios Mygdalis

Institution

Aristotle University of Thessaloniki; University of Navarre;

Abstract

In recent years, the field of automated aerial cinematography has seen a significant increase in demand for real-time 3D target geopositioning for motion and shot planning. To this end, many of the existing cinematography plans require the use of complex sensors that need to be equipped on the subject or rely on external motion systems. This work addresses this problem by combining monocular visual target detection and tracking with a simple ground intersection model. Under the assumption that the targets to be filmed typically stand on the ground, 3D target localization is achieved by estimating the direction and the norm of the look-at vector. The proposed algorithm employs an error estimation model that accounts for the error in detecting the bounding box, the height estimation errors, and the uncertainties of the pitch and yaw angles. This algorithm has been fully implemented in a heavy-lifting aerial cinematography hexacopter, and its performance has been evaluated through experimental flights. Results show that typical errors are within 5 meters of absolute distance and 3 degrees of angular error for distances to the target of around 100 meters.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

20/07/2023

HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces

access here

Author(s)

Christos Tzelepis; Georgios Tzimiropoulos; Ioannis Patras; Stella Bounareli; Vasileios Argyriou;

Institution

Kingston University London; Queen Mary University of London;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

09/07/2023

MAD-TSC: A Multilingual Aligned News Dataset for Target-dependent Sentiment Classification

access here

Author(s)

Adrian Popescu; Armelle Brun; Evan Dufraisse; Jérôme Deshayes-Chossart; Julien Tourille;

Institution

Université de Lorraine; Université Paris-Saclay;

Abstract

Target-dependent sentiment classification (TSC) enables a fine-grained automatic analysis of sentiments expressed in texts.
Sentiment expression varies depending on the domain, and it is necessary to create domain-specific datasets.
While socially important, TSC in the news domain remains relatively understudied.
We introduce MAD-TSC, the first multilingual aligned dataset designed for TSC in news. MAD-TSC differs substantially from existing resources.
First, it includes aligned examples in eight languages to facilitate a comparison of performance for individual languages, and a direct comparison of human and machine translation.
Second, the dataset is sampled from a diversified parallel news corpus, and is diversified in terms of news sources and geographic spread of entities.
Finally, MAD-TSC is more challenging than existing datasets because its samples are more complex.
We exemplify the use of MAD-TSC with comprehensive monolingual and multilingual experiments.
The latter shows that machine translations can successfully replace manual ones, and that performance for all included languages can match that of English by automatically translating test examples.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Computational Linguistics N/A

access here

07/07/2023

Audio Splicing Detection and Localization Based on Acquisition Device Traces

access here

Author(s)

Daniele Ugo Leonzio; Luca Cuccovillo; Marco Marcon; Paolo Bolettieri; Patrick Aichroth; Stefano Tubaro;

Institution

Fraunhofer IDMT; Politecnico di Milano;

Abstract

In recent years, the multimedia forensic community has put a great effort in developing solutions to assess the integrity and authenticity of multimedia objects, focusing especially on manipulations applied by means of advanced deep learning techniques. However, in addition to complex forgeries as the deepfakes, very simple yet effective manipulation techniques not involving any use of state-of-the-art editing tools still exist and prove dangerous. This is the case of audio splicing for speech signals, i.e., to concatenate and combine multiple speech segments obtained from different recordings of a person in order to cast a new fake speech. Indeed, by simply adding a few words to an existing speech we can completely alter its meaning. In this work, we address the overlooked problem of detection and localization of audio splicing from different models of acquisition devices. Our goal is to determine whether an audio track under analysis is pristine, or it has been manipulated by splicing one or multiple segments obtained from different device models. Moreover, if a recording is detected as spliced, we identify where the modification has been introduced in the temporal dimension. The proposed method is based on a Convolutional Neural Network (CNN) that extracts model-specific features from the audio recording. After extracting the features, we determine whether there has been a manipulation through a clustering algorithm. Finally, we identify the point where the modification has been introduced through a distance-measuring technique. The proposed method allows to detect and localize multiple splicing points within a recording.

Access

Open Access

Type of Publication

Journal article

Publisher

Multimedia FORensics in the WILD

access here

03/07/2023

Explaining autonomous driving with visual attention and end-to-end trainable region proposals

access here

Author(s)

Alberto Del Bimbo; Federico Becattini; Lorenzo Seidenari; Luca Cultrera;

Institution

University of Florence;

Abstract

Autonomous driving is advancing at a fast pace, with driving algorithms becoming more and more accurate and reliable. Despite this, it is of utter importance to develop models that can offer a certain degree of explainability in order to be trusted, understood and accepted by researchers and, especially, society. In this work we present a conditional imitation learning agent based on a visual attention mechanism in order to provide visually explainable decisions by design. We propose different variations of the method, relying on end-to-end trainable regions proposal functions, generating regions of interest to be weighed by an attention module. We show that visual attention can improve driving capabilities and provide at the same time explainable decisions.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

01/07/2023

Orthogonal SVD Covariance Conditioning and Latent Disentanglement

access here

Author(s)

Nicu Sebe; Wei Wang; Yue Song

Institution

Beijing Jiaotong University; University of Trento;

Abstract

Access

Closed Access

Type of Publication

Journal article

Publisher

IEEE Transaction on Pattern Analysis and Machine Intelligence

access here

01/07/2023

100-Driver: A Large-scale, Diverse Dataset for Distracted Driver Classification

access here

Author(s)

Fang Li; Jing Wang; Jun Zhang; Wengjing Li; Zhongcheng Wu; Zhun Zhong

Institution

Chinese Academy of Science; University of Trento;

Abstract

Access

Closed Access

Type of Publication

Journal article

Publisher

Transactions on Intelligent Transportation Systems;

access here

01/07/2023

Latent Traversals in Generative Models as Potential Flows

access here

Author(s)

Andy Keller; Max Welling; Nicu Sebe; Yue Song

Institution

University of Amsterdam; University of Trento;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Machine Learning

access here

30/06/2023

JGNN: Graph Neural Networks on Native Java

access here

Author(s)

Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas

Abstract

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

30/06/2023

PI-Trans: Parallel-ConvMLP and Implicit-Transformation Based GAN for Cross-View Image Translation

access here

Author(s)

Bin Ren; Hao Tang; Nicu Sebe; Wei Wang; Xia Li; Yiming Wang;

Institution

Beijing Jiaotong University; ETH Zurich; FBK; University of Trento;

Abstract

For semantic-guided cross-view image translation, it is crucial to learn where to sample pixels from the source view image and where to reallocate them guided by the target view semantic map, especially when there is little overlap or drastic view difference between the source and target images. Hence, one not only needs to encode the longrange dependencies among pixels in both the source view image and target view semantic map but also needs to translate these learned dependencies. To this end, we propose a novel generative adversarial network, PI-Trans, which mainly consists of a novel Parallel-ConvMLP module and an Implicit Transformation module at multiple semantic levels. Extensive experimental results show that PI-Trans achieves the best qualitative and quantitative performance by a large margin compared to the state-of-the-art methods on two challenging datasets. The source code is available at https://github.com/Amazingren/PI-Trans.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

22/06/2023

MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset

access here

Author(s)

Chen Feng; Ioannis Patras;

Institution

Queen Mary University of London;

Abstract

Deep learning has achieved great success in recent years with the aid of advanced neural network structures and large-scale human-annotated datasets. However, it is often costly and difficult to accurately and efficiently annotate large-scale datasets, especially for some specialized domains where fine-grained labels are required. In this setting, coarse labels are much easier to acquire as they do not require expert knowledge. In this work, we propose a contrastive learning method, called masked contrastive learning (MaskCon) to address the under-explored problem setting, where we learn with a coarse-labelled dataset in order to address a finer labelling problem. More specifically, within the contrastive learning framework, for each sample our method generates soft-labels with the aid of coarse labels against other samples and another augmented view of the sample in question. By contrast to self-supervised contrastive learning where only the sample’s augmentations are considered hard positives, and in supervised contrastive learning where only samples with the same coarse labels are considered hard positives, we propose soft labels based on sample distances, that are masked by the coarse labels. This allows us to utilize both inter-sample relations and coarse labels. We demonstrate that our method can obtain as special cases many existing state-of-the-art works and that it provides tighter bounds on the generalization error. Experimentally, our method achieves significant improvement over the current state-of-the-art in various datasets, including CIFAR10, CIFAR100, ImageNet-1K, Standford Online Products and Stanford Cars196 datasets. Code and annotations are available at https://github.com/ MrChenFeng/MaskCon_CVPR2023.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

19/06/2023

Self-Supervised Video Similarity Learning

access here

Author(s)

Christos Tzelepis; Giorgios Kordopatis-Zilos; Giorgios Tolias; Ioannis Kompatsiaris; Ioannis Patras; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas Czech Technical University in Prague; Queen Mary University of London;

Abstract

We introduce S²VS, a video similarity learning approach with self-supervision. Self-Supervised Learning (SSL) is typically used to train deep models on a proxy task so as to have strong transferability on target tasks after fine-tuning. Here, in contrast to prior work, SSL is used to perform video similarity learning and address multiple retrieval and detection tasks at once with no use of labeled data. This is achieved by learning via instance-discrimination with task-tailored augmentations and the widely used InfoNCE loss together with an additional loss operating jointly on self-similarity and hard-negative similarity. We benchmark our method on tasks where video relevance is defined with varying granularity, ranging from video copies to videos depicting the same incident or event. We learn a single universal model that achieves state-of-the-art performance on all tasks, surpassing previously proposed methods that use labeled data. The code and pretrained models are publicly available at: https://github.com/gkordo/s2vs.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

18/06/2023

MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset

access here

Author(s)

Chen Feng; Ioannis Patras;

Institution

Queen Mary University of London;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

18/06/2023

Neuromorphic Event-based Facial Expression Recognition

access here

Author(s)

Alberto Del Bimbo; Andrea Leonardo Chiara Albisani Federico Becattini; Lisa Cresti Lorenzo Berlincioni; Luca Cultrera; Sara Picchioni

Institution

Università degli Studi di Firenze; University of Florence;

Abstract

Recently, event cameras have shown large applicability in several computer vision fields especially concerning tasks that require high temporal resolution. In this work, we investigate the usage of such kind of data for emotion recognition by presenting NEFER, a dataset for Neuromorphic Event-based Facial Expression Recognition. NEFER is composed of paired RGB and event videos representing human faces labeled with the respective emotions and also annotated with face bounding boxes and facial landmarks. We detail the data acquisition process as well as providing a baseline method for RGB and event data. The collected data captures subtle micro-expressions, which are hard to spot with RGB data, yet emerge in the event domain. We report a double recognition accuracy for the event-based approach, proving the effectiveness of a neuromorphic approach for analyzing fast and hardly detectable expressions and the emotions they conceal.

Access

Open Access

Type of Publication

Conference paper

Publisher

Computer Vision Foundation

access here

16/06/2023

Knowledge-driven Active Learning

access here

Author(s)

Alessandrio Betti Frédéric Precioso Gabriele Ciravegna Kevin Mottin Marco Gori

Institution

Politecnico di Torino Université Côte d'Azur; University of Siena

Abstract

The deployment of Deep Learning (DL) models is still precluded in those contexts where the amount of supervised data is limited.
To answer this issue, active learning strategies aim at minimizing the amount of labelled data required to train a DL model. Most active strategies are based on uncertain sample selection, and even often restricted to samples lying close to the decision boundary. These techniques are theoretically sound, but an understanding of the selected samples based on
their content is not straightforward, further driving non-experts to consider DL as a black-box. For the first time, here we propose to take into consideration common domain-knowledge and enable non-expert users to train a model with fewer samples. In our Knowledge-driven Active Learning (KAL) framework, rule-based knowledge is converted into logic constraints and their violation is checked as a natural guide for sample selection. We show that even simple relationships among data and output classes offer a way to spot predictions for which the model need supervision. We empirically show that KAL (i) outperforms many active learning strategies, particularly in those contexts where domain knowledge is rich, (ii) it discovers data distribution lying far from the initial training data, (iii) it ensures domain experts that the provided knowledge is acquired by the model, (iv) it is suitable for regression and object recognition tasks unlike uncertainty-based strategies, and (v) its computational demand is low.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

15/06/2023

Escaping local minima in deep reinforcement learning for video summarization

access here

Author(s)

Fabrizio Falchi; Jan Sedmidubsky; Nicola Messina; Tomás Rebok;

Institution

ISTI-CNR; Masaryk University;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

12/06/2023

Improving Synthetically Generated Image Detection in Cross-Concept Settings

access here

Author(s)

Giorgios Kordopatis-Zilos; Ioannis Kompatsiaris; Pantelis Dogoulis; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas

Abstract

New advancements for the detection of synthetic images are critical for fighting disinformation, as the capabilities of generative AI models continuously evolve and can lead to hyper-realistic synthetic imagery at unprecedented scale and speed. In this paper, we focus on the challenge of generalizing across different concept classes, e.g., when training a detector on human faces and testing on synthetic animal images — highlighting the ineffectiveness of existing approaches that randomly sample generated images to train their models. By contrast, we propose an approach based on the premise that the robustness of the detector can be enhanced by training it on realistic synthetic images that are selected based on their quality scores according to a probabilistic quality estimation model. We demonstrate the effectiveness of the proposed approach by conducting experiments with generated images from two seminal architectures, StyleGAN2 and Latent Diffusion, and using three different concepts for each, so as to measure the cross-concept generalization ability. Our results show that our quality-based sampling method leads to higher detection performance for nearly all concepts, improving the overall effectiveness of the synthetic image detectors.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

12/06/2023

MAD ’23 Workshop: Multimedia AI against Disinformation

access here

Author(s)

Adrian Popescu; Bogdan Ionescu; Giorgios Kordopatis-Zilos; Luca Cuccovillo; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas Czech Technical University in Prague; Fraunhofer IDMT; Université Paris-Saclay; University Politehnica of Bucharest

Abstract

With recent advancements in synthetic media manipulation and generation, verifying multimedia content posted online has become increasingly difficult. Additionally, the malicious exploitation of AI technologies by actors to disseminate disinformation on social media, and more generally the Web, at an alarming pace poses significant threats to society and democracy. Therefore, the development of AI-powered tools that facilitate media verification is urgently needed. The MAD ’23 workshop aims to bring together individuals working on the wider topic of detecting disinformation in multimedia to exchange their experiences and discuss innovative ideas, attracting people with varying backgrounds and expertise. The research areas of interest include identifying manipulated and synthetic content in multimedia, as well as examining the dissemination of disinformation and its impact on society. The multimedia aspect is very important since content most often contains a mix of modalities and their joint analysis can boost the performance of verification methods.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Multimedia Retrieval

access here

12/06/2023

[Front Matter] MAD ’23: Proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation

access here

Author(s)

Adrian Popescu; Bogdan Ionescu; Giorgios Kordopatis-Zilos; Luca Cuccovillo; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas Czech Technical University in Prague; Fraunhofer IDMT; Université Paris-Saclay; University Politehnica of Bucharest

Abstract

Front matter of the proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation, held in Thessaloniki (Greece) on June 12th, 2023. The full proceedings are available online at https://doi.org/10.1145/3591106.

Access

Open Access

Type of Publication

Book section

Publisher

International Workshop on Multimedia AI against Disinformation

access here

12/06/2023

VISIONE: a large-scale video retrieval system with advanced search functionalities

access here

Author(s)

Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;

Institution

ISTI-CNR;

Abstract

VISIONE is a large-scale video retrieval system that integrates multiple search functionalities, including free text search, spatial color and object search, visual and semantic similarity search, and temporal search. The system leverages cutting-edge AI technology for visual analysis and advanced indexing techniques to ensure scalability. As demonstrated by its runner-up position in the 2023 Video Browser Showdown competition, VISIONE effectively integrates these capabilities to provide a comprehensive video retrieval solution. A system demo is available online, showcasing its capabilities on over 2300 hours of diverse video content (V3C1+V3C2 dataset) and 12 hours of highly redundant content (Marine dataset). The demo can be accessed at https://visione.isti.cnr.it.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Multimedia Retrieval

access here

09/06/2023

SMART360: Simulating Motion prediction and Adaptive bitRate sTrategies for 360° video streaming

access here

Author(s)

Lucile Sassatelli; Quentin Guimard

Institution

Université Côte d'Azur;

Abstract

Adaptive bitrate (ABR) algorithms are used in streaming media to adjust video or audio quality based on the viewer’s network conditions to provide a smooth playback experience. With the rise of virtual reality (VR) headsets, 360° video streaming is growing rapidly and requires efficient ABR strategies to also adapt the video quality to the user’s head position. However, research in this field is often difficult to compare due to a lack of reproducible simulations. To address this problem, we provide SMART360, a 360° streaming simulation environment to compare motion prediction and adaptive bitrates strategies.
We provide sample inputs and baseline algorithms along with the simulator, as well as examples of results and visualizations that can be obtained with SMART360. The code and data are made publicly available.

Access

Open Access

Type of Publication

Conference paper

Publisher

ACM Multimedia Systems Conference

access here

01/06/2023

Logit Margin Matters: Improving Transferable Targeted Adversarial Attack by Logit Calibration

access here

Author(s)

Juanjuan Weng; Nicu Sebe; Shaozi Li; Zhiming Luo; Zhun Zhong

Institution

University of Trento; Xiamen University

Abstract

Access

Open Access

Type of Publication

Publisher

Transactions on Information Forensics and Security

access here

01/06/2023

Graph Transformer GANs for Graph-Constrained House Generation

access here

Author(s)

Hao Tang;

Institution

ETH Zurich; Tencent AI Lab University of Oregon;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

Computer Vision and Pattern Recognition

access here

01/06/2023

Dynamic Conceptional Contrastive Learning for Generalized Category Discovery

access here

Author(s)

Nan Pu; Nicu Sebe; Zhun Zhong

Institution

University of Trento;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

Computer Vision and Pattern Recognition

access here

01/06/2023

Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers

access here

Author(s)

Bin Ren; Nicu Sebe; Rita Cucchiara; Wei Bi; Wei Wang; Yahui Liu Yue Song

Institution

Beijing Jiaotong University; Tencent AI Lab University of Modena and Reggio Emilia; University of Trento;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

Computer Vision and Pattern Recognition

access here

01/06/2023

Dynamically Instance-Guided Adaptation: A Backward-free Approach for Test-Time Domain Adaptive Semantic Segmentation

access here

Author(s)

Boyu Wang; Charles Ling; Nicu Sebe; Wei Wang; Weijie Wang; Xi Chen; Zhun Zhong

Institution

Huawei Noah's Ark Lab; University of Trento; Western University;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

Computer Vision and Pattern Recognition

access here

26/05/2023

Design Space Exploration of Shell Structures Using Quality Diversity Algorithms

access here

Author(s)

Antonios Liapis; Edoardo Tibuzzi; Georgios N. Yannakakis; Jeg Dudley; Joel Hilmersson; Konstantinos Sfikas;

Institution

AKT II; University of Malta

Abstract

Computer-aided optimization algorithms in structural engineering have historically focused on the structural performance of generated forms, often resulting in the selection of a single ‘optimal’ solution. However, diversity of generated solutions is desirable when those solutions are shown to a human user to choose from. Quality-Diversity (QD) search is an emerging field of Evolutionary Computation which can automate the exploration of the solution space in engineering problems. QD algorithms, such as MAP-Elites, operate by maintaining and expanding an archive of diverse solutions, optimising for quality in local niches of a multidimensional design space. The generated archive of solutions can help engineers gain a better overview of the solution space, illuminating which designs are possible and their trade-offs. In this paper we apply Quality Diversity search to the problem of designing shell structures. Since the design of shell structures comes with physical constraints, we leverage a constrained optimization variant of the MAP-Elites algorithm, FI-MAP-Elites. We implement our proposed methodology within the Rhino/Grasshopper environment and use the Karamba Finite Element Analysis solver for all structural engineering calculations. We test our method on case studies of parametric models of shell structures that feature varying complexity. Our experiments investigate the algorithm’s ability to illuminate the solution space and generate feasible and high-quality solutions.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

23/05/2023

Parts of Speech-Grounded Subspaces in Vision-Language Models

access here

Author(s)

Christos Tzelepis; Ioannis Panagakis Ioannis Patras; James Oldfield; Mihalis Nicolaou;

Institution

Cyprus Institute; National and Kapodistrian University of Athens Queen Mary University of London;

Abstract

Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP’s joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form while respecting the underlying geometry of the manifold on which the representations lie. What’s more, we show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances (e.g. artists’ painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists’ styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Neural Information Processing Systems

access here

19/05/2023

You’ve Been Framed – Partial Audio Matching Functionality to Support Framing Analysis

access here

Author(s)

Johan Oomen; Philo van Kemenade; Rasa Bocyte

Institution

Netherlands Institute for Sound & Vision

Abstract

Segments of audiovisual content are constantly being reinterpreted as they are reused and repeated in new contexts. Framing analysis can reveal patterns and biases in the way content is being recontextualised in the media to shape public discourse. In the AI4Media project, the Netherlands Institute for Sound & Vision has been investigating how AI-based tools could support humanities scholars in performing framing analysis across large-scale audiovisual collections. This short paper describes a demo of the Partial Audio Matching (PAM) functionality designed for this purpose. It describes how PAM has been integrated into the CLARIAH Media Suite – a virtual research space for humanities scholars that enables the exploration and analysis of audiovisual collections.

Access

Open Access

Type of Publication

Report

Publisher

N/A

access here

16/05/2023

Raising User Awareness about the Consequences of Online Photo Sharing

access here

Author(s)

Adrian Popescu; Hugo Schindler; Jérôme Deshayes-Chossart; Van-Khoa Nguyen

Institution

Université Paris-Saclay; University of Geneva;

Abstract

Online social networks use AI techniques to automatically infer profiles from users’ shared data.
However, these inferences and their effects remain, to a large extent, opaque to the users themselves.
We propose a method which raises user awareness about the potential use of their profiles in impactful situations, such as searching for a job or an accommodation.
These situations illustrate usage contexts that users might not have anticipated when deciding to share their data.
User photographic profiles are described by automatic object detections in profile photos, and associated object ratings in situations.
Human ratings of the profiles per situation are also available for training.
These data are represented as graph structures which are fed into graph neural networks in order to learn how to automatically rate them.
An adaptation of the learning procedure per situation is proposed since the same profile is likely to be interpreted differently, depending on the context.
Automatic profile ratings are compared to one another in order to inform individual users of their standing with respect to others.
Our method is evaluated on a public dataset, and consistently outperforms competitive baselines.
An ablation study gives insights about the role of its main components.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Multimedia Retrieval

access here

15/05/2023

Recurrent Vision Transformer for Solving Visual Reasoning Problems

access here

Author(s)

Claudio Gennaro; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Nicola Messina;

Institution

ISTI-CNR;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Image Analysis and Processing

access here

08/05/2023

New Bulgarian Resources for Studying Deception and Detecting Disinformation

access here

Author(s)

Hristiana Krasteva; Irina Temnikova; Ivo Dzhumerov; Ruslana Margova; Tsvetelina Stefanova; Veneta Kireva;

Institution

GATE Institute;

Abstract

Automatically detecting disinformation is an important Natural Language Processing (NLP) task whose results can assist journalists and the general public. The European Commission defines “disinformation” as “false or misleading content that is spread with an intention to deceive”. Deception and thus disinformation can be identified by the presence of (psycho)linguistic markers, but some lower-resourced languages (e.g. Bulgarian) lack sufficient linguistic and psycholinguistic research on this topic, lists of such markers and suitable datasets. This article introduces the first ever resources for studying and detecting deception and disinformation in Bulgarian (some of which can be adapted to other languages). The resources can benefit linguists, psycholinguists and NLP researchers, are accessible on Zenodo (subject to legal conditions) and include: 1) an extended hierarchical classification of linguistic markers signalling deception; 2) lists of Bulgarian expressions for recognizing some of the linguistic markers; 3) four large Bulgarian social media datasets on topics related to deception, not fact-checked, but automatically annotated with the markers; 4) Python scripts to automatically collect, clean, anonymize, and annotate new Bulgarian texts. The datasets can be used to build machine learning methods or study potential deception. The article describes the methods of collecting and processing the datasets and linguistic markers, and presents some statistics.

Access

Open Access

Type of Publication

Conference paper

Publisher

Language & Technology Conference

access here

01/05/2023

Multi-Channel Attention Selection GANs for Guided Image-to-Image Translation

access here

Author(s)

Hao Tang; Nicu Sebe; Philip Torr;

Institution

ETH Zurich; University of Oxford; University of Trento;

Abstract

Access

Closed Access

Type of Publication

Journal article

Publisher

IEEE Transaction on Pattern Analysis and Machine Intelligence

access here

30/04/2023

Edge Guided GANs with Contrastive Learning for Semantic Image Synthesis

access here

Author(s)

Dan Xu; Guolei Sun; Hao Tang; Luc van Gool; Nicu Sebe; Radu Timofte; Xiaojuan Qi;

Institution

ETH Zurich; HKUST; University of Hong Kong; University of Trento; University of Wurzburg;

Abstract

We propose a novel edge guided generative adversarial network with contrastive learning (ECGAN) for the challenging semantic image synthesis task. Although considerable improvement has been achieved, the quality of synthesized images is far from satisfactory due to three largely unresolved challenges. 1) The semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures. 2) The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss and thus cannot fully preserve the original semantic information, leading to semantically inconsistent results (e.g., missing small objects). 3) Existing semantic image synthesis methods focus on modeling “local” semantic information from a single input semantic layout. However, they ignore “global” semantic information of multiple input semantic layouts, i.e., semantic cross-relations between pixels across different input layouts. To tackle 1), we propose to use edge as an intermediate representation which is further adopted to guide image generation via a proposed attention guided edge transfer module. Edge information is produced by a convolutional generator and introduces detailed structure information. To tackle 2), we design an effective module to selectively highlight class-dependent feature maps according to the original semantic layout to preserve the semantic information. To tackle 3), inspired by current methods in contrastive learning, we propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content than those from different classes. Doing so can capture more semantic relations by explicitly exploring the structures of labeled pixels from multiple input semantic layouts. Experiments on three challenging datasets show that our ECGAN achieves significantly better results than state-of-the-art methods.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Learning Representations

access here

24/04/2023

Edge Guided GANs with Contrastive Learning for Semantic Image Synthesis

access here

Author(s)

Dan Xu; Guolei Sun; Hao Tang; Luc van Gool; Nicu Sebe; Radu Timofte; Xiaojuan Qi;

Institution

ETH Zurich; HKUST; University of Hong Kong; University of Trento; University of Wurzburg;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Learning Representations

access here

24/04/2023

PI-Trans: Parallel-ConvMLP and Implicit-Transformation Based GAN for Cross-View Image Translation

access here

Author(s)

Bin Ren; Hao Tang; Nicu Sebe; Wei Wang; Xia Li; Yiming Wang;

Institution

Beijing Jiaotong University; ETH Zurich; FBK; University of Trento;

Abstract

Access

Closed Access

Type of Publication

Conference paper

Publisher

Speech and Signal Processing

access here

21/04/2023

Unsupervised Domain Adaptation for Video Violence Detection in the Wild

access here

Author(s)

Carlos Santiago; Claudio Gennaro; Giuseppe Amato; João Paulo Costeira; Luca Ciampi;

Institution

Instituto Superior Técnico; ISTI-CNR;

Abstract

Video violence detection is a subset of human action recognition aiming to detect violent behaviors in trimmed video clips. Current Computer Vision solutions based on Deep Learning approaches provide astonishing results. However, their success relies on large collections of labeled datasets for supervised learning to guarantee that they generalize well to diverse testing scenarios. Although plentiful annotated data may be available for some pre-specified domains, manual annotation is unfeasible for every ad-hoc target domain or task. As a result, in many real-world applications, there is a domain shift between the distributions of the train (source) and test (target) domains, causing a significant drop in performance at inference time. To tackle this problem, we propose an Unsupervised Domain Adaptation scheme for video violence detection based on single image classification that mitigates the domain gap between the two domains. We conduct experiments considering as the source labeled domain some datasets containing violent/non-violent clips in general contexts and, as the target domain, a collection of videos specific for detecting violent actions in public transport, showing that our proposed solution can improve the performance of the considered models.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Image Processing and Vision Engineering

access here

12/04/2023

Multiplayer Tension In the Wild: A Hearthstone Case

access here

Author(s)

Antonios Liapis; David Melhart; Georgios N. Yannakakis; Paris Mavromoustakos-Blom; Pieter Spronck; Sander Bakkes;

Institution

Tilburg University; University of Malta Utrecht University;

Abstract

Games are designed to elicit strong emotions during game play, especially when players are competing against each other. Artificial Intelligence applied to predict a player’s emotions has mainly been tested on single-player experiences in low-stakes settings and short-term interactions. How do players experience and manifest affect in high-stakes competitions, and which modalities can capture this? This paper reports a first experiment in this line of research, using a competition of the video game Hearthstone where both competing players’ game play and facial expressions were recorded over the course of the entire match which could span up to 41 minutes. Using two experts’ annotations of tension using a continuous video affect annotation tool, we attempt to predict tension from the webcam footage of the players alone. Treating both the input and the tension output in a relative fashion, our best models reach 66.3% average accuracy (up to 79.2% at the best fold) in the challenging leave-one-participant out cross-validation task. This initial experiment shows a way forward for affect annotation in games “in the wild” in high-stakes, real-world competitive settings.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on the Foundations of Digital Games

access here

06/04/2023

From the Lab to the Wild: Affect Modeling Via Privileged Information

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas;

Institution

University of Malta

Abstract

How can we reliably transfer affect models trained in controlled laboratory conditions ( in-vitro ) to uncontrolled real-world settings ( in-vivo )? The information gap between in-vitro and in-vivo applications defines a core challenge of affective computing. This gap is caused by limitations related to affect sensing including intrusiveness, hardware malfunctions and availability of sensors. As a response to these limitations, we introduce the concept of privileged information for operating affect models in real-world scenarios (in the wild). Privileged information enables affect models to be trained across multiple modalities available in a lab, and ignore, without significant performance drops, those modalities that are not available when they operate in the wild. Our approach is tested in two multimodal affect databases one of which is designed for testing models of affect in the wild. By training our affect models using all modalities and then using solely raw footage frames for testing the models, we reach the performance of models that fuse all available modalities for both training and testing. The results are robust across both classification and regression affect modeling tasks which are dominant paradigms in affective computing. Our findings make a decisive step towards realizing affect interaction in the wild.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Affective Computing

access here

05/04/2023

Perceptual Quality Improvement in Videoconferencing Using Keyframes-Based GAN

access here

Author(s)

Alberto Del Bimbo; Leonardo Galteri; Lorenzo Agnolucci Marco Bertini;

Institution

University of Florence;

Abstract

In the latest years, videoconferencing has taken a fundamental role in interpersonal relations, both for personal and business purposes. Lossy video compression algorithms are the enabling technology for videoconferencing, as they reduce the bandwidth required for real-time video streaming. However, lossy video compression decreases the perceived visual quality. Thus, many techniques for reducing compression artifacts and improving video visual quality have been proposed in recent years. In this work, we propose a novel GAN-based method for compression artifacts reduction in videoconferencing. Given that, in this context, the speaker is typically in front of the camera and remains the same for the entire duration of the transmission, we can maintain a set of reference keyframes of the person from the higher-quality I-frames that are transmitted within the video stream and exploit them to guide the visual quality improvement; a novel aspect of this approach is the update policy that maintains and updates a compact and effective set of reference keyframes. First, we extract multi-scale features from the compressed and reference frames. Then, our architecture combines these features in a progressive manner according to facial landmarks. This allows the restoration of the high-frequency details lost after the video compression. Experiments show that the proposed approach improves visual quality and generates photo-realistic results even with high compression rates.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Multimedia

access here

03/04/2023

DivClust: Controlling Diversity in Deep Clustering

access here

Author(s)

Georgios Tzimiropoulos; Ioannis Maniadis Metaxas; Ioannis Patras;

Institution

Queen Mary University of London;

Abstract

Clustering has been a major research topic in the field of machine learning, one to which Deep Learning has recently been applied with significant success. However, an aspect of clustering that is not addressed by existing deep clustering methods, is that of efficiently producing multiple, diverse partitionings for a given dataset. This is particularly important, as a diverse set of base clusterings are necessary for consensus clustering, which has been found to produce better and more robust results than relying on a single clustering. To address this gap, we propose DivClust, a diversity controlling loss that can be incorporated into existing deep clustering frameworks to produce multiple clusterings with the desired degree of diversity. We conduct experiments with multiple datasets and deep clustering frameworks and show that: a) our method effectively controls diversity across frameworks and datasets with very small additional computational cost, b) the sets of clusterings learned by DivClust include solutions that significantly outperform single-clustering baselines, and c) using an off-the-shelf consensus clustering algorithm, DivClust produces consensus clustering solutions that consistently outperform single-clustering baselines, effectively improving the performance of the base deep clustering framework.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

01/04/2023

AttentionGAN: Unpaired Image-to-Image Translation Using Attention-Guided Generative Adversarial Networks

access here

Author(s)

Dan Xu; Hao Tang; Hong Liu; Nicu Sebe; Philip Torr;

Institution

ETH Zurich; Hong Kong University of Science and Technology; Peking University; University of Oxford; University of Trento;

Abstract

State-of-the-art methods in the image-to-image translation are capable of learning a mapping from a source domain to a target domain with unpaired image data. Though the existing methods have achieved promising results, they still produce visual artifacts, being able to translate low-level information but not high-level semantics of input images. One possible reason is that generators do not have the ability to perceive the most discriminative parts between the source and target domains, thus making the generated images low quality. In this article, we propose a new Attention-Guided Generative Adversarial Networks (AttentionGAN) for the unpaired image-to-image translation task. AttentionGAN can identify the most discriminative foreground objects and minimize the change of the background. The attention-guided generators in AttentionGAN are able to produce attention masks, and then fuse the generation output with the attention masks to obtain high-quality target images. Accordingly, we also design a novel attention-guided discriminator which only considers attended regions. Extensive experiments are conducted on several generative tasks with eight public datasets, demonstrating that the proposed method is effective to generate sharper and more realistic images compared with existing competitive models. The code is available at https://github.com/Ha0Tang/AttentionGAN.

Access

Closed Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Neural Networks and Learning Systems

access here

01/04/2023

Mitigating Robust Overfitting via Self-Residual-Calibration Regularization

access here

Author(s)

Hong Liu; Nicu Sebe; Sin'ichi Satoh; Zhun Zhong

Institution

National Institute of Informatics of Tokyo; University of Trento;

Abstract

Overfitting in adversarial training has attracted the interest of researchers in the community of artificial intelligence and machine learning in recent years. To address this issue, in this paper we begin by evaluating the defense performances of several calibration methods on various robust models. Our analysis and experiments reveal two intriguing properties: 1) a well-calibrated robust model is decreasing the confidence of robust model; 2) there is a trade-off between the confidences of natural and adversarial images. These new properties offer a straightforward insight into designing a simple but effective regularization, called Self-Residual-Calibration (SRC). The proposed SRC calculates the absolute residual between adversarial and natural logit features corresponding to the ground-truth labels. Furthermore, we utilize the pinball loss to minimize the quantile residual between them, resulting in more robust regularization. Extensive experiments indicate that our SRC can effectively mitigate the overfitting problem while improving the robustness of state-of-the-art models. Importantly, SRC is complementary to various regularization methods. When combined with them, we are capable of achieving the top-rank performance on the AutoAttack benchmark leaderboard.

Access

Closed Access

Type of Publication

Journal article

Publisher

Artificial Intelligence

access here

01/04/2023

Controllable Exploration of a Design Space via Interactive Quality Diversity

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Konstantinos Sfikas;

Institution

University of Malta

Abstract

This paper introduces a user-driven evolutionary algorithm based on Quality Diversity (QD) search. During a design session, the user iteratively selects among presented alternatives and their selec- tions affect the upcoming results. We implement a variation of the MAP-Elites algorithm where the presented alternatives are sampled from a small region (window) of the behavioral space. After a user selection, the window is centered on the selected individual’s be- havior characterization, evolution selects parents from within this window to produce offspring, and new alternatives are sampled. Essentially we define an adaptive system of local QD search, where the user’s selections guide the search towards specific regions of the behavioral space. The system is tested on the generation of architectural layouts, a constrained optimization task, leveraging QD search through a two-archive approach.

Access

Open Access

Type of Publication

Conference paper

Publisher

Genetic and Evolutionary Computation Conference

access here

01/04/2023

Controllable Exploration of a Design Space via Interactive Quality Diversity

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Konstantinos Sfikas;

Institution

University of Malta

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

Genetic and Evolutionary Computation Conference

access here

28/03/2023

Report on the 3rd International Workshop on Learning to Quantify (LQ 2023)

access here

Author(s)

Alejandro Moreo; Fabrizio Sebastiani; Mirko Bunse; Pablo González

Institution

ISTI-CNR; University of Applied Sciences and Art Dortmund; University of Oviedo

Abstract

The 3rd International Workshop on Learning to Quantify (LQ 2023) took place on September 18, 2023 in Torino, IT, where it was organised as a satellite event of the 34th European Conference on Machine Learning and Principles and Practice of Knowledg Discovery in Databases (ECML PKDD 2023). Like the main program of the conference, the workshop employed a hybrid format, with all presentations given in presence and with attendees participating in presence or online. This report presents a summary of the workshop, briefly summarising the individual works presented, and touching on the main issues that emerged during the final, open discussion.

Access

Open Access

Type of Publication

Journal article

Publisher

SIGKDD Explorations

access here

21/03/2023

Evaluation of Off-the-Shelf Language Identification Tools on Bulgarian Social Media Posts

access here

Author(s)

Hristiana Nikolaeva; Irina Temnikova; Ivo Dzhumerov; Silvia Gargova;

Institution

GATE Institute; Plovdic University;

Abstract

Automatic Language Identification (LI) is a widely addressed task, but not all users (for example linguists) have the means or interest to develop their own tool or to train the existing ones with their own data. There are several off-the-shelf LI tools, but for some languages, it is unclear which tool is the best for specific types of text. This article presents a comparison of the performance of several off-the-shelf language identification tools on Bulgarian social media data. The LI tools are tested on a multilingual Twitter dataset (composed of 2966 tweets) and an existing Bulgarian Twitter dataset on the topic of fake content detection of 3350 tweets. The article presents the manual annotation procedure of the first dataset, a dis- cussion of the decisions of the two annotators, and the results from testing the 7 off-the-shelf LI tools on both datasets. Our findings show that the tool, which is the easiest for users with no programming skills, achieves the highest F1-Score on Bulgarian social media data, while other tools have very useful functionalities for Bulgarian social media texts.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Computational Linguistics

access here

17/03/2023

SegmentCodeList: Unsupervised Representation Learning for Human Skeleton Data Retrieval

access here

Author(s)

Fabio Carrara; Giuseppe Amato; Jan Sedmidubsky;

Institution

ISTI-CNR; Masaryk University;

Abstract

Recent progress in pose-estimation methods enables the extraction of sufficiently-precise 3D human skeleton data from ordinary videos, which offers great opportunities for a wide range of applications. However, such spatio-temporal data are typically extracted in the form of a continuous skeleton sequence without any information about semantic segmentation or annotation. To make the extracted data reusable for further processing, there is a need to access them based on their content. In this paper, we introduce a universal retrieval approach that compares any two skeleton sequences based on temporal order and similarities of their underlying segments. The similarity of segments is determined by their content-preserving low-dimensional code representation that is learned using the Variational AutoEncoder principle in an unsupervised way. The quality of the proposed representation is validated in retrieval and classification scenarios; our proposal outperforms the state-of-the-art approaches in effectiveness and reaches speed-ups up to 64x on common skeleton sequence datasets.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

13/03/2023

CoReS: Compatible Representations via Stationarity

access here

Author(s)

Alberto Del Bimbo; Federico Pernici; Matteo Bruni; Niccolò Biondi

Institution

University of Florence;

Abstract

Compatible features enable the direct comparison of old and new learned features allowing to use them interchangeably over time. In visual search systems, this eliminates the need to extract new features from the gallery-set when the representation model is upgraded with novel data. This has a big value in real applications as re-indexing the gallery-set can be computationally expensive when the gallery-set is large, or even infeasible due to privacy or other concerns of the application. In this paper, we propose CoReS, a new training procedure to learn representations that are compatible with those previously learned, grounding on the stationarity of the features as provided by fixed classifiers based on polytopes. With this solution, classes are maximally separated in the representation space and maintain their spatial configuration stationary as new classes are added, so that there is no need to learn any mappings between representations nor to impose pairwise training with the previously learned model.
We demonstrate that our training procedure largely outperforms the current state of the art and is particularly effective in the case of multiple upgrades of the training-set, which is the typical case in real applications.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transaction on Pattern Analysis and Machine Intelligence

access here

08/03/2023

On the Generalization of Deep Learning Models in Video Deepfake Detection

access here

Author(s)

Claudio Gennaro; Davide Alessandro Coccomini; Fabrizio Falchi; Roberto Caldelli

Institution

ISTI-CNR; Mercatorum University;

Abstract

The increasing use of deep learning techniques to manipulate images and videos, commonly referred to as “deepfakes,” is making more and more challenging to differentiate between real and 2 fake content. While various deepfake detection systems have been developed, they often struggle to detect deepfakes in real-world situations. In particular, these methods are often unable to effectively distinguish images or videos when these are modified using novel techniques which have not been used in the training set. In this study, we carry out an analysis of different deep learning architectures in an attempt to understand which is more capable of better generalizing the concept of deepfake. According to our results, it appears that Convolutional Neural Networks (CNNs) seem to be more capable of storing specific anomalies and thus excel in cases of datasets with a limited number of elements and manipulation methodologies. The Vision Transformer, conversely, is more effective when trained with more varied datasets, achieving more outstanding generalization capabilities than the other methods analysed. Finally, the Swin Transformer appears to be a good alternative for using an attention-based method in a more limited data regime and performs very well in cross-dataset scenarios. All the analyzed architectures seem to have a different way to look at deepfakes but since in a real-world environment, the generalization capability is essential, based on the carried out experiments the attention-based architectures seem to provide superior performances.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

01/03/2023

Bipartite Graph Reasoning GANs for Person Pose and Facial Image Synthesis

access here

Author(s)

Hao Tang; Ling Shao; Nicu Sebe; Philip Torr;

Institution

ETH Zurich; Terminus AI Lab; University of Oxford; University of Trento;

Abstract

We present a novel bipartite graph reasoning Generative Adversarial Network (BiGraphGAN) for two challenging tasks: person pose and facial image synthesis. The proposed graph generator consists of two novel blocks that aim to model the pose-to-pose and pose-to-image relations, respectively. Specifically, the proposed bipartite graph reasoning (BGR) block aim to reason the long-range cross relations between the source and target pose in a bipartite graph, which mitigates some of the challenges caused by pose deformation. Moreover, we propose a new interaction-and-aggregation (IA) block to effectively update and enhance the feature representation capability of both a person’s shape and appearance in an interactive way. To further capture the change in pose of each part more precisely, we propose a novel part-aware bipartite graph reasoning (PBGR) block to decompose the task of reasoning the global structure transformation with a bipartite graph into learning different local transformations for different semantic body/face parts. Experiments on two challenging generation tasks with three public datasets demonstrate the effectiveness of the proposed methods in terms of objective quantitative scores and subjective visual realness. The source code and trained models are available at https://github.com/Ha0Tang/BiGraphGAN.

Access

Closed Access

Type of Publication

Journal article

Publisher

International Journal of Computer Vision

access here

01/03/2023

Deep Unsupervised Key Frame Extraction for Eficient Video Classification

access here

Author(s)

Bin Ren; Hao Tang; Lei Ding; Nicu Sebe; Paolo Rota; Songsong Wu;

Institution

ETH Zurich; Guangdong University of Petrochemical Technology; University of Trento;

Abstract

Video processing and analysis have become an urgent task since a huge amount of videos (e.g., Youtube, Hulu) are uploaded online every day. The extraction of representative key frames from videos is very important in video processing and analysis since it greatly reduces computing resources and time. Although great progress has been made recently, large-scale video classification remains an open problem, as the existing methods have not well balanced the performance and eiciency simultaneously. To tackle this problem, this work presents an unsupervised method to retrieve the key frames, which combines Convolutional Neural Network (CNN) and Temporal Segment Density Peaks Clustering (TSDPC). The proposed TSDPC is a generic and powerful framework and it has two advantages compared with previous works, one is that it can calculate the number of key frames automatically. The other is that it can preserve the temporal information of the video. Thus it improves the eiciency of video classiication. Furthermore, a Long Short-Term Memory network (LSTM) is added on the top of the CNN to further elevate the performance of classiication. Moreover, a weight fusion strategy of diferent input networks is presented to boost the performance. By optimizing both video classiication and key frame extraction simultaneously, we achieve better classiication performance and higher eiciency.We evaluate our method on two popular datasets (i.e., HMDB51 and UCF101) and the experimental results consistently demonstrate that our strategy achieves competitive performance and eficiency compared with the state-of-the-art approaches.

Access

Closed Access

Type of Publication

Journal article

Publisher

ACM Transactions on Multimedia Computing, Communications, and Applications

access here

01/03/2023

Attribute-preserving Face Dataset Anonymization via Latent Code Optimization

access here

Author(s)

Christos Tzelepis; Ioannis Patras; Nicu Sebe; Simone Barattin;

Institution

Queen Mary University of London; University of Trento;

Abstract

This work addresses the problem of anonymizing the identity of faces in a dataset of images, such that the privacy of those depicted is not violated, while at the same time the dataset is useful for downstream task such as for training machine learning models. To the best of our knowledge, we are the first to explicitly address this issue and deal with two major drawbacks of the existing state-of-the-art approaches, namely that they (i) require the costly training of additional, purpose-trained neural networks, and/or (ii) fail to retain the facial attributes of the original images in the anonymized counterparts, the preservation of which is of paramount importance for their use in downstream tasks. We accordingly present a task-agnostic anonymization procedure that directly optimises the images’ latent representation in the latent space of a pre-trained GAN. By optimizing the latent codes directly, we ensure both that the identity is of a desired distance away from the original (with an identity obfuscation loss), whilst preserving the facial attributes (using a novel feature-matching loss in FaRL’s deep feature space). We demonstrate through a series of both qualitative and quantitative experiments that our method is capable of anonymizing the identity of the images whilst–crucially–better-preserving the facial attributes.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

19/02/2023

CrowdSim2: an Open Synthetic Benchmark for Object Detectors

access here

Author(s)

Adam Cygan; Agnieszka Szczesna; Bartosz Bizón; Dominik Golba; Elzbieta Macioszeck; Luca Ciampi; Michal Cogiel; Michal Staniszewski; Nicola Messina; Pawel Foszner;

Institution

Blees; ISTI-CNR; QSystem.pro; Silesian University of Technology

Abstract

Data scarcity has become one of the main obstacles to developing supervised models based on Artificial Intelligence in Computer Vision. Indeed, Deep Learning-based models systematically struggle when applied in new scenarios never seen during training and may not be adequately tested in non-ordinary yet crucial real-world situations. This paper presents and publicly releases CrowdSim2, a new synthetic collection of images suitable for people and vehicle detection gathered from a simulator based on the Unity graphical engine. It consists of thousands of images gathered from various synthetic scenarios resembling the real world, where we varied some factors of interest, such as the weather conditions and the number of objects in the scenes. The labels are automatically collected and consist of bounding boxes that precisely localize objects belonging to the two object classes, leaving out humans from the annotation pipeline. We exploited this new benchmark as a testing ground for some state-of-the-art detectors, showing that our simulated scenarios can be a valuable tool for measuring their performances in a controlled environment.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

19/02/2023

Development of a Realistic Crowd Simulation Environment for Fine-grained Validation of People Tracking Methods

access here

Author(s)

Adam Cygan; Agnieszka Szczesna; Bartosz Bizón; Dominik Golba; Elzbieta Macioszeck; Luca Ciampi; Michal Cogiel; Michal Staniszewski; Nicola Messina; Pawel Foszner;

Institution

Blees; ISTI-CNR; QSystem.pro; Silesian University of Technology

Abstract

Generally, crowd datasets can be collected or generated from real or synthetic sources. Real data is generated by using infrastructure-based sensors (such as static cameras or other sensors). The use of simulation tools can significantly reduce the time required to generate scenario-specific crowd datasets, facilitate data-driven research, and next build functional machine learning models. The main goal of this work was to develop an extension of crowd simulation (named CrowdSim2) and prove its usability in the application of people-tracking algorithms. The simulator is developed using the very popular Unity 3D engine with particular emphasis on the aspects of realism in the environment, weather conditions, traffic, and the movement and models of individual agents. Finally, three methods of tracking were used to validate generated dataset: IOU-Tracker, Deep-Sort, and Deep-TAMA.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

18/02/2023

Explainable Sparse Attention for Memory-Based Trajectory Predictors

access here

Author(s)

Alberto Del Bimbo; Federico Becattini; Francesco Marchetti; Lorenzo Seidenari;

Institution

Università degli Studi di Firenze; University of Florence;

Abstract

In this paper we address the problem of trajectory prediction, focusing on memory-based models. Such methods are trained to collect a set of useful samples that can be retrieved and used at test time to condition predictions. We propose Explainable Sparse Attention (ESA), a module that can be seamlessly plugged-in into several existing memory-based state of the art predictors. ESA generates a sparse attention in memory, thus selecting a small subset of memory entries that are relevant for the observed trajectory. This enables an explanation of the model’s predictions with reference to previously observed training samples. Furthermore, we demonstrate significant improvements on three trajectory prediction datasets.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

09/02/2023

A global taxonomy of interpretable AI: unifying the terminology for the technical and social sciences

access here

Author(s)

Adrien Depeursinge; Davide Calvaresi; Henning Müller; John O. Prior; José Pereira Amorim; Katerina Yordanova; Lidia Dutklewicz; Lode Lauwaert; Mara Graziani; Mor Vered; Pedro Henriques Abreu; Rahul Nair; Tobias Blanke; Valeria Pulignano; Vincent Andrearczy; Wessel Reijers;

Institution

European University Institute; Faculty of Social Science of Leuven; IPO - Porto Research Centre; Lausanne University Hospital; University of Amsterdam; University of Applied Sciences of Western Switzerland; University of Coimbra; University of Geneva;

Abstract

Since its emergence in the 1960s, Artificial Intelligence (AI) has grown to conquer many technology products and their fields of application. Machine learning, as a major part of the current AI solutions, can learn from the data and through experience to reach high performance on various tasks. This growing success of AI algorithms has led to a need for interpretability to understand opaque models such as deep neural networks. Vari- ous requirements have been raised from different domains, together with numerous tools to debug, justify outcomes, and establish the safety, fairness and reliability of the mod- els. This variety of tasks has led to inconsistencies in the terminology with, for instance, terms such as interpretable, explainable and transparent being often used interchange- ably in methodology papers. These words, however, convey different meanings and are “weighted” differently across domains, for example in the technical and social sciences. In this paper, we propose an overarching terminology of interpretability of AI systems that can be referred to by the technical developers as much as by the social sciences community to pursue clarity and efficiency in the definition of regulations for ethical and reliable AI development. We show how our taxonomy and definition of interpretable AI differ from the ones in previous research and how they apply with high versatility to several domains and use cases, proposing a—highly needed—standard for the communication among inter- disciplinary areas of AI.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

02/02/2023

FeTrIL: Feature Translation for Exemplar-Free Class-Incremental Learning

access here

Author(s)

Adrian Popescu; David Picard; Grégoire Petit; Hugo Schindler;

Institution

Université Gustave Eiffel; Université Paris-Saclay;

Abstract

Exemplar-free class-incremental learning is very challenging due to the negative effect of catastrophic forgetting. A balance between stability and plasticity of the incremental process is needed in order to obtain good accuracy for past as well as new classes. Existing exemplar-free class-incremental methods focus either on successive fine tuning of the model, thus favoring plasticity, or on using a feature extractor fixed after the initial incremental state, thus favoring stability. We introduce a method which combines a fixed feature extractor and a pseudo-features generator to improve the stability-plasticity balance. The generator uses a simple yet effective geometric translation of new class features to create representations of past classes, made of pseudo-features. The translation of features only requires the storage of the centroid representations of past classes to produce their pseudo-features. Actual features of new classes and pseudo-features of past classes are fed into a linear classifier which is trained incrementally to discriminate between all classes. The incremental process is much faster with the proposed method compared to mainstream ones which update the entire deep model. Experiments are performed with three challenging datasets, and different incremental settings. A comparison with ten existing methods shows that our method outperforms the others in most cases. FeTrIL code is available at https://github.com/GregoirePetit/FeTrIL.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

01/02/2023

PandA: Unsupervised Learning of Parts and Appearances in the Feature Maps of GANs

access here

Author(s)

Christos Tzelepis; Ioannis Pitas; James Oldfield; Mihalis Nicolaou; Yannis Panagakis

Institution

Cyprus Institute; Queen Mary University of London; University of Athens

Abstract

Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either discovering global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control. Our code is available at https://github.com/james-oldfield/PandA.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Learning Representations

access here

24/01/2023

Source-Free Open Compound Domain Adaptation in Semantic Segmentation

access here

Author(s)

Gim Hee Lee; Nicu Sebe; Yuyang Zhao; Zhiming Luo; Zhun Zhong

Institution

National University of Singapore; University of Trento; Xiamen University

Abstract

In this work, we introduce a new concept, named source-free open compound domain adaptation (SF-OCDA), and study it in semantic segmentation. SF-OCDA is more challenging than the traditional domain adaptation but it is more practical. It jointly considers (1) the issues of data privacy and data storage and (2) the scenario of multiple target domains and unseen open domains. In SF-OCDA, only the source pre-trained model and the target data are available to learn the target model. The model is evaluated on the samples from the target and unseen open domains. To solve this problem, we present an effective framework by separating the training process into two stages: (1) pre-training a generalized source model and (2) adapting a target model with self-supervised learning. In our framework, we propose the Cross-Patch Style Swap (CPSS) to diversify samples with various patch styles in the feature-level, which can benefit the training of both stages. First, CPSS can significantly improve the generalization ability of the source model, providing more accurate pseudo-labels for the latter stage. Second, CPSS can reduce the influence of noisy pseudo-labels and also avoid the model overfitting to the target domain during selfsupervised learning, consistently boosting the performance on the target and open domains. Experiments demonstrate that our method produces state-of-the-art results on the C-Driving dataset. Furthermore, our model also achieves the leading performance on CityScapes for domain generalization.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Circuits and Systems for Video Technology

access here

24/01/2023

Quasi-equilibrium Feature Pyramid Network for Salient Object Detection

access here

Author(s)

Hao Tang; Mengyi Zhao; Nicu Sebe; Wei Wang; Yue Song

Institution

ETH Zurich; University of Trento;

Abstract

Modern saliency detection models are based on the encoder-decoder framework and they use different strategies to fuse the multi-level features between the encoder and decoder to boost representation power. Motivated by recent work in implicit modelling, we propose to introduce an implicit function to simulate the equilibrium state of the feature pyramid at infinite depths. We question the existence of the ideal equilibrium and thus propose a quasi-equilibrium model by taking the first-order derivative into the black-box root solver using Taylor expansion. It models more realistic convergence states and significantly improves the network performance. We also propose a differentiable edge extractor that directly extracts edges from the saliency masks. By optimizing the extracted edges, the generated saliency masks are naturally optimized on contour constraints and the non-deterministic predictions are removed. We evaluate the proposed methodology on five public datasets and extensive experiments show that our method achieves new state-of-the-art performances on six metrics across datasets.

Access

Closed Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Image Processing

access here

24/01/2023

Disentangle Saliency Detection into Cascaded Detail Modeling and Body Filling

access here

Author(s)

Hao Tang; Nicu Sebe; Wei Wang; Yue Song

Institution

ETH Zurich; University of Trento;

Abstract

Salient object detection has been long studied to identify the most visually attractive objects in images/videos. Recently, a growing number of approaches have been proposed all of which rely on the contour/edge information to improve detection performance. The edge labels are either put into the loss directly or used as extra supervision. The edge and body can also be learned separately and then fused afterward. Both methods either lead to high prediction errors near the edge or cannot be trained in an end-to-end manner. Another problem is that existing methods may fail to detect objects of various sizes due to the lack of efficient and effective feature fusion mechanisms. In this work, we propose to decompose the saliency detection task into two cascaded sub-tasks, i.e., detail modelling and body illing. Specifically, the detail modelling focuses on capturing the object edges by supervision of explicitly decomposed detail label that consists of the pixels that are nested on the edge and near the edge. Then the body illing learns the body part which will be illed into the detail map to generate more accurate saliency map. To effectively fuse the features and handle objects at different scales, we have also proposed two novel multi-scale detail attention and body attention blocks for precise detail and body modelling. Experimental results show that our method achieves state-of-the-art performances on six public datasets.

Access

Open Access

Type of Publication

Journal article

Publisher

ACM Multimedia Systems Conference

access here

24/01/2023

Data Augmentation-free Unsupervised Learning for 3D Point Cloud Understanding

access here

Author(s)

Cristiano Saltori; Elisa Ricci; Fabio Poiesi; Guofeng Mei; Jian Zhang; Nicu Sebe; Qiang Wu

Institution

University of Technology Sydney; University of Trento; Vision Lab Fondazione

Abstract

Unsupervised learning on 3D point clouds has undergone a rapid evolution, especially thanks to data augmentation-based contrastive methods. However, data augmentation is not ideal as it requires a careful selection of the type of augmentations to perform, which in turn can affect the geometric and semantic information learned by the network during selftraining. To overcome this issue, we propose an augmentation-free unsupervised approach for point clouds to learn transferable point-level features via soft clustering, named SoftClu. SoftClu assumes that the points belonging to a cluster should be close to each other in both geometric and feature spaces. This differs from typical contrastive learning, which builds similar representations for a whole point cloud and its augmented versions. We exploit the affiliation of points to their clusters as a proxy to enable self-training through a pseudo-label prediction task. Under the constraint that these pseudo-labels induce the equipartition of the point cloud, we cast SoftClu as an optimal transport problem. We formulate an unsupervised loss to minimize the standard cross-entropy between pseudolabels and predicted labels. Experiments on downstream applications, such as 3D object classification, part segmentation, and semantic segmentation, show the effectiveness of our framework in outperforming state-of-the-art techniques.

Access

Open Access

Type of Publication

Conference paper

Publisher

British Machine Vision Conference

access here

24/01/2023

RankFeat: Rank-1 Feature Removal for Out-of-distribution Detection

access here

Author(s)

Nicu Sebe; Wei Wang; Yue Song

Institution

Beijing Jiaotong University; University of Trento;

Abstract

The task of out-of-distribution (OOD) detection is crucial for deploying machine learning models in real-world settings. In this paper, we observe that the singular value distributions of the in-distribution (ID) and OOD features are quite different: the OOD feature matrix tends to have a larger dominant singular value than the ID feature, and the class predictions of OOD samples are largely determined by it. This observation motivates us to propose RankFeat, a simple yet effective post hoc approach for OOD detection by removing the rank-1 matrix composed of the largest singular value and the associated singular vectors from the high-level feature. RankFeat achieves state-of-the-art performance and reduces the average false positive rate (FPR95) by 17.90% compared with the previous best method. Extensive ablation studies and comprehensive theoretical analyses are presented to support the empirical results.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Neural Information Processing Systems

access here

24/01/2023

Adversarial Style Augmentation for Domain Generalized Urban-Scene Segmentation

access here

Author(s)

Gim Hee Lee; Nicu Sebe; Yuyang Zhao; Zhun Zhong

Institution

National University of Singapore; University of Trento;

Abstract

In this paper, we consider the problem of domain generalization in semantic segmentation, which aims to learn a robust model using only labeled synthetic (source) data. The model is expected to perform well on unseen real (target) domains. Our study finds that the image style variation can largely influence the model’s performance and the style features can be well represented by the channel-wise mean and standard deviation of images. Inspired by this, we propose a novel adversarial style augmentation (AdvStyle) approach, which can dynamically generate hard stylized images during training and thus can effectively prevent the model from overfitting on the source domain. Specifically, AdvStyle regards the style feature as a learnable parameter and updates it by adversarial training. The learned adversarial style feature is used to construct an adversarial image for robust model training. AdvStyle is easy to implement and can be readily applied to different models. Experiments on two synthetic-to-real semantic segmentation benchmarks demonstrate that AdvStyle can significantly improve the model performance on unseen real domains and show that we can achieve the state of the art. Moreover, AdvStyle can be employed to domain generalized image classification and produces a clear improvement on the considered datasets.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Neural Information Processing Systems

access here

24/01/2023

Style-Hallucinated Dual Consistency Learning for Domain Generalized Semantic Segmentation

access here

Author(s)

Gim Hee Lee; Na Zhao; Nicu Sebe; Yuyang Zhao; Zhun Zhong

Institution

National University of Singapore; University of Trento;

Abstract

In this paper, we study the task of synthetic-to-real domain generalized semantic segmentation, which aims to learn a model that is robust to unseen real-world scenes using only synthetic data. The large domain shift between synthetic and real-world data, including the limited source environmental variations and the large distribution gap between synthetic and real-world data, significantly hinders the model performance on unseen real-world scenes. In this work, we propose the Style-HAllucinated Dual consistEncy learning (SHADE) framework to handle such domain shift. Specifically, SHADE is constructed based on two consistency constraints, Style Consistency (SC) and Retrospection Consistency (RC). SC enriches the source situations and encourages the model to learn consistent representation across style-diversified samples. RC leverages real-world knowledge to prevent the model from overfitting to synthetic data and thus largely keeps the representation consistent between the synthetic and real-world models. Furthermore, we present a novel style hallucination module (SHM) to generate style-diversified samples that are essential to consistency learning. SHM selects basis styles from the source distribution, enabling the model to dynamically generate diverse and realistic samples during training. Experiments show that our SHADE yields significant improvement and outperforms state-of-the-art methods by 5.05% and 8.35% on the average mIoU of three real-world datasets on single- and multi-source settings, respectively.

Access

Open Access

Type of Publication

Conference paper

Publisher

European Conference on Computer Vision

access here

24/01/2023

Uncertainty-guided Source-free Domain Adaptation

access here

Author(s)

Andrea Pilzer; Arno Solin; Elisa Ricci; Juho Kannala; Martin Trapp; Nicu Sebe; Subhankar Roy;

Institution

Aalto University; Fondazione Bruno Kessler; NVIDIA; University of Trento;

Abstract

Source-free domain adaptation (SFDA) aims to adapt a classifier to an unlabelled target data set by only using a pre-trained source model. However, the absence of the source data and the domain shift makes the predictions on the target data unreliable. We propose quantifying the uncertainty in the source model predictions and utilizing it to guide the target adaptation. For this, we construct a probabilistic source model by incorporating priors on the network parameters inducing a distribution over the model predictions. Uncertainties are estimated by employing a Laplace approximation and incorporated to identify target data points that do not lie in the source manifold and to down-weight them when maximizing the mutual information on the target data. Unlike recent works, our probabilistic treatment is computationally lightweight, decouples source training and target adaptation, and requires no specialized source training or changes of the model architecture. We show the advantages of uncertainty-guided SFDA over traditional SFDA in the closed-set and open-set settings and provide empirical evidence that our approach is more robust to strong domain shifts even without tuning.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

24/01/2023

Batch-efficient EigenDecomposition for Small and Medium Matrices

access here

Author(s)

Nicu Sebe; Wei Wang; Yue Song

Institution

University of Trento;

Abstract

EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications. One crucial bottleneck limiting its usage is the expensive computation cost, particularly for a mini-batch of matrices in the deep neural networks. In this paper, we propose a QR-based ED method dedicated to the application scenarios of computer vision. Our proposed method performs the ED entirely by batched matrix/vector multiplication, which processes all the matrices simultaneously and thus fully utilizes the power of GPUs. Our technique is based on the explicit QR iterations by Givens rotation with double Wilkinson shifts. With several acceleration techniques, the time complexity of QR iterations is reduced from O(n5) to O(n3). The numerical test shows that for small and medium batched matrices (e.g., dim<32) our method can be much faster than the Pytorch SVD function. Experimental results on visual recognition and image generation demonstrate that our methods also achieve competitive performances

Access

Open Access

Type of Publication

Conference paper

Publisher

European Conference on Computer Vision

access here

24/01/2023

Improving Covariance Conditioning of the SVD Meta-layer by Orthogonality

access here

Author(s)

Nicu Sebe; Wei Wang; Yue Song

Institution

University of Trento;

Abstract

Inserting an SVD meta-layer into neural networks is prone to make the covariance ill-conditioned, which could harm the model in the training stability and generalization abilities. In this paper, we systematically study how to improve the covariance conditioning by enforcing orthogonality to the Pre-SVD layer. Existing orthogonal treatments on the weights are first investigated. However, these techniques can improve the conditioning but would hurt the performance. To avoid such a side effect, we propose the Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR). The effectiveness of our methods is validated in two applications: decorrelated Batch Normalization (BN) and Global Covariance Pooling (GCP). Extensive experiments on visual recognition demonstrate that our methods can simultaneously improve the covariance conditioning and generalization. Moreover, the combinations with orthogonal weight can further boost the performances.

Access

Open Access

Type of Publication

Conference paper

Publisher

European Conference on Computer Vision

access here

24/01/2023

Class-incremental Novel Class Discovery

access here

Author(s)

Elisa Ricci; Mingxuan Liu; Nicu Sebe; Subhankar Roy; Zhun Zhong

Institution

Fondazione Bruno Kessler; University of Trento;

Abstract

We study the new task of class-incremental Novel Class Discovery (class-iNCD), which refers to the problem of discovering novel categories in an unlabelled data set by leveraging a pre-trained model that has been trained on a labelled data set containing disjoint yet related categories. Apart from discovering novel classes, we also aim at preserving the ability of the model to recognize previously seen base categories. Inspired by rehearsal-based incremental learning methods, in this paper we propose a novel approach for class-iNCD which prevents forgetting of past information about the base classes by jointly exploiting base class feature prototypes and feature-level knowledge distillation. We also propose a self-training clustering strategy that simultaneously clusters novel categories and trains a joint classifier for both the base and novel classes. This makes our method able to operate in a class-incremental setting. Our experiments, conducted on three common benchmarks, demonstrate that our method significantly outperforms state-of-the-art approaches. Code is available at https://github.com/OatmealLiu/class-iNCD.

Access

Open Access

Type of Publication

Conference paper

Publisher

European Conference on Computer Vision

access here

10/01/2023

Transformer reasoning network for image-text matching and retrieval

access here

Author(s)

Andrea Esuli; Fabrizio Falchi; Giuseppe Amato; Nicola Messina;

Institution

ISTI-CNR;

Abstract

Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by inter-playing image and text features from the two different processing pipelines, usually using mutual attention mechanisms. However, this invalidates any chance to extract separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to separately reason on the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the implemented network is able to produce compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed exploiting caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we are able to achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

09/01/2023

VISIONE at Video Browser Showdown 2023

access here

Author(s)

Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;

Institution

ISTI-CNR;

Abstract

In this paper, we present the fourth release of VISIONE, a tool for fast and effective video search on a large-scale dataset. It includes several search functionalities like text search, object and color-based search, semantic and visual similarity search, and temporal search. VISIONE uses ad-hoc textual encoding for indexing and searching video content, and it exploits a full-text search engine as search backend. In this new version of the system, we introduced some changes both to the current search techniques and to the user interface.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

07/01/2023

People@Places and ToDY: Two Datasets for Scene Classification in Media Production and Archiving

access here

Author(s)

Hannes Fassold; Werner Bailer;

Institution

Joanneum Research;

Abstract

In order to support common annotation tasks in visual media production and archiving, we propose two datasets which cover the annotation of the bustle of a scene (i.e., populated to unpopulated), the cinematographic type of a shot as well as the time of day and season of a shot. The dataset for bustle and shot type, called People@Places, adds annotations to the Places365 dataset, and the ToDY (time of day/year) dataset adds annotations to the SkyFinder dataset. For both datasets, we provide a toolchain to create automatic annotations, which have been manually verified and corrected for parts of the two datasets. We provide baseline results for these tasks using the EfficientNet-B3 model, pretrained on the Places365 dataset.

Access

Open Access

Type of Publication

Conference paper

Publisher

MultiMedia Modeling

access here

05/01/2023

The Florence 4D Facial Expression Dataset

access here

Author(s)

Alberto Del Bimbo; Claudio Ferrari; Daoudi Mohammed Filippo Principi Naima Otberdout; Stefano Berretti

Institution

University of Florence; University of Parma

Abstract

Human facial expressions change dynamically, so their recognition / analysis should be conducted by accounting for the temporal evolution of face deformations either in 2D or 3D. While abundant 2D video data do exist, this is not the case in 3D, where few 3D dynamic (4D) datasets were released for public use. The negative consequence of this scarcity of data is amplified by current deep learning based-methods for facial expression analysis that require large quantities of variegate samples to be effectively trained. With the aim of smoothing such limitations, in this paper we propose a large dataset, named Florence 4D, composed of dynamic sequences of 3D face models, where a combination of synthetic and real identities exhibit an unprecedented variety of 4D facial expressions, with variations that include the classical neutral-apex transition, but generalize to expression-to-expression. All these characteristics are not exposed by any of the existing 4D datasets and they cannot even be obtained by combining more than one dataset. We strongly believe that making such a data corpora publicly available to the community will allow designing and experimenting new applications that were not possible to investigate till now. To show at some extent the difficulty of our data in terms of different identities and varying expressions, we also report a baseline experimentation on the proposed dataset that can be used as baseline.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

02/01/2023

AdvisIL – A Class-Incremental Learning Advisor

access here

Author(s)

Adrian Popescu; Céline Hudelot; Eva Feillet; Grégoire Petit; Marina Reyboz

Institution

Université Grenoble Alpes; Université Gustave Eiffel; Université Paris-Saclay;

Abstract

Recent class-incremental learning methods combine deep neural architectures and learning algorithms to handle streaming data under memory and computational constraints. The performance of existing methods varies depending on the characteristics of the incremental process. To date, there is no other approach than to test all pairs of learning algorithms and neural architectures on the training data available at the start of the learning process to select a suited algorithm-architecture combination. To tackle this problem, in this article, we introduce AdvisIL, a method which takes as input the main characteristics of the incremental process (memory budget for the deep model, initial number of classes, size of incremental steps) and recommends an adapted pair of learning algorithm and neural architecture. The recommendation is based on a similarity between the user-provided settings and a large set of pre-computed experiments. AdvisIL makes class-incremental learning easier, since users do not need to run cumbersome experiments to design their system. We evaluate our method on four datasets under six incremental settings and three deep model sizes. We compare six algorithms and three deep neural architectures. Results show that AdvisIL has better overall performance than any of the individual combinations of a learning algorithm and a neural architecture. AdvisIL’s code is available at https://github.com/EvaJF/AdvisIL.

Access

Open Access

Type of Publication

Conference paper

Publisher

Winter Conference on Applications of Computer Vision

access here

01/01/2023

Local and Global GANs with Semantic-Aware Upsampling for Image Generation

access here

Author(s)

Hao Tang; Ling Shao; Nicu Sebe; Philip Torr;

Institution

ETH Zurich; University of Oxford; University of Trento;

Abstract

In this paper, we address the task of semantic-guided image generation. One challenge common to most existing image-level generation methods is the difficulty in generating small objects and detailed local textures. To address this, in this work we consider generating images using local context. As such, we design a local class-specific generative network using semantic maps as guidance, which separately constructs and learns subgenerators for different classes, enabling it to capture finer details. To learn more discriminative class-specific feature representations for the local generation, we also propose a novel classification module. To combine the advantages of both global image-level and local class-specific generation, a joint generation network is designed with an attention fusion module and a dual-discriminator structure embedded. Lastly, we propose a novel semantic-aware upsampling method, which has a larger receptive field and can take far-away pixels that are semantically related for feature upsampling, enabling it to better preserve semantic consistency for instances with the same semantic labels. Extensive experiments on two image generation tasks show the superior performance of the proposed method. State-of-the-art results are established by large margins on both tasks and on nine challenging public benchmarks. The source code and trained models are available at https://github.com/Ha0Tang/LGGAN.

Access

Closed Access

Type of Publication

Journal article

Publisher

IEEE Transaction on Pattern Analysis and Machine Intelligence

access here

14/12/2022

Environment Classification via Blind Roomprints Estimation

access here

Author(s)

Artem Yaroshchuk; Luca Cuccovillo; Malte Baum; Patrick Aichroth;

Institution

Fraunhofer IDMT;

Abstract

In this paper we present a novel approach for environment classification for speech recordings, which does not require the selection of decaying reverberation tails. It is based on a multi-band RT60 analysis of blind channel estimates and achieves an accuracy of up to 93.8% on test recordings derived from the ACE corpus.

Access

Open Access

Type of Publication

Conference paper

Publisher

Transactions on Information Forensics and Security

access here

09/12/2022

Concept Embedding Models: Beyond the Accuracy-Explainability Trade-Off

access here

Author(s)

Adrian Weller Francesco Giannini Frédéric Precioso Gabriele Ciravegna Giuseppe Marra Mateja Jamnik Mateo Espinosa Zarlenga Michelangelo Diligenti Pietro Barbiero Pietro Lio Stefano Melacci Zohreh Shams

Institution

KU Leuven Université Côte d'Azur; University of Siena

Abstract

Deploying AI-powered systems requires trustworthy models supporting effective human interactions, going beyond raw prediction accuracy. Concept bottleneck models promote trustworthiness by conditioning classification tasks on an intermediate level of human-like concepts. This enables human interventions which can correct mispredicted concepts to improve the model’s performance. However, existing concept bottleneck models are unable to find optimal compromises between high task accuracy, robust concept-based explanations, and effective interventions on concepts—particularly in real-world conditions where complete and accurate concept supervisions are scarce. To address this, we propose Concept Embedding Models, a novel family of concept bottleneck models which goes beyond the current accuracy-vs-interpretability trade-off by learning interpretable highdimensional concept representations. Our experiments demonstrate that Concept Embedding Models (1) attain better or competitive task accuracy w.r.t. standard neural models without concepts, (2) provide concept representations capturing meaningful semantics including and beyond their ground truth labels, (3) support test-time concept interventions whose effect in test accuracy surpasses that in standard concept bottleneck models, and (4) scale to real-world conditions where complete concept supervisions are scarce.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Neural Information Processing Systems

access here

06/12/2022

Explaining video summarization based on the focus of attention

access here

Author(s)

Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris

Institution

CERTH - Center for Research and Technology Hellas Queen Mary University of London;

Abstract

In this paper we propose a method for explaining video summarization. We start by formulating the problem as the creation of an explanation mask which indicates the parts of the video that influenced the most the estimates of a video summarization network, about the frames’ importance. Then, we explain how the typical analysis pipeline of attention-based networks for video summarization can be used to define explanation signals, and we examine various attention-based signals that have been studied as explanations in the NLP domain. We evaluate the performance of these signals by investigating the video summarization network’s input-output relationship according to different replacement functions, and utilizing measures that quantify the capability of explanations to spot the most and least influential parts of a video. We run experiments using an attention-based network (CA-SUM) and two datasets (SumMe and TVSum) for video summarization. Our evaluations indicate the advanced performance of explanations formed using the inherent attention weights, and demonstrate the ability of our method to explain the video summarization results using clues about the focus of the attention mechanism.

Access

Open Access

Type of Publication

Conference paper

Publisher

IEEE International Symposium on Multimedia

access here

03/12/2022

A leap among quantum computing and quantum neural networks: a survey

access here

Author(s)

Fabio Valerio Massoli; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo;

Institution

ISTI-CNR;

Abstract

In recent years, Quantum Computing witnessed massive improvements in terms of available resources and algorithms development. The ability to harness quantum phenomena to solve computational problems is a long-standing dream that has drawn the scientific community’s interest since the late 80s. In such a context, we propose our contribution. First, we introduce basic concepts related to quantum computations, and then we explain the core functionalities of technologies that implement the Gate Model and Adiabatic Quantum Computing paradigms. Finally, we gather, compare and analyze the current state-of-the-art concerning Quantum Perceptrons and Quantum Neural Networks implementations.

Access

Open Access

Type of Publication

Journal article

Publisher

ACM Computing Surveys

access here

29/11/2022

Adaptive Soft Contrastive Learning

access here

Author(s)

Chen Feng; Ioannis Patras;

Institution

Queen Mary University of London;

Abstract

Self-supervised learning has recently achieved great success in representation learning without human annotations. The dominant method – that is contrastive learning, is generally based on instance discrimination tasks, i.e., individual samples are treated as independent categories. However, presuming all the samples are different contradicts the natural grouping of similar samples in common visual datasets, e.g., multiple views of the same dog. To bridge the gap, this paper proposes an adaptive method that introduces soft inter-sample relations, namely Adaptive Soft Contrastive Learning (ASCL). More specifically, ASCL transforms the original instance discrimination task into a multi-instance soft discrimination task, and adaptively introduces inter-sample relations. As an effective and concise plug-in module for existing self-supervised learning frameworks, ASCL achieves the best performance on several benchmarks in terms of both performance and efficiency. Code is available at https://github.com/MrChenFeng/ASCL_ICPR2022.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

26/11/2022

Improving the Adversarial Robustness of Neural ODE Image Classifiers by Tuning the Tolerance Parameter

access here

Author(s)

Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Roberto Cardelli;

Institution

ISTI-CNR; Mercatorum University; National Inter-University Consortium for Telecommunications;

Abstract

The adoption of deep learning-based solutions practically pervades all the diverse areas of our everyday life, showing improved performances with respect to other classical systems. Since many applications deal with sensible data and procedures, a strong demand to know the actual reliability of such technologies is always present. This work analyzes the robustness characteristics of a specific kind of deep neural network, the neural ordinary differential equations (N-ODE) network. They seem very interesting for their effectiveness and a peculiar property based on a test-time tunable parameter that permits obtaining a trade-off between accuracy and efficiency. In addition, adjusting such a tolerance parameter grants robustness against adversarial attacks. Notably, it is worth highlighting how decoupling the values of such a tolerance between training and test time can strongly reduce the attack success rate. On this basis, we show how such tolerance can be adopted, during the prediction phase, to improve the robustness of N-ODE to adversarial attacks. In particular, we demonstrate how we can exploit this property to construct an effective detection strategy and increase the chances of identifying adversarial examples in a non-zero knowledge attack scenario. Our experimental evaluation involved two standard image classification benchmarks. This showed that the proposed detection technique provides high rejection of adversarial examples while maintaining most of the pristine samples.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

21/11/2022

SSR: An Efficient and Robust Framework for Learning with Unknown Label Noise

access here

Author(s)

Chen Feng; Georgios Tzimiropoulos; Ioannis Patras;

Institution

Queen Mary University of London;

Abstract

Despite the large progress in supervised learning with neural networks, there are significant challenges in obtaining high-quality, large-scale and accurately labelled datasets. In such a context, how to learn in the presence of noisy labels has received more and more attention. As a relatively complex problem, in order to achieve good results, current approaches often integrate components from several fields, such as supervised learning, semi-supervised learning, transfer learning and resulting in complicated methods. Furthermore, they often make multiple assumptions about the type of noise of the data. This affects the model robustness and limits its performance under different noise conditions. In this paper, we consider a novel problem setting, Learning with Unknown Label Noise}(LULN), that is, learning when both the degree and the type of noise are unknown. Under this setting, unlike previous methods that often introduce multiple assumptions and lead to complex solutions, we propose a simple, efficient and robust framework named Sample Selection and Relabelling(SSR), that with a minimal number of hyperparameters achieves SOTA results in various conditions. At the heart of our method is a sample selection and relabelling mechanism based on a non-parametric KNN classifier~(NPK) $g_q$ and a parametric model classifier~(PMC) $g_p$, respectively, to select the clean samples and gradually relabel the noisy samples. Without bells and whistles, such as model co-training, self-supervised pre-training and semi-supervised learning, and with robustness concerning the settings of its few hyper-parameters, our method significantly surpasses previous methods on both CIFAR10/CIFAR100 with synthetic noise and real-world noisy datasets such as WebVision, Clothing1M and ANIMAL-10N. Code is available at https://github.com/MrChenFeng/SSR_BMVC2022.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

15/11/2022

CL2R: Compatible Lifelong Learning Representations

access here

Author(s)

Alberto Del Bimbo; Daniele Mugnai; Federico Pernici; Matteo Bruni; Niccolò Biondi

Institution

University of Florence;

Abstract

In this article, we propose a method to partially mimic natural intelligence for the problem of lifelong learning representations that are compatible. We take the perspective of a learning agent that is interested in recognizing object instances in an open dynamic universe in a way in which any update to its internal feature representation does not render the features in the gallery unusable for visual search. We refer to this learning problem as Compatible Lifelong Learning Representations (CL2 R), as it considers compatible representation learning within the lifelong learning paradigm. We identify stationarity as the property that the feature representation is required to hold to achieve compatibility and propose a novel training procedure that encourages local and global stationarity on the learned representation. Due to stationarity, the statistical properties of the learned features do not change over time, making them interoperable with previously learned features. Extensive experiments on standard benchmark datasets show that our CL2 R training procedure outperforms alternative baselines and state-of-the-art methods. We also provide novel metrics to specifically evaluate compatible representation learning under catastrophic forgetting in various sequential learning tasks. Code is available at https://github.com/NiccoBiondi/CompatibleLifelongRepresentation.

Access

Open Access

Type of Publication

Journal article

Publisher

ACM Multimedia Systems Conference

access here

10/11/2022

AutoGF: Runtime Graph Filter Tuning for Community Node Ranking

access here

Author(s)

Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas

Abstract

A recurring graph analysis task is to rank nodes based on their relevance to overlapping communities of shared metadata attributes (e.g. the interests of social network users). To achieve this, approaches often start with a few example community members and employ graph filters that rank nodes based on their structural proximity to the examples. Choosing between well-known filters typically involves experiments on existing graphs, but their efficacy is known to depend on the structural relations between community members. Therefore, we argue that employed filters should be determined not during algorithm design but at runtime, upon receiving specific graphs and example nodes to process. To do this, we split example nodes into training and validation sets and either perform supervised selection between well-known filters, or account for granular graph dynamics by tuning parameters of the generalized graph filter form with a novel optimization algorithm. Experiments on 27 community node ranking tasks across three real-world networks of various sizes reveal that runtime algorithm selection selects near-best AUC and NDCG among a list of 8 popular alternatives, and that parameter tuning yields similar or improved results in all cases.

Access

Open Access

Type of Publication

Paper

Publisher

N/A

access here

10/11/2022

AutoGF: Runtime Graph Filter Tuning for Community Node Ranking (preprint of accepted manuscript)

access here

Author(s)

Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

03/11/2022

CL2R: Compatible Lifelong Learning Representations

access here

Author(s)

Alberto Del Bimbo; Daniele Mugnai; Federico Pernici; Matteo Bruni; Niccolò Biondi

Institution

University of Florence;

Abstract

Access

Open Access

Type of Publication

Paper

Publisher

N/A

access here

31/10/2022

Bus Violence: An Open Benchmark for Video Violence Detection on Public Transport

access here

Author(s)

Institution

Italy National Research Council; Silesian University of Technology University of Pisa

Abstract

Access

Open Access

Type of Publication

Journal article

Publisher

Sensors

access here

22/10/2022

When & How to Transfer with Transfer Learning

access here

Author(s)

Adrian Tormos; Dario Garcia-Gasulla; Sergio Alvarez-Napagao; Victor Gimenez-Abalos;

Institution

Barcelona Supercomputing Center;

Abstract

In deep learning, transfer learning (TL) has become the de facto approach when dealing with image related tasks. Visual features learnt for one task have been shown to be reusable for other tasks, improving performance significantly. By reusing deep representations, TL enables the use of deep models in domains with limited data availability, limited computational resources and/or limited access to human experts. Domains which include the vast majority of real-life applications. This paper conducts an experimental evaluation of TL, exploring its trade-offs with respect to performance, environmental footprint, human hours and computational requirements. Results highlight the cases were a cheap feature extraction approach is preferable, and the situations where an expensive fine-tuning effort may be worth the added cost. Finally, a set of guidelines on the use of TL are proposed.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

20/10/2022

pygrank: A Python Package for Graph Node Ranking

access here

Author(s)

Andreas Symeonidis; Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos

Institution

Aristotle University of Thessaloniki; CERTH - Center for Research and Technology Hellas

Abstract

We introduce pygrank, an open source Python package to define, run and evaluate node ranking algorithms. We provide object-oriented and extensively unit-tested algorithmic components, such as graph filters, post-processors, measures, benchmarks, and online tuning. Computations can be delegated to numpy, tensorflow, or pytorch backends and fit in back-propagation pipelines. Classes can be combined to define interoperable complex algorithms. Within the context of this paper, we compare the package with related alternatives, describe its architecture, demonstrate its flexibility and ease of use with code examples, and discuss its impact.

Access

Open Access

Type of Publication

Journal article

Publisher

SoftwareX

access here

18/10/2022

Tuning Neural ODE Networks to Increase Adversarial Robustness in Image Forensics

access here

Author(s)

Fabio Carrara; Fabrizio Falchi; Roberto Caldelli

Institution

ISTI-CNR;

Abstract

Although deep-learning-based solutions are pervading different application sectors, many doubts have arisen about their reliability and, above all, their security against threats that can mislead their decision mechanisms. In this work, we considered a particular kind of deep neural network, the Neural Ordinary Differential Equations (N-ODE) networks, which have shown intrinsic robustness against adversarial samples by properly tuning their tolerance parameter at test time. Their behaviour has never been investigated in image forensics tasks such as distinguishing between an original and an altered image. Following this direction, we demonstrate how tuning the tolerance parameter during the prediction phase can control and increase N-ODE’s robustness versus adversarial attacks. We performed experiments on basic image transformations used to generate tampered data, providing encouraging results in terms of adversarial rejection and preservation of the correct classification of pristine images.

Access

Open Access

Type of Publication

Publication

Publisher

IEEE International Conference on Image Processing

access here

16/10/2022

On The Link Between Emotion, Attention And Content In Virtual Immersive Environments

access here

Author(s)

Aldric Ducreux; Auriane Gros; Camille Bauce; Florent Robert; Hui-Yin Wu; Lucile Sassatelli; Marco Wincler; Quentin Guimard

Institution

Institut Universitaire de France; Université Côte d'Azur;

Abstract

While immersive media have been shown to generate more intense emotions, saliency information has been shown to be a key component for the assessment of their quality, owing to the various portions of the sphere (viewports) a user can attend. In this article, we investigate the tri-partite connection between user attention, user emotion and visual content in immersive environments. To do so, we present a new dataset enabling the analysis of different types of saliency, both low-level and high-level, in connection with the user’s state in 360° videos. Head and gaze movements are recorded along with self-reports and continuous physiological measurements of emotions. We then study how the accuracy of saliency estimators in predicting user attention depends on user-reported and physiologically-sensed emotional perceptions. Our results show that high-level saliency better predicts user attention for higher levels of arousal. We discuss how this work serves as a first step to understand and predict user attention and intents in immersive interactive environments.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

07/10/2022

Deep Features for CBIR with Scarce Data using Hebbian Learning

access here

Author(s)

Claudio Gallicchio; Claudio Gennaro; Davide Bacciu; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;

Institution

ISTI-CNR; University of Pisa

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

07/10/2022

ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

access here

Author(s)

Fabrizio Falchi; Giuseppe Amato; Lorenzo Baraldi; Marcella Cornia; Matteo Stefanini; Nicola Messina; Rita Cucchiara;

Institution

ISTI-CNR; University of Modena and Reggio Emilia;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Content-based Multimedia Indexing

access here

05/10/2022

FastHebb: Scaling Hebbian Training of Deep Neural Networks to ImageNet Level

access here

Author(s)

Claudio Gennaro; Gabriele Lagani; Giuseppe Amato; Hannes Fassold;

Institution

ISTI-CNR; Joanneum Research; University of Pisa

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

01/10/2022

Learning from Label Relationships in Human Affect

access here

Author(s)

Ioannis Patras; Niki Maria Foteinopoulou

Institution

Queen Mary University of London;

Abstract

Human affect and mental state estimation in an automated manner, face a number of difficulties, including learning from labels with poor or no temporal resolution, learning from few datasets with little data (often due to confidentiality constraints) and, (very) long, in-the-wild videos. For these reasons, deep learning methodologies tend to overfit, that is, arrive at latent representations with poor generalisation performance on the final regression task. To overcome this, in this work, we introduce two complementary contributions. First, we introduce a novel relational loss for multilabel regression and ordinal problems that regularises learning and leads to better generalisation. The proposed loss uses label vector inter-relational information to learn better latent representations by aligning batch label distances to the distances in the latent feature space. Second, we utilise a two-stage attention architecture that estimates a target for each clip by using features from the neighbouring clips as temporal context. We evaluate the proposed methodology on both continuous affect and schizophrenia severity estimation problems, as there are methodological and contextual parallels between the two. Experimental results demonstrate that the proposed methodology outperforms the baselines that are trained using the supervised regression loss, as well as pre-training the network architecture with an unsupervised contrastive loss. In the domain of schizophrenia, the proposed methodology outperforms previous state-of-the-art by a large margin, achieving a PCC of up to 78% performance close to that of human experts (85%) and much higher than previous works (uplift of up to 40%). In the case of affect recognition, we outperform previous vision-based methods in terms of CCC on both the OMG and the AMIGOS datasets. Specifically for AMIGOS, we outperform previous SoTA CCC for both arousal and valence by 9% and 13% respectively, and in the OMG dataset we outperform previous vision works by up to 5% for both arousal and valence.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

28/09/2022

Approximate Nearest Neighbor Search on Standard Search Engines

access here

Author(s)

Claudio Gennaro; Fabio Carrara; Giuseppe Amato; Lucia Vadicamo;

Institution

ISTI-CNR;

Abstract

Approximate search for high-dimensional vectors is commonly addressed using dedicated techniques often combined with hardware acceleration provided by GPUs, FPGAs, and other custom in-memory silicon. Despite their effectiveness, harmonizing those optimized solutions with other types of searches often poses technological difficulties. For example, to implement a combined text+image multimodal search, we are forced first to query the index of high-dimensional image descriptors and then filter the results based on the textual query or vice versa. This paper proposes a text surrogate technique to translate real-valued vectors into text and index them with a standard textual search engine such as Elasticsearch or Apache Lucene. This technique allows us to perform approximate kNN searches of high-dimensional vectors alongside classical full-text searches natively on a single textual search engine, enabling multimedia queries without sacrificing scalability. Our proposal exploits a combination of vector quantization and scalar quantization. We compared our approach to the existing literature in this field of research, demonstrating a significant improvement in performance through preliminary experimentation.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

27/09/2022

StyleMask: Disentangling the Style Space of StyleGAN2 for Neural Face Reenactment

access here

Author(s)

Christos Tzelepis; Georgios Tzimiropoulos; Ioannis Patras; Stella Bounareli; Vasileios Argyriou;

Institution

Kingston University London; Queen Mary University of London;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

26/09/2022

3D-Aware Semantic-Guided Generative Model for Human Synthesis

access here

Author(s)

Aliaksandr Siarohin; Enver Sangineto; Hao Tang; Jichao Zhang; Nicu Sebe; Wei Wang; Zhun Zhong

Institution

ETH Zurich; Snap Research; University of Modena and Reggio Emilia; University of Trento;

Abstract

Generative Neural Radiance Field (GNeRF) models, which extract implicit 3D representations from 2D images, have recently been shown to produce realistic images representing rigid/semi-rigid objects, such as human faces or cars. However, they usually struggle to generate high-quality images representing non-rigid objects, such as the human body, which is of a great interest for many computer graphics applications. This paper proposes a 3D-aware Semantic-Guided Generative Model (3D-SGAN) for human image synthesis, which combines a GNeRF with a texture generator. The former learns an implicit 3D representation of the human body and outputs a set of 2D semantic segmentation masks. The latter transforms these semantic masks into a real image, adding a realistic texture to the human appearance. Without requiring additional 3D information, our model can learn 3D human representations with a photo-realistic, controllable generation. Our experiments on the DeepFashion dataset show that 3D-SGAN significantly outperforms the most recent baselines. The code is available at https://github.com/zhangqianhui/3DSGAN.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

23/09/2022

Proceedings of the 2nd International Workshop on Learning to Quantify (LQ 2022)

access here

Author(s)

Alejandro Moreo; Fabrizio Sebastiani; Juan José del Coz;

Institution

ISTI-CNR; University of Oviedo

Abstract

The 2nd International Workshop on Learning to Quantify (LQ 2022 – https://lq-2022.github.io/) was held in Grenoble, FR, on September 23, 2022, as a satellite workshop of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2022). While the 1st edition of the workshop (LQ 2021 – https://cikmlq2021. github.io/, which was instead co-located with the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021)) had to be an entirely online event, LQ 2022 was a hybrid event, with presentations given in-presence and both in-presence attendees and remote attendees.

Access

Open Access

Type of Publication

Book

Publisher

N/A

access here

21/09/2022

Low-budget label query through domain alignment enforcement

access here

Author(s)

Cristiano Saltori; Jurandy Almeida; Nicu Sebe; Paolo Rota;

Institution

Universidade Federal de São Paulo; University of Trento;

Abstract

Deep learning revolution happened thanks to the availability of a massive amount of labeled data which contributed to the development of models with extraordinary inference capabilities. Despite the public availability of large-scale datasets, to address specific requirements it is often necessary to generate a new set of labeled data whose production is often costly and require specific know-how to be fulfilled. In this work, we propose the new problem of low-budget label query, which aims at maximizing the classification performance by selecting a convenient and small set of samples (i.e., low budget) to be manually labeled from an arbitrary big set of unlabeled data. While a first solution might be the use of pre-trained models with standard selection metrics, i.e. confidence and entropy, we argue that domain shift affects their reliability. We deem that Unsupervised Domain Adaptation (UDA) can be used to reduce domain shift, making selection metrics more reliable and less noisy. Therefore, we first improve an UDA method to better align source and target domains using consistency constraints, reaching comparable performance with the state of-the-art on several UDA tasks. After adaptation, we conduct an extensive experimental study with commonly used confidence metrics and sampling strategies to achieve low-budget label query on a large variety of publicly available datasets and under different setups.

Access

Closed Access

Type of Publication

Journal article

Publisher

Computer Vision and Image Understanding

access here

19/09/2022

Ordinal Quantification through Regularization

access here

Author(s)

Alejandro Moreo; Fabrizio Sebastiani; Martin Senz; Mirko Bunse;

Institution

ISTI-CNR;

Abstract

Quantification,i.e.,thetaskoftrainingpredictorsoftheclass prevalence values in sets of unlabelled data items, has received increased attention in recent years. However, most quantification research has con- centrated on developing algorithms for binary and multiclass problems in which the classes are not ordered. We here study the ordinal case, i.e., the case in which a total order is defined on the set of n > 2 classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms that are proposed by authors from very different research fields, who were unaware of each other’s developments. Third, we propose three OQ algorithms, based on the idea of preventing ordinally implausible estimates through regu- larization. Our experiments show that these algorithms outperform the existing ones if the ordinal plausibility assumption holds.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

18/09/2022

Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification

access here

Author(s)

Alejandro Moreo; Andrea Pedrotti; Fabrizio Sebastiani;

Institution

ISTI-CNR; University of Pisa

Abstract

Funnelling (Fun) is a recently proposed method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different and language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a meta-classifier that uses this vector as its input. The meta-classifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLTC systems in which these correlations cannot be brought to bear. In this paper we describe Generalized Funnelling (gFun), a generalisation of Fun consisting of an HTL architecture in which 1st-tier components can be arbitrary view-generating functions, i.e., language-dependent functions that each produce a language-independent representation (“view”) of the (monolingual) document. We describe an instance of gFun in which the meta-classifier receives as input a vector of calibrated posterior probabilities (as in Fun) aggregated to other embedded representations that embody other types of correlations, such as word-class correlations (as encoded by Word-Class Embeddings), word-word correlations (as encoded by Multilingual Unsupervised or Supervised Embeddings), and word-context correlations (as encoded by multilingual BERT ). We show that this instance of gFun substantially improves over Fun and over state-of-the-art baselines, by reporting experimental results obtained on two large, standard datasets for multilingual multilabel text classification. Our code that implements gFun is publicly available.

Access

Open Access

Type of Publication

Journal article

Publisher

ACM Transactions on Information Systems

access here

18/09/2022

A Concise Overview of LeQua@CLEF 2022: Learning to Quantify

access here

Author(s)

Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani; Gianluca Sperduti

Institution

ISTI-CNR;

Abstract

LeQua 2022 is a new lab for the evaluation of methods for “learning to quantify” in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest Y = {y1,…,yn} in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents via a text classifier and then counting the numbers of documents assigned to the classes, a growing body of literature has shown this approach to be suboptimal, and has proposed better methods. The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting; this is the first time that an evaluation exercise solely dedicated to quantification is organized. For both the binary setting and the single-label multiclass setting, data were provided to participants both in ready-made vector form and in raw document form. In this overview article we describe the structure of the lab, we report the results obtained by the participants in the four proposed tasks and subtasks, and we comment on the lessons that can be learned from these results.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference and Labs of the Evaluation Forum

access here

14/09/2022

AI and computer vision for smart cities

access here

Author(s)

Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Luca Ciampi; Marco Di Benedetto Nicola Messina;

Institution

ISTI-CNR;

Abstract

Artificial Intelligence (AI) is increasingly employed to develop public services that make life easier for citizens. In this abstract, we present some research topics and applications carried out by the Artificial Intelligence for Media and Humanities (AIMH) laboratory of the ISTI-CNR of Pisa about the study and development of AI-based services for Smart Cities dedicated to the interaction with the physical world through the analysis of images gathered from city cameras. Like no other sensing mechanism, networks of city cameras can ‘observe’ the world and simultaneously provide visual data to AI systems to extract relevant information and make/suggest decisions helping to solve many real-world problems. Specifically, we discuss some solutions in the context of smart mobility, parking monitoring, infrastructure management, and surveillance systems.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

07/09/2022

Open-Ended Evolution for Minecraft Building Generation

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Matthew Barthet

Institution

University of Malta

Abstract

This paper proposes a procedural content generator which evolves Minecraft buildings according to an open-ended and intrinsic definition of novelty. To realize this goal we evaluate individuals’ novelty in the latent space using a 3D autoencoder, and alternate between phases of exploration and transformation. During exploration the system evolves multiple populations of CPPNs through CPPN-NEAT and constrained novelty search in the latent space (defined by the current autoencoder). We apply a set of repair and constraint functions to ensure candidates adhere to basic structural rules and constraints during evolution. During transformation, we reshape the boundaries of the latent space to identify new interesting areas of the solution space by retraining the autoencoder with novel content. In this study we evaluate five different approaches for training the autoencoder during transformation and its impact on populations’ quality and diversity during evolution. Our results show that by retraining the autoencoder we can achieve better open-ended complexity compared to a static model, which is further improved when retraining using larger datasets of individuals with diverse complexities.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Games

access here

06/09/2022

Differentiable Piano Model for MIDI-to-Audio Performance Synthesis

access here

Author(s)

Axel Roebel; Lenny Renault; Rémi Mignot;

Institution

Sorbonne Université

Abstract

Recent neural-based synthesis models have achieved impressive results for musical instrument sound generation. In particular, the Differentiable Digital Signal Processing (DDSP) framework enables the usage of spectral modeling analysis and synthesis techniques in fully differentiable architectures. Yet currently, it has only been used for modeling monophonic instruments. Leveraging the interpretability and modularity of this framework, the present work introduces a polyphonic differentiable model for piano sound synthesis, conditioned on Musical Instrument Digital Interface (MIDI) inputs. The model architecture is motivated by high-level acoustic modeling knowledge of the instrument which, in tandem with the sound structure priors inherent to the DDSP components, makes for a lightweight, interpretable and realistic sounding piano model. The proposed model has been evaluated in a listening test, demonstrating improved sound quality compared to a benchmark neural-based piano model, with significantly less parameters and even with reduced training data. The same listening test indicates that physical-modeling-based models still achieve better quality, but the differentiability of our lightened approach encourages its usage in other musical tasks dealing with polyphonic audio and symbolic data.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Digital Audio Effects

access here

05/09/2022

A survey on bias in visual datasets

access here

Author(s)

Eirini Ntoutsi; Ioannis Kompatsiaris; Simone Fabbrizzi; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas Freie Universität; Leibniz Universität:

Abstract

Computer Vision (CV) has achieved remarkable results, outperforming humans in several tasks. Nonetheless, it may result in significant discrimination if not handled properly as CV systems highly depend on training datasets and can learn and amplify biases that such datasets may carry. Thus, the problem of understanding and discovering bias in visual datasets is of utmost importance; yet, it has not been studied in a systematic way to date. Hence, this work aims to: (i) describe the different kinds of bias that may manifest in visual datasets; (ii) review the literature on methods for bias discovery and quantification in visual datasets; (iii) discuss existing attempts to collect visual datasets in a bias-aware manner. A key conclusion of our study is that the problem of bias discovery and quantification in visual datasets is still open, and there is room for improvement in terms of both methods and the range of biases that can be addressed. Moreover, there is no such thing as a bias-free dataset, so scientists and practitioners must become aware of the biases in their datasets and make them explicit. To this end, we propose a checklist to spot different types of bias during visual dataset collection.

Access

Open Access

Type of Publication

Journal article

Publisher

access here

05/09/2022

A Detailed Overview of LeQua@CLEF 2022: Learning to Quantify

access here

Author(s)

Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani; Gianluca Sperduti

Institution

ISTI-CNR;

Abstract

LeQua 2022 is a new lab for the evaluation of methods for “learning to quantify” in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest 𝒴 = {𝑦1 , …, 𝑦𝑛 } in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents via a text classifier and then counting the numbers of documents assigned to the classes, a growing body of literature has shown this approach to be suboptimal, and has proposed better methods. The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting; this is the first time that an evaluation exercise solely dedicated to quantification is organized. For both the binary setting and the single-label multiclass setting, data were provided to participants both in ready-made vector form and in raw document form. In this overview article we describe the structure of the lab, we report the results obtained by the participants in the four proposed tasks and subtasks, and we comment on the lessons that can be learned from these results.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference and Labs of the Evaluation Forum

access here

01/09/2022

Deep Networks for Behavioral Variant Frontotemporal Dementia Identification from Multiple Acquisition Sources

access here

Author(s)

Benedetta Tafuri; Claudio Gennaro; Fabio Carrara; Fabrizio Falchi; Giancarlo Logroscino; Giuseppe Amato; Giuseppe Gigli; Marco Di Benedetto Roberto De Blasi; Salvatore Nigro;

Institution

ISTI-CNR; University of Bari; University of Salento;

Abstract

Behavioral variant frontotemporal dementia (bvFTD) is a neurodegenerative syndrome whose clinical diagnosis remains a challenging task especially in the early stage of the disease. Currently, the presence of frontal and anterior temporal lobe atrophies on magnetic resonance imaging (MRI) is part of the diagnostic criteria for bvFTD. However, MRI data processing is usually dependent on the acquisition device and mostly require human-assisted crafting of feature extraction.
Following the impressive improvements of deep architectures, in this study we report on bvFTD identification using various classes of artificial neural networks, and present the results we achieved on classification accuracy and obliviousness on acquisition devices using extensive hyperparameter search.
In particular, we will demonstrate the stability and generalization of different deep networks based on the attention mechanism, where data intra-mixing confers models the ability to identify the disorder even on MRI data in inter-device settings, i.e., on data produced by different acquisition devices and without model fine tuning, as shown from the very encouraging performance evaluations that dramatically reach and overcome the 90% value on the AuROC and balanced accuracy metrics.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

29/08/2022

Speaker-Independent Microphone Identification in Noisy Conditions

access here

Author(s)

Antonio Giganti; Luca Cuccovillo; Paolo Bestagini; Patrick Aichroth; Stefano Tubaro;

Institution

Fraunhofer IDMT; Politecnico di Milano;

Abstract

This work proposes a method for source device identification from speech recordings that applies neural-network-based denoising, to mitigate the impact of counter-forensics attacks using noise injection. The method is evaluated by comparing the impact of denoising on three state-of-the-art features for microphone classification, determining their discriminating power with and without denoising being applied. The proposed framework achieves a significant performance increase for noisy material, and more generally, validates the usefulness of applying denoising prior to device identification for noisy recordings.

Access

Open Access

Type of Publication

Conference paper

Publisher

European Signal Processing Conference

access here

26/08/2022

Play with Emotion: Affect-Driven Reinforcement Learning

access here

Author(s)

Ahmed Khalifa; Antonios Liapis; Georgios N. Yannakakis; Matthew Barthet

Institution

University of Malta

Abstract

This paper introduces a paradigm shift by viewing the task of affect modeling as a reinforcement learning (RL) process. According to the proposed paradigm, RL agents learn a policy (i.e. affective interaction) by attempting to maximize a set of rewards (i.e. behavioral and affective patterns) via their experience with their environment (i.e. context). Our hypothesis is that RL is an effective paradigm for interweaving affect elicitation and manifestation with behavioral and affective demonstrations. Importantly, our second hypothesis-building on Damasio’s so-matic marker hypothesis-is that emotion can be the facilitator of decision-making. We test our hypotheses in a racing game by training Go-Blend agents to model human demonstrations of arousal and behavior; Go-Blend is a modified version of the Go-Explore algorithm which has recently showcased supreme performance in hard exploration tasks. We first vary the arousal-based reward function and observe agents that can effectively display a palette of affect and behavioral patterns according to the specified reward. Then we use arousal-based state selection mechanisms in order to bias the strategies that Go-Blend explores. Our findings suggest that Go-Blend not only is an efficient affect modeling paradigm but, more importantly, affect-driven RL improves exploration and yields higher performing agents, validating Damasio’s hypothesis in the domain of games.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Affective Computing and Intelligent Interaction Workshops

access here

26/08/2022

Generative Personas That Behave and Experience Like Humans

access here

Author(s)

Ahmed Khalifa; Antonios Liapis; Georgios N. Yannakakis; Matthew Barthet

Institution

University of Malta

Abstract

Using artificial intelligence (AI) to automatically test a game remains a critical challenge for the development of richer and more complex game worlds and for the advancement of AI at large. One of the most promising methods for achieving that long-standing goal is the use of generative AI agents, namely procedural personas, that attempt to imitate particular playing behaviors which are represented as rules, rewards, or human demonstrations. All research efforts for building those generative agents, however, have focused solely on playing behavior which is arguably a narrow perspective of what a player actually does in a game. Motivated by this gap in the existing state of the art, in this paper we extend the notion of behavioral procedural personas to cater for player experience, thus examining generative agents that can both behave and experience their game as humans would. For that purpose, we employ the Go-Explore reinforcement learning paradigm for training human-like procedural personas, and we test our method on behavior and experience demonstrations of more than 100 players of a racing game. Our findings suggest that the generated agents exhibit distinctive play styles and experience responses of the human personas they were designed to imitate. Importantly, it also appears that experience, which is tied to playing behavior, can be a highly informative driver for better behavioral exploration.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on the Foundations of Digital Games

access here

25/08/2022

Supervised Contrastive Learning for Affect Modelling

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas;

Institution

University of Malta

Abstract

Affect modeling is viewed, traditionally, as the process of mapping measurable affect manifestations from multiple modalities of user input to affect labels. That mapping is usually inferred through end-to-end (manifestation-to-affect) machine learning processes. What if, instead, one trains general, subject-invariant representations that consider affect information and then uses such representations to model affect? In this paper we assume that affect labels form an integral part, and not just the training signal, of an affect representation and we explore how the recent paradigm of contrastive learning can be employed to discover general high-level affect-infused representations for the purpose of modeling affect. We introduce three different supervised contrastive learning approaches for training representations that consider affect information. In this initial study we test the proposed methods for arousal prediction in the RECOLA dataset based on user information from multiple modalities. Results demonstrate the representation capacity of contrastive learning and its efficiency in boosting the accuracy of affect models. Beyond their evidenced higher performance compared to end-to-end arousal classification, the resulting representations are general-purpose and subject-agnostic, as training is guided though general affect information available in any multimodal corpus.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Multimodal Interaction

access here

25/08/2022

Neural Knowledge Transfer for Sentiment Analysis in Texts with Figurative Language

access here

Author(s)

Dionysios Karamouzas; Ioannis Mademlis; Ioannis Pitas;

Institution

Aristotle University of Thessaloniki;

Abstract

Sentiment analysis in texts, also known as opinion mining, is a significant Natural Language Processing (NLP) task, with many applications in automated social media monitoring, customer feedback processing, e-mail scanning, etc. Despite recent progress due to advances in Deep Neural Networks (DNNs), texts containing figurative language (e.g., sarcasm, irony, metaphors) still pose a challenge to existing methods due to the semantic ambiguities they entail. In this paper, a novel setup of neural knowledge transfer is proposed for DNN-based sentiment analysis of figurative texts. It is employed for distilling knowledge from a pretrained binary recognizer of figurative language into a multiclass sentiment classifier, while the latter is being trained under a multitask setting. Thus, hints about figurativeness implicitly help resolve semantic ambiguities. Evaluation on a relevant public dataset indicates that the proposed method leads to state-of-the-art accuracy.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

23/08/2022

Multi-Attribute Balanced Sampling for Disentangled GAN Controls

access here

Author(s)

Hervé Le Borgne; Michel Crucianu; Nicolas Audebert; Perla Doubinsky

Institution

CEA; CNAM;

Abstract

Various controls over the generated data can be extracted from the latent space of a pre-trained GAN, as it implicitly encodes the semantics of the training data. The discovered controls allow to vary semantic attributes in the generated images but usually lead to entangled edits that affect multiple attributes at the same time. Supervised approaches typically sample and annotate a collection of latent codes, then train classifiers in the latent space to identify the controls. Since the data generated by GANs reflects the biases of the original dataset, so do the resulting semantic controls. We propose to address disentanglement by balancing the semantics of the dataset before training the classifiers. We demonstrate the effectiveness of this approach by extracting disentangled linear directions for face manipulation on state-of-the-art GAN architectures (including StyleGAN2 and StyleGAN3) and two datasets, CelebAHQ and FFHQ. We show that this simple and general approach outperforms state-of-the-art classifier-based methods while avoiding the need for disentanglement-enforcing post-processing.

Access

Open Access

Type of Publication

Journal article

Publisher

Pattern Recognition Letters

access here

22/08/2022

Unsupervised Domain Adaptation for Video Transformers in Action Recognition

access here

Author(s)

Elisa Ricci; Giancomo Zara: Nicu Sebe; Paolo Rota; Thiago Oliveira-Santos; Victor da Costa Vittorio Murino;

Institution

Universidade Federal do Espírito Santo; University of Trento; University of Verona

Abstract

Over the last few years, Unsupervised Domain Adaptation (UDA) techniques have acquired remarkable importance and popularity in computer vision. However, when compared to the extensive literature available for images, the field of videos is still relatively unexplored. On the other hand, the performance of a model in action recognition is heavily affected by domain shift. In this paper, we propose a simple and novel UDA approach for video action recognition. Our approach leverages recent advances on spatio-temporal transformers to build a robust source model that better generalises to the target domain. Furthermore, our architecture learns domain invariant features thanks to the introduction of a novel alignment loss term derived from the Information Bottleneck principle. We report results on two video action recognition benchmarks for UDA, showing state-of-the-art performance on HMDB->UCF, as well as on Kinetics<->EC-Drone, which is more challenging. This demonstrates the effectiveness of our method in handling different levels of domain shift. The source code is available at https://github.com/vturrisi/UDAVT

Access

Closed Access

Type of Publication

Conference paper

Publisher

International Conference on Pattern Recognition

access here

22/08/2022

A full data augmentation pipeline for small object detection based on generative adversarial networks

access here

Author(s)

Alberto Del Bimbo; Brais Bosquet Daniel Cores Lorenzo Seidenari; Manuel Mucientes Victor Brea

Institution

University of Florence; University of Santiago de Compostela

Abstract

Object detection accuracy on small objects, i.e., objects under 32 × 32 pixels, lags behind that of large ones. To address this issue, innovative architectures have been designed and new datasets have been released. Still, the number of small objects in many datasets does not suffice for training. The advent of the generative adversarial networks (GANs) opens up a new data augmentation possibility for training architectures without the costly task of annotating huge datasets for small objects. In this paper, we propose a full pipeline for data augmentation for small object detection which combines a GAN-based object generator with techniques of object segmentation, image inpainting, and image blending to achieve high-quality synthetic data. The main component of our pipeline is DS-GAN, a novel GAN-based architecture that generates realistic small objects from larger ones. Experimental results show that our overall data augmentation method improves the performance of state-of-the-art models up to 11.9% AP@.5 on UAVDT and by 4.7% AP@.5 on iSAID, both for the small objects subset and for a scenario where the number of training instances is limited.

Access

Open Access

Type of Publication

Journal article

Publisher

Pattern Recognition Letters

access here

09/08/2022

GAP: Differentially Private Graph Neural Networks with Aggregation Perturbation

access here

Author(s)

Ali Shahin Shamsabadi; Aurélien Bellet: Daniel Gatica-Perez; Sina Sajadmanesh

Institution

Alan Turing Institute; EPFL; Idiap Research Institute Inria;

Abstract

In this paper, we study the problem of learning Graph Neural Networks (GNNs) with Differential Privacy (DP). We propose a novel differentially private GNN based on Aggregation Perturbation (GAP), which adds stochastic noise to the GNN’s aggregation function to statistically obfuscate the presence of a single edge (edge-level privacy) or a single node and all its adjacent edges (node-level privacy). Tailored to the specifics of private learning, GAP’s new architecture is composed of three separate modules: (i) the encoder module, where we learn private node embeddings without relying on the edge information; (ii) the aggregation module, where we compute noisy aggregated node embeddings based on the graph structure; and (iii) the classification module, where we train a neural network on the private aggregations for node classification without further querying the graph edges. GAP’s major advantage over previous approaches is that it can benefit from multi-hop neighborhood aggregations, and guarantees both edge-level and node-level DP not only for training, but also at inference with no additional costs beyond the training’s privacy budget. We analyze GAP’s formal privacy guarantees using Rényi DP and conduct empirical experiments over three real-world graph datasets. We demonstrate that GAP offers significantly better accuracy-privacy trade-offs than state-of-the-art DP-GNN approaches and naive MLP-based baselines.

Access

Open Access

Type of Publication

Conference paper

Publisher

USENIX Security Symposium

access here

01/08/2022

An Embedded Toolset for Human Activity Monitoring in Critical Environments

access here

Author(s)

Claudio Gennaro; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Luca Ciampi; Marco Di Benedetto

Institution

ISTI-CNR;

Abstract

In many working and recreational activities, there are scenarios where both individual and collective safety have to be constantly checked and properly signaled, as occurring in dangerous workplaces or during pandemic events like the recent COVID-19 disease. From wearing personal protective equipment to filling physical spaces with an adequate number of people, it is clear that a possibly automatic solution would help to check compliance with the established rules. Based on an off-the-shelf compact and low-cost hardware, we present a deployed real use-case embedded system capable of perceiving people’s behaviour and aggregations and supervising the appliance of a set of rules relying on a configurable plug-in framework. Working on indoor and outdoor environments, we show that our implementation of counting people aggregations, measuring their reciprocal physical distances, and checking the proper usage of protective equipment is an effective yet open framework for monitoring human activities in critical conditions.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

26/07/2022

Public opinion monitoring through collective semantic analysis of tweets

access here

Author(s)

Ioannis Mademlis; Ioannis Pitas;

Institution

Aristotle University of Thessaloniki;

Abstract

The high popularity of Twitter renders it an excellent tool for political research, while opinion mining through semantic analysis of individual tweets has proven valuable. However, exploiting relevant scientific advances for collective analysis of Twitter messages in order to quantify general public opinion has not been explored. This paper presents such a novel, automated public opinion monitoring mechanism , consisting of a semantic descriptor that relies on Natural Language Processing (NLP) algorithms. A four-dimensional descriptor is first extracted for each tweet independently, quantifying text polarity, offensiveness, bias and figurativeness. Subsequently, it is summarized across multiple tweets, according to a desired aggregation strategy and aggregation target. This can then be exploited in various ways, such as training machine learning models for forecasting day-by-day public opinion predictions. The proposed mechanism is applied to the 2016/2020 US Presidential Elections tweet datasets and the resulting succinct public opinion descriptions are explored as a case study.

Access

Open Access

Type of Publication

Journal article

Publisher

Social Network Analysis and Mining

access here

23/07/2022

Federated Unlearning: How to Efficiently Erase a Client in FL?

access here

Author(s)

Ambrish Rawat; Anisa Halimi; Nathalie Baracaldo; Swanand Kadhe;

Institution

IBM Research;

Abstract

With privacy legislation empowering users with the right to be forgotten, it has become essential to make a model forget about some of its training data. We explore the problem of removing any client’s contribution in federated learning (FL). During FL rounds, each client performs local training to learn a model that minimizes the empirical loss on their private data. We propose to perform unlearning at the client (to be erased) by reversing the learning process, i.e., training a model to \emph{maximize} the local empirical loss. In particular, we formulate the unlearning problem as a constrained maximization problem by restricting to an $\ell_2$-norm ball around a suitably chosen reference model to help retain some knowledge learnt from the other clients’ data. This allows the client to use projected gradient descent to perform unlearning. The method neither requires global access to the data used for training nor the history of the parameter updates to be stored by the aggregator (server) or any of the clients. Experiments on the MNIST dataset show that the proposed unlearning method is efficient and effective.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

22/07/2022

Learning to Detect Fallen People in Virtual Worlds

access here

Author(s)

Claudio Gennaro; Fabio Carrara; Fabrizio Falchi; Lorenzo Pasco

Institution

ISTI-CNR; University of Pisa

Abstract

Falling is one of the most common causes of injury in all ages, especially in the elderly, where it is more frequent and severe. For this reason, a tool that can detect a fall in real time can be helpful in ensuring appropriate intervention and avoiding more serious damage. Some approaches available in the literature use sensors, wearable devices, or cameras with special features such as thermal or depth sensors. In this paper, we propose a Computer Vision deep-learning based approach for human fall detection based on largely available standard RGB cameras. A typical limitation of this kind of approaches is the lack of generalization to unseen environments. This is due to the error generated during human detection and, more generally, due to the unavailability of large-scale datasets that specialize in fall detection problems with different environments and fall types. In this work, we mitigate these limitations with a general-purpose object detector trained using a virtual world dataset in addition to real-world images. Through extensive experimental evaluation, we verified that by training our models on synthetic images as well, we were able to improve their ability to generalize. Code to reproduce results is available at https://github.com/lorepas/fallen-people-detection.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Content-based Multimedia Indexing

access here

21/07/2022

Behavioral impulsivity is associated with pupillary alterations and hyperactivity in CDKL5 mutant mice

access here

Author(s)

Aurelia Viglione; Elena Putignano; Fabio Carrara; Giulia Sagona; Giuseppe Amato; Leonardo Lupori; Raffaele Mazziotti; Valentino Totaro;

Institution

BIO@SNS Lab; IRCCS Stella Maris Foundation; ISTI-CNR; Italy National Research Council;

Abstract

Cyclin-dependent kinase-like 5 (Cdkl5) deficiency disorder (CDD) is a severe neurodevelopmental condition caused by mutations in the X-linked Cdkl5 gene. CDD is characterized by early-onset seizures in the first month of life, intellectual disability, motor and social impairment. No effective treatment is currently available and medical management is only symptomatic and supportive. Recently, mouse models of Cdkl5 disorder have demonstrated that mice lacking Cdkl5 exhibit autism-like phenotypes, hyperactivity and dysregulations of the arousal system, suggesting the possibility to use these features as translational biomarkers. In this study, we tested Cdkl5 male and female mutant mice in an appetitive operant conditioning chamber to assess cognitive and motor abilities, and performed pupillometry to assess the integrity of the arousal system. Then, we evaluated the performance of artificial intelligence models to classify the genotype of the animals from the behavioral and physiological phenotype. The behavioral results show that CDD mice display impulsivity, together with low levels of cognitive flexibility and perseverative behaviors. We assessed arousal levels by simultaneously recording pupil size and locomotor activity. Pupillometry reveals in CDD mice a smaller pupil size and an impaired response to unexpected stimuli associated with hyperlocomotion, demonstrating a global defect in arousal modulation. Finally, machine learning reveals that both behavioral and pupillometry parameters can be considered good predictors of CDD. Since early diagnosis is essential to evaluate treatment outcomes and pupillary measures can be performed easily, we proposed the monitoring of pupil size as a promising biomarker for CDD.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

14/07/2022

DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval

access here

Author(s)

Christos Tzelepis; Giorgios Kordopatis-Zilos; Ioannis Kompatsiaris; Ioannis Patras; Symeon Papadopoulos

Institution

Abstract

In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, called Distill-and-Select (DnS), that starting from a well-performing fine-grained Teacher Network learns: a) Student Networks at different retrieval performance and computational efficiency trade-offs and b) a Selector Network that at test time rapidly directs samples to the appropriate student to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store/index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets — this leads to good students. We evaluate DnS on five public datasets on three different video retrieval tasks and demonstrate a) that our students achieve state-of-the-art performance in several cases and b) that the DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, the proposed method achieves similar mAP with the teacher but is 20 times faster and requires 240 times less storage space. The collected dataset and implementation are publicly available: https://github.com/mever-team/distill-and-select.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

10/07/2022

A Graph Diffusion Scheme for Decentralized Content Search based on Personalized PageRank

access here

Author(s)

Ioannis Kompatsiaris; Nikolaos Giatsoglou; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas

Abstract

Decentralization is emerging as a key feature of the future Internet. However, effective algorithms for search are missing from state-of-the-art decentralized technologies, such as distributed hash tables and blockchain. This is surprising, since decentralized search has been studied extensively in earlier peer- to-peer (P2P) literature. In this work, we adopt a fresh outlook for decentralized search in P2P networks that is inspired by advancements in dense information retrieval and graph signal processing. In particular, we generate latent representations of P2P nodes based on their stored documents and diffuse them to the rest of the network with graph filters, such as person- alized PageRank. We then use the diffused representations to guide search queries towards relevant content. Our preliminary approach is successful in locating relevant documents in nearby nodes but the accuracy declines sharply with the number of stored documents, highlighting the need for more sophisticated techniques.

Access

Open Access

Type of Publication

Conference paper

Publisher

Decentralized Internet, Networks, Protocols, and Systems

access here

08/07/2022

Game State Learning via Game Scene Augmentation

access here

Author(s)

Antonios Liapis; Chintan Triverdi; Georgios N. Yannakakis; Konstantinos Makantasis;

Institution

University of Malta

Abstract

Having access to accurate game state information is of utmost importance for any artificial intelligence task including game-playing, testing, player modeling, and procedural content generation. Self-Supervised Learning (SSL) techniques have shown to be capable of inferring accurate game state information from the high-dimensional pixel input of game footage into compressed latent representations. Contrastive Learning is a popular SSL paradigm where the visual understanding of the game’s images comes from contrasting dissimilar and similar game states defined by simple image augmentation methods. In this study, we introduce a new game scene augmentation technique—named GameCLR—that takes advantage of the game-engine to define and synthesize specific, highly-controlled renderings of different game states, thereby, boosting contrastive learning performance. We test our GameCLR technique on images of the CARLA driving simulator environment and compare it against the popular SimCLR baseline SSL method. Our results suggest that GameCLR can infer the game’s state information from game footage more accurately compared to the baseline. Our proposed approach allows us to conduct game artificial intelligence research by directly utilizing screen pixels as input.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on the Foundations of Digital Games

access here

04/07/2022

Active Learning and the Saerens-Latinne-Decaestecker Algorithm: An Evaluation

access here

Author(s)

Alessio Molinari; Andrea Esuli; Fabrizio Sebastiani;

Institution

ISTI-CNR;

Abstract

The Saerens-Latinne-Decaestecker (SLD) algorithm is a method whose goal is improving the quality of the posterior probabilities (or simply “posteriors”) returned by a probabilistic classifier in scenarios characterized by prior probability shift (PPS) between the training set and the unlabelled (“test”) set. This is an important task, (a) because posteriors are of the utmost importance in downstream tasks such as, e.g., multiclass classification and cost-sensitive classification, and (b) because PPS is ubiquitous in many applications. In this paper we explore whether using SLD can indeed improve the quality of posteriors returned by a classifier trained via active learning (AL), a class of machine learning (ML) techniques that indeed tend to generate substantial PPS. Specifically, we target AL via relevance sampling (ALvRS) and AL via uncertainty sampling (ALvUS), two AL techniques that are very well-known especially because, due to their low computational cost, are suitable to being applied in scenarios characterized by large datasets. We present experimental results obtained on the RCV1-v2 dataset, showing that SLD fails to deliver better-quality posteriors with both ALvRS and ALvUS, thus contradicting previous findings in the literature, and that this is due not to the amount of PPS that these techniques generate, but to how the examples they prioritize for annotation are distributed.

Access

Open Access

Type of Publication

Conference paper

Publisher

Information Retrieval Communities in Europe Conference

access here

01/07/2022

Unsupervised High-Resolution Portrait Gaze Correction and Animation

access here

Author(s)

Enver Sangineto; Hao Tang; Jichao Zhang; Jimgjing Chen; Nicu Sebe; Peng Wu; Yan Yan

Institution

University of Trento;

Abstract

This paper proposes a gaze correction and animation method for high-resolution, unconstrained portrait images, which can be trained without the gaze angle and the head pose annotations. Common gaze-correction methods usually require annotating training data with precise gaze, and head pose information. Solving this problem using an unsupervised method remains an open problem, especially for high-resolution face images in the wild, which are not easy to annotate with gaze and head pose labels. To address this issue, we first create two new portrait datasets: CelebGaze (256 × 256) and highresolution CelebHQGaze (512 × 512). Second, we formulate the gaze correction task as an image inpainting problem, addressed using a Gaze Correction Module (GCM) and a Gaze Animation Module (GAM). Moreover, we propose an unsupervised training strategy, i.e., Synthesis-As-Training, to learn the correlation between the eye region features and the gaze angle. As a result, we can use the learned latent space for gaze animation with semantic interpolation in this space. Moreover, to alleviate both the memory and the computational costs in the training and the inference stage, we propose a Coarse-to-Fine Module (CFM) integrated with GCM and GAM. Extensive experiments validate the effectiveness of our method for both the gaze correction and the gaze animation tasks in both low and high-resolution face datasets in the wild and demonstrate the superiority of our method with respect to the state of the art.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Image Processing

access here

29/06/2022

Measuring Fairness under Unawareness of Sensitive Attributes: A Quantification-Based Approach

access here

Author(s)

Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani;

Institution

ISTI-CNR; University of Padova

Abstract

Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. To achieve this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it can be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. More in detail, we show that fairness under unawareness can be cast as a quantification problem and solved with proven methods from the quantification literature. We show that these methods outperform previous approaches to measure demographic parity in five experimental protocols, corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.

Access

Open Access

Type of Publication

Working paper

Publisher

Journal of Artificial Intelligence Research

access here

27/06/2022

Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames

access here

Author(s)

Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris

Institution

CERTH - Center for Research and Technology Hellas QMUL

Abstract

In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, that relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frames’ dependencies and the ability to parallelize the training process of RNN-based network architectures, the developed method relies solely on the use of a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames’ dependencies based on global attention, our method integrates a concentrated attention mechanism that is able to focus on non-overlapping blocks in the main diagonal of the attention matrix, and to enrich the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated frames of the video. In this way, our method makes better estimates about the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations using two benchmarking datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to the human preferences. An ablation study that focuses on the introduced components, namely the use of concentrated attention in combination with attention-based estimates about the frames’ uniqueness and diversity, shows their relative contributions to the overall summarization performance.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

27/06/2022

Automatic Detection of Bot-generated Tweets

access here

Author(s)

Adrian Popescu; Babacar Sow; Julien Tourille;

Institution

Université Clermont Auvergne; Université Paris-Saclay;

Abstract

Deep neural networks have the capacity to generate textual content which is increasingly difficult to distinguish from that produced by humans. Such content can be used in disinformation campaigns and its detrimental effects are amplified if it spreads on social networks. Here, we study the automatic detection of bot-generated Twitter messages. This task is difficult due to combination between the strong performance of recent deep language models and the limited length of tweets. In this study, we propose a challenging definition of the problem by making no assumption regarding the bot account, its network or the method used to generate the text. We devise two approaches for bot detection based on pretrained language models and create a new dataset of generated tweets to improve the performance of our classifier on recent text generation algorithms. The obtained results show that the generalization capabilities of the proposed classifier heavily depends on the dataset used to trained the model. Interestingly, the two automatic dataset augmentation proposed here show promising results. Their introduction leads to consistent performance gains compared to the use of the original dataset alone.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

27/06/2022

Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection

access here

Author(s)

Claudio Gennaro; Davide Alessandro Coccomini; Fabrizio Falchi; Giuseppe Amato; Roberto Cardelli;

Institution

CNIT; ISTI-CNR;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Multimedia Retrieval

access here

24/06/2022

Multi-Camera Vehicle Counting Using Edge-AI

access here

Author(s)

Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Luca Ciampi;

Institution

ISTI-CNR;

Abstract

This paper presents a novel solution to automatically count vehicles in a parking lot using images captured by smart cameras. Unlike most of the literature on this task, which focuses on the analysis of single images, this paper proposes the use of multiple visual sources to monitor a wider parking area from different perspectives. The proposed multi-camera system is capable of automatically estimating the number of cars present in the entire parking lot directly on board the edge devices. It comprises an on-device deep learning-based detector that locates and counts the vehicles from the captured images and a decentralized geometric-based approach that can analyze the inter-camera shared areas and merge the data acquired by all the devices. We conducted the experimental evaluation on an extended version of the CNRPark-EXT dataset, a collection of images taken from the parking lot on the campus of the National Research Council (CNR) in Pisa, Italy. We show that our system is robust and takes advantage of the redundant information deriving from the different cameras, improving the overall performance without requiring any extra geometrical information of the monitored scene.

Access

Open Access

Type of Publication

Journal article

Publisher

Expert Systems with Applications

access here

22/06/2022

Novel Class Discovery in Semantic Segmentation

access here

Author(s)

Gim Hee Lee; Nicu Sebe; Yuyang Zhao; Zhun Zhong

Institution

National University of Singapore; University of Trento;

Abstract

We introduce a new setting of Novel Class Discovery in Semantic Segmentation (NCDSS), which aims at segmenting unlabeled images containing new classes given prior knowledge from a labeled set of disjoint classes. In contrast to existing approaches that look at novel class discovery in image classification, we focus on the more challenging semantic segmentation. In NCDSS, we need to distinguish the objects and background, and to handle the existence of multiple classes within an image, which increases the difficulty in using the unlabeled data. To tackle this new setting, we leverage the labeled base data and a saliency model to coarsely cluster novel classes for model training in our basic framework. Additionally, we propose the Entropy-based Uncertainty Modeling and Self-training (EUMS) framework to overcome noisy pseudo-labels, further improving the model performance on the novel classes. Our EUMS utilizes an entropy ranking technique and a dynamic reassignment to distill clean labels, thereby making full use of the noisy data via self-supervised learning. We build the NCDSS benchmark on the PASCAL-5i dataset and COCO-20i dataset. Extensive experiments demonstrate the feasibility of the basic framework (achieving an average mIoU of 49.81% on PASCAL-5i) and the effectiveness of EUMS framework (outperforming the basic framework by 9.28% mIoU on PASCAL-5i).

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

17/06/2022

ICS: Total Freedom in Manual Text Classification Supported by Unobtrusive Machine Learning

access here

Author(s)

Andrea Esuli;

Institution

ISTI-CNR;

Abstract

We present the Interactive Classification System (ICS), a web-based application that supports the activity of manual text classification. The application uses machine learning to continuously fit automatic classification models that are in turn used to actively support its users with classification suggestions. The key requirement we have established for the development of ICS is to give its users total freedom of action: they can at any time modify any classification schema and any label assignment, possibly reusing any relevant information from previous activities. We investigate how this requirement challenges the typical scenarios faced in machine learning research, which instead give no active role to humans or place them into very constrained roles, e.g., on-demand labeling in active learning processes, and always assume some degree of batch processing of data. We satisfy the “total freedom” requirement by designing an unobtrusive machine learning model, i.e., the machine learning component of ICS acts as an unobtrusive observer of the users, that never interrupts them, continuously adapts and updates its models in response to their actions, and it is always available to perform automatic classifications. Our efficient implementation of the unobtrusive machine learning model combines various machine learning methods and technologies, such as hash-based feature mapping, random indexing, online learning, active learning, and asynchronous processing.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Access

access here

14/06/2022

Effects of Emotions on Head Motion Predictability in 360° Videos

access here

Author(s)

Lucile Sassatelli; Quentin Guimard

Institution

Université Côte d'Azur;

Abstract

While 360° videos watched in a VR headset are gaining in popularity, it is necessary to lower the required bandwidth to stream these immersive videos and obtain a satisfying quality of experience. Doing so requires predicting the user’s head motion in advance, which has been tackled by a number of recent prediction methods considering the video content and the user’s past motion. However, human motion is a complex process that can depend on many more parameters, including the type of attentional phase the user is currently in, and their emotions, which can be difficult to capture. This is the first article to investigate the effects of user emotions on the predictability of head motion, in connection with video-centric parameters. We formulate and verify hypotheses, and construct a structural equation model of emotion, motion and predictability. We show that the prediction error is higher for higher valence ratings, and that this relationship is mediated by head speed. We also show that the prediction error is lower for higher arousal, but that spatial information moderates the effect of arousal on predictability. This work opens the path to better capture important factors in human motion, to help improve the training process of head motion predictors.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Workshop on Immersive Mixed and Virtual Environment Systems

access here

14/06/2022

PEM360: A dataset of 360° videos with continuous Physiological measurements, subjective Emotional ratings and Motion traces

access here

Author(s)

Aldric Ducreux; Auriane Gros; Camille Bauce; Florent Robert; Hui-Yin Wu; Lucile Sassatelli; Marco Wincler; Quentin Guimard

Institution

Université Côte d'Azur;

Abstract

From a user perspective, immersive content can elicit more intense emotions than flat-screen presentations. From a system perspective, efficient storage and distribution remain challenging, and must consider user attention. Understanding the connection between user attention, user emotions and immersive content is therefore key. In this article, we present a new dataset, PEM360 of user head movements and gaze recordings in 360° videos, along with self-reported emotional ratings of valence and arousal, and continuous physiological measurement of electrodermal activity and heart rate. The stimuli are selected to enable the spatiotemporal analysis of the connection between content, user motion and emotion. We describe and provide a set of software tools to process the various data modalities, and introduce a joint instantaneous visualization of user attention and emotion we name Emotional maps. We exemplify new types of analyses the PEM360 dataset can enable. The entire data and code are made available in a reproducible framework.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

13/06/2022

Learning Task-Independent Game State Representations from Unlabeled Images

access here

Author(s)

Antonios Liapis; Chintan Triverdi; Georgios N. Yannakakis; Konstantinos Makantasis;

Institution

University of Malta

Abstract

Self-supervised learning (SSL) techniques have been widely used to learn compact and informative representations from high-dimensional complex data. In many computer vision tasks, such as image classification, such methods achieve state-of-the-art results that surpass supervised learning approaches. In this paper, we investigate whether SSL methods can be leveraged for the task of learning accurate state representations of games, and if so, to what extent. For this purpose, we collect game footage frames and corresponding sequences of games’ internal state from three different 3D games: VizDoom, the CARLA racing simulator and the Google Research Football Environment. We train an image encoder with three widely used SSL algorithms using solely the raw frames, and then attempt to recover the internal state variables from the learned representations. Our results across all three games showcase significantly higher correlation between SSL representations and the game’s internal state compared to pre-trained baseline models such as ImageNet. Such findings suggest that SSL-based visual encoders can yield general — not tailored to a specific task — yet informative game representations solely from game pixel information. Such representations can, in turn, form the basis for boosting the performance of downstream learning tasks in games, including gameplaying, content generation and player modeling.

Access

Open Access

Type of Publication

Conference paper

Publisher

IEEE Conference on Games

access here

10/06/2022

VISIONE at Video Browser Showdown 2022

access here

Author(s)

Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;

Institution

ISTI-CNR;

Abstract

VISIONE is a content-based retrieval system that supports various search functionalities (text search, object/color-based search, semantic and visual similarity search, temporal search). It uses a full-text search engine as a search backend. In the latest version of our system, we modified the user interface, and we made some changes to the techniques used to analyze and search for videos.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

04/06/2022

Spectral Denoising for Microphone Classification

access here

Author(s)

Antonio Giganti; Luca Cuccovillo; Paolo Bestagini; Patrick Aichroth; Stefano Tubaro;

Institution

Fraunhofer IDMT; Politecnico di Milano;

Abstract

In this paper, we propose the use of denoising for microphone classification, to enable its usage for several key application domains that involve noisy conditions. We describe the proposed analysis pipeline and the baseline algorithm for microphone classification, and discuss various denoising approaches which can be applied to it within the time or spectral domain; finally, we determine the best-performing denoising procedure, and evaluate the performance of the overall, integrated approach with several SNR levels of additive input noise. As a result, the proposed method achieves an average accuracy increase of about 25% on denoised content over the reference baseline.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Workshop on Multimedia AI against Disinformation

access here

01/06/2022

Report on the 1st International Workshop on Learning to Quantify (LQ 2021)

access here

Author(s)

Alejandro Moreo; Fabrizio Sebastiani; Juan José del Coz;

Institution

ISTI-CNR; University of Oviedo

Abstract

The 1st International Workshop on Learning to Quantify (LQ 2021 – https://cikmlq2021.github.io/), organized as a satellite event of the 30th ACM International Confer- ence on Knowledge Management (CIKM 2021), took place on two separate days, November 1 and 5, 2021. As the main CIKM 2021 conference, the workshop was held entirely on- line, due to the COVID-19 pandemic. This report presents a summary of each keynote speech and contributed paper presented in this event, and discusses the issues that were raised during the workshop.

Access

Open Access

Type of Publication

Journal article

Publisher

SIGKDD Exploration

access here

01/06/2022

Curriculum Learning: A Survey

access here

Author(s)

Nicu Sebe; Paolo Rota; Petru Soviany; Radu Tudor Ionescu

Institution

University of Trento; University Politehnica of Bucharest

Abstract

Training machine learning models in a meaningful order, from the easy samples to the hard ones, using curriculum learning can provide performance improvements over the standard training approach based on random data shuffling, without any additional computational costs. Curriculum learning strategies have been successfully employed in all areas of machine learning, in a wide range of tasks. However, the necessity of finding a way to rank the samples from easy to hard, as well as the right pacing function for introducing more difficult data can limit the usage of the curriculum approaches. In this survey, we show how these limits have been tackled in the literature, and we present different curriculum learning instantiations for various tasks in machine learning. We construct a multi-perspective taxonomy of curriculum learning approaches by hand, considering various classification criteria. We further build a hierarchical tree of curriculum learning methods using an agglomerative clustering algorithm, linking the discovered clusters with our taxonomy. At the end, we provide some interesting directions for future work.

Access

Closed Access

Type of Publication

Journal article

Publisher

International Journal of Computer Vision

access here

27/05/2022

Learning to count biological structures with raters’ uncertainty

access here

Author(s)

Carlos Santiago; Claudio Gennaro; Fabio Carrara; Giuseppe Amato; Leonardo Lupori; Luca Ciampi; Raffaele Mazziotti; Tommaso Pizzorusso Valentino Totaro;

Institution

Institute for Systems and Robotics; ISTI-CNR; University of Florence;

Abstract

Exploiting well-labeled training sets has led deep learning models to astonishing results for counting biological structures in microscopy images. However, dealing with weak multi-rater annotations, i.e., when multiple human raters disagree due to non-trivial patterns, remains a relatively unexplored problem. More reliable labels can be obtained by aggregating and averaging the decisions given by several raters to the same data. Still, the scale of the counting task and the limited budget for labeling prohibit this. As a result, making the most with small quantities of multi-rater data is crucial. To this end, we propose a two-stage counting strategy in a weakly labeled data scenario. First, we detect and count the biological structures; then, in the second step, we refine the predictions, increasing the correlation between the scores assigned to the samples and the raters’ agreement on the annotations. We assess our methodology on a novel dataset comprising fluorescence microscopy images of mice brains containing extracellular matrix aggregates named perineuronal nets. We demonstrate that we significantly enhance counting performance, improving confidence calibration by taking advantage of the redundant information characterizing the small sets of available multi-rater data.

Access

Open Access

Type of Publication

Journal article

Publisher

Medical Image Analysis

access here

27/05/2022

Exploiting Caption Diversity for Unsupervised Video Summarization

access here

Author(s)

Ioannis Mademlis; Ioannis Pitas; Michail Kaseris

Institution

Aristotle University of Thessaloniki;

Abstract

Most unsupervised Deep Neural Networks (DNNs) for video summarization rely on adversarial learning, autoencoding and training without utilizing any ground-truth summary. In several cases, the Convolutional Neural Network (CNN)-derived video frame representations are sequentially fed to a Long Short-Term Memory (LSTM) network, which selects key frames and, during training, attempts to reconstruct the original/full video from the summary, while confusing an adversarially optimized Discriminator. Additionally, regularizers aiming at maximizing the summary’s visual semantic diversity can be employed, such as the Determinantal Point Process (DPP) loss term. In this paper, a novel DPP-based regularizer is proposed that exploits a pretrained DNN-based image captioner in order to additionally enforce maximal keyframe diversity from the perspective of textual semantic content. Thus, the selected key-frames are encouraged to differ not only with regard to what objects they depict, but also with regard to their textual descriptions, which may additionally capture activities, scene context, etc. Empirical evaluation indicates that the proposed regularizer leads to state-of-the-art performance.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

24/05/2022

LANBIQUE: LANguage-based Blind Image QUality Evaluation

access here

Author(s)

Alberto Del Bimbo; Leonardo Galteri; Lorenzo Seidenari; Marco Bertini; Pietro Bongini

Institution

University of Florence;

Abstract

Image quality assessment is often performed with deep networks which are fine-tuned to regress a human provided quality score of a given image. Usually, this approaches may lack generalization capabilities and, while being highly precise on similar image distribution, it may yield lower correlation on unseen distortions. In particular they show poor performances whereas images corrupted by noise, blur or compressed have been restored by generative models. As a matter of fact, evaluation of these generative models is often performed providing anecdotal results to the reader. In the case of image enhancement and restoration, reference images are usually available. Nonetheless, using signal based metrics often leads to counterintuitive results: highly natural crisp images may obtain worse scores than blurry ones. On the other hand, blind reference image assessment may rank images reconstructed with GANs higher than the original undistorted images. To avoid time consuming human based image assessment, semantic computer vision tasks may be exploited instead. In this paper we advocate the use of language generation tasks to evaluate the quality of restored images. We refer to our assessment approach as LANguage-based Blind Image QUality Evaluation (LANBIQUE). We show experimentally that image captioning, used as a downstream task, may serve as a method to score image quality, independently of the distortion process that affects the data. Captioning scores are better aligned with human rankings with respect to classic signal based or No-Reference image quality metrics. We show insights on how the corruption, by artifacts, of local image structure may steer image captions in the wrong direction.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

15/05/2022

Deep Variational Learning for Multiple Trajectory Prediction of 360° Head Movements

access here

Author(s)

Alberto Del Bimbo; Federico Becattini; Francesco Marchetti; Lorenzo Seidenari; Lucile Sassatelli; Quentin Guimard

Institution

Università degli Studi di Firenze; Université Côte d'Azur;

Abstract

Prediction of head movements in immersive media is key to design efficient streaming systems able to focus the bandwidth budget on visible areas of the content. Numerous proposals have therefore been made in the recent years to predict 360° images and videos. However, the performance of these models is limited by a main characteristic of the head motion data: its intrinsic uncertainty. In this article, we present an approach to generate multiple plausible futures of head motion in 360° videos, given a common past trajectory. Our method provides likelihood estimates of every predicted trajectory, enabling direct integration in streaming optimization. To the best of our knowledge, this is the first work that considers the problem of multiple head motion prediction for 360° video streaming. We first quantify this uncertainty from the data. We then introduce our discrete variational multiple sequence (DVMS) learning framework, which builds on deep latent variable models. We design a training procedure to obtain a flexible and lightweight stochastic prediction model compatible with sequence-to-sequence recurrent neural architectures. Experimental results on 3 different datasets show that our method DVMS outperforms competitors adapted from the self-driving domain by up to 37% on prediction horizons up to 5 sec., at lower computational and memory costs. Finally, we design a method to estimate the respective likelihoods of the multiple predicted trajectories, by exploiting the stationarity of the distribution of the prediction error over the latent space. Experimental results on 3 datasets show the quality of these estimates, and how they depend on the video category.

Access

Open Access

Type of Publication

Conference paper

Publisher

ACM Multimedia Systems Conference

access here

13/05/2022

COCO, LVIS, Open Images V4 classes mapping

access here

Author(s)

Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;

Institution

ISTI-CNR;

Abstract

This repository contains a mapping between the classes of COCO, LVIS, and Open Images V4 datasets into a unique set of 1460 classes. COCO [Lin et al 2014] contains 80 classes, LVIS [gupta2019lvis] contains 1460 classes, Open Images V4 [Kuznetsova et al. 2020] contains 601 classes. We built a mapping of these classes using a semi-automatic procedure in order to have a unique final list of 1460 classes. We also generated a hierarchy for each class, using wordnet.

Access

Open Access

Type of Publication

Journal article

Publisher

Journal of Imaging

access here

11/05/2022

Forecasting Future Instance Segmentation with Learned Optical Flow and Warping

access here

Author(s)

Alberto Del Bimbo; Andrea Ciamarra Federico Becattini; Lorenzo Seidenari;

Institution

Università degli Studi di Firenze; University of Florence;

Abstract

For an autonomous vehicle it is essential to observe the ongoing dynamics of a scene and consequently predict imminent future scenarios to ensure safety to itself and others. This can be done using different sensors and modalities. In this paper we investigate the usage of optical flow for predicting future semantic segmentations. To do so we propose a model that forecasts flow fields autoregressively. Such predictions are then used to guide the inference of a learned warping function that moves instance segmentations onto future frames. Results on the Cityscapes dataset demonstrate the effectiveness of optical-flow methods.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

10/05/2022

MeVer NetworkX: Network Analysis and Visualization for Tracing Disinformation

access here

Author(s)

Francesco Poldi; Ioannis Kompatsiaris; Lazaros Apostolidis; Olga Papadopoulou; Symeon Papadopoulos Themistoklis Makedas

Institution

CERTH - Center for Research and Technology Hellas EU DisinfoLab

Abstract

The proliferation of online news, especially during the “infodemic” that emerged along with the COVID-19 pandemic, has rapidly increased the risk of and, more importantly, the volume of online misinformation. Online Social Networks (OSNs), such as Facebook, Twitter, and YouTube, serve as fertile ground for disseminating misinformation, making the need for tools for analyzing the social web and gaining insights into communities that drive misinformation online vital. We introduce the MeVer NetworkX analysis and visualization tool, which helps users delve into social media conversations, helps users gain insights about how information propagates, and provides intuition about communities formed via interactions. The contributions of our tool lie in easy navigation through a multitude of features that provide helpful insights about the account behaviors and information propagation, provide the support of Twitter, Facebook, and Telegram graphs, and provide the modularity to integrate more platforms. The tool also provides features that highlight suspicious accounts in a graph that a user should investigate further. We collected four Twitter datasets related to COVID-19 disinformation to present the tool’s functionalities and evaluate its effectiveness.

Access

Open Access

Type of Publication

Journal article

Publisher

Future Internet

access here

09/05/2022

A two-way street between AI research and media scholars

access here

Author(s)

Johan Oomen; Philo van Kemenade; Rasa Bocyte

Institution

Netherlands Institute for Sound & Vision

Abstract

With the rapid advance of Artificial Intelligence (AI), increased availability of digitised and born-digital sources from a wide range of collection owners, researchers can gain new perspectives on large-scale audiovisual collections and study patterns that reach across media and time. But what are the actual requirements that humanities scholars have for the use of such AI-based tooling? This question is what the Netherlands Institute for Sound & Vision brought into the European research project AI4Media. Specifically, NISV is investigating how AI tools could open new research possibilities for the users of the CLARIAH Media Suite virtual research environment which enables exploration and analysis of distributed audiovisual collections. In this short paper presentation, we will present the requirements gathered from humanities scholars on AI tooling and describe how they are being translated into functional AI tools in the Media Suite.

Access

Open Access

Type of Publication

Conference paper

Publisher

DH Benelux

access here

06/05/2022

Multi-Input Architecture and Disentangled Representation Learning for Multi-Dimensional Modeling of Music Similarity

access here

Author(s)

Hanna Lukashevich Jakob Abeßer Ribecky Sebastian

Institution

Fraunhofer IDMT;

Abstract

In the context of music information retrieval, similarity-based approaches are useful for a variety of tasks that benefit from a query-by-example approach. Music however, naturally decomposes into a set of semantically meaningful factors of variation. Current representation learning strategies pursue the disentanglement of such factors from deep representations, and result in highly interpretable models. This allows to model the perception of music similarity, which is highly subjective and multi-dimensional. While the focus of prior work is on metadata driven similarity, we suggest to directly model the human notion of multi-dimensional music similarity. To achieve this, we propose a multi-input deep neural network architecture, which simultaneously processes mel-spectrogram, CENSchromagram and tempogram representations in order to extract informative features for different disentangled musical dimensions: genre, mood, instrument, era, tempo, and key. We evaluated the proposed music similarity approach using a triplet prediction task and found that the proposed multi-input architecture outperforms a state of the art method. Furthermore, we present a novel multi-dimensional analysis to evaluate the influence of each disentangled dimension on the perception of music similarity.

Access

Open Access

Type of Publication

Publication

Publisher

N/A

access here

27/04/2022

The MeVer DeepFake Detection Service: Lessons Learnt from Developing and Deploying in the Wild

access here

Author(s)

Denis Teyssou; Giorgios Kordopatis-Zilos; Ioannis Kompatsiaris; Ipek B. Schlicht; Killian Levacher; Lazaros Apostolidis; Panagiotis Galopoulos; Spyridon Baxevanakis;

Institution

Agence France-Presse; CERTH - Center for Research and Technology Hellas Deutsche Welle; IBM Research;

Abstract

Enabled by recent improvements in generation methodologies, DeepFakes have become mainstream due to their increasingly better visual quality, the increase in easy-to-use generation tools and the rapid dissemination through social media. This fact poses a severe threat to our societies with the potential to erode social cohesion and influence our democracies. To mitigate the threat, numerous DeepFake detection schemes have been introduced in the literature but very few provide a web service that can be used in the wild. In this paper, we introduce the MeVer DeepFake detection service, a web service detecting deep learning manipulations in images and video. We present the design and implementation of the proposed processing pipeline that involves a model ensemble scheme, and we endow the service with a model card for transparency. Experimental results show that our service performs robustly on the three benchmark datasets while being vulnerable to Adversarial Attacks. Finally, we outline our experience and lessons learned when deploying a research system into production in the hopes that it will be useful to other academic and industry teams.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Workshop on Multimedia AI against Disinformation

access here

23/04/2022

Data-driven personalisation of Television Content: A Survey

access here

Author(s)

Jeremy Foss; Konstantinos Apostolidis; Lyndon Nixon; Vasileios Mezaris

Institution

Birmingham City University; CERTH - Center for Research and Technology Hellas MODUL Technology;

Abstract

Access

Open Access

Type of Publication

Journal article

Publisher

ACM Multimedia Systems Conference

access here

19/04/2022

Dungeons & Replicants II: Automated Game Balancing Across Multiple Difficulty Dimensions via Deep Player Behavior Modeling

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Johannes Pfau; Rainer Malaka;

Institution

University of Bremen; University of Malta

Abstract

Video game testing has become a major investment of time, labor and expense in the game industry. Particularly the balancing of in-game units, characters and classes can cause long-lasting issues that persist years after a game’s launch. While approaches incorporating artificial intelligence have already shown successes in reducing manual effort and enhancing game development processes, most of these draw on heuristic, generalized or optimal behavior routines, while actual low-level decisions from individual players and their resulting playing styles are rarely considered. In this paper, we apply Deep Player Behavior Modeling to turn atomic actions of 213 players from 6 months of single-player instances within the MMORPG Aion into generative models that capture and reproduce particular playing strategies. In a subsequent simulation, the resulting generative agents (replicants) were tested against common NPC opponent types of MMORPGs that iteratively increased in difficulty, respective to the primary factor that constitutes this enemy type (Melee, Ranged, Rogue, Buffer, Debuffer, Healer, Tank or Group). As a result, imbalances between classes as well as strengths and weaknesses regarding particular combat challenges could be identified and regulated automatically.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Games

access here

14/04/2022

RankNEAT: Outperforming Stochastic Gradient Search in Preference Learning Tasks

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas;

Institution

University of Malta

Abstract

Stochastic gradient descent (SGD) is a premium optimization method for training neural networks, especially for learning objectively defined labels such as image objects and events. When a neural network is instead faced with subjectively defined labels–such as human demonstrations or annotations–SGD may struggle to explore the deceptive and noisy loss landscapes caused by the inherent bias and subjectivity of humans. While neural networks are often trained via preference learning algorithms in an effort to eliminate such data noise, the de facto training methods rely on gradient descent. Motivated by the lack of empirical studies on the impact of evolutionary search to the training of preference learners, we introduce the RankNEAT algorithm which learns to rank through neuroevolution of augmenting topologies. We test the hypothesis that RankNEAT outperforms traditional gradient-based preference learning within the affective computing domain, in particular predicting annotated player arousal from the game footage of three dissimilar games. RankNEAT yields superior performances compared to the gradient-based preference learner (RankNet) in the majority of experiments since its architecture optimization capacity acts as an efficient feature selection mechanism, thereby, eliminating overfitting. Results suggest that RankNEAT is a viable and highly efficient evolutionary alternative to preference learning.

Access

Open Access

Type of Publication

Conference paper

Publisher

Genetic and Evolutionary Computation Conference

access here

05/04/2022

AI for the Media Industry: Application Potential and Automation Levels

access here

Author(s)

Georg Thallinger; Katharina Schell; Verena Krawarik; Victoria Ertelthalner; Werner Bailer;

Institution

Joanneum Research;

Abstract

Tools based on artificial intelligence (AI) are increasingly used in the media industry, addressing a potentially wide range of application areas. Based on a survey involving media professionals and technology providers, we present a taxonomy of application areas of AI in the media industry, including an assessment of the maturity of AI technology for the respective application. As many of these applications require human oversight, either due to insufficient maturity of technology or the need for editorial control, we also propose a classification of automation levels for AI in the media domain, with examples for different stages of the media value chain. Both of these aspects are strongly linked to the role of human users and their interaction with AI technologies. The results suggest that human-AI collaboration in media applications is still an unsolved research question.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

05/04/2022

Making Few-shot Object Detection Simpler and Less Frustrating

access here

Author(s)

Werner Bailer;

Institution

Joanneum Research;

Abstract

Few-shot object detection is useful in order to extend object detection capabilities in media production and archiving applications with specific object classes of interest for a particular organization or production context. While recent approaches for few-shot object detection have advanced the state of the art, they still do not fully meet the requirements of practical workflows, e.g., in media production and archiving. In these applications, annotated samples for novel classes are drawn from different data sources, they differ in numbers and it may be necessary to add a new class quickly to cover the requirements of a specific production. In contrast, current frameworks for few-shot object detection typically assume a static dataset, which is split into the base and novel classes. We propose a toolchain to facilitate training for few-shot object detection, which takes care of data preparation when using heterogeneous training data and setup of training steps. The toolchain also creates annotation files to use combined data sets as new base models, which facilitates class-incremental training. We also integrated the toolchain with an annotation UI.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

30/03/2022

Fast Differentiable Matrix Square Root

access here

Author(s)

Nicu Sebe; Wei Wang; Yue Song

Institution

University of Trento;

Abstract

Computing the matrix square root or its inverse in a differentiable manner is important in a variety of computer vision tasks. Previous methods either adopt the Singular Value Decomposition (SVD) to explicitly factorize the matrix or use the Newton-Schulz iteration (NS iteration) to derive the approximate solution. However, both methods are not computationally efficient enough in either the forward pass or in the backward pass. In this paper, we propose two more efficient variants to compute the differentiable matrix square root. For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Pad´e Approximants (MPA). The backward gradient is computed by iteratively solving the continuous-time Lyapunov equation using the matrix sign function. Both methods yield considerable speed-up compared with the SVD or the Newton-Schulz iteration. Experimental results on the de-correlated batch normalization and second-order vision transformer demonstrate that our methods can also achieve competitive and even slightly better performances. The code is available at https://github.com/KingJamesSong/FastDifferentiableMatSqrt.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Learning Representations

access here

30/03/2022

Sparse to Dense Dynamic 3D Facial Expression Generation

access here

Author(s)

Alberto Del Bimbo; Claudio Ferrari; Mohamed Daoudi; Naima Otberdout; Stefano Berretti

Institution

University of Florence; University of Lille; University of Parma

Abstract

In this paper, we propose a solution to the task of generating dynamic 3D facial expressions from a neutral 3D face and an expression label. This involves solving two sub-problems: (i) modeling the temporal dynamics of expressions, and (ii) deforming the neutral mesh to obtain the expressive counterpart. We represent the temporal evolution of expressions using the motion of a sparse set of 3D landmarks that we learn to generate by training a manifold-valued GAN (Motion3DGAN). To better encode the expression-induced deformation and disentangle it from the identity information, the generated motion is represented as per-frame displacement from a neutral configuration. To generate the expressive meshes, we train a Sparse2Dense mesh Decoder (S2D-Dec) that maps the landmark displacements to a dense, per-vertex displacement. This allows us to learn how the motion of a sparse set of landmarks influences the deformation of the overall face surface, independently from the identity. Experimental results on the CoMA and D3DFACS datasets show that our solution brings significant improvements with respect to previous solutions in terms of both dynamic expression generation and mesh reconstruction, while retaining good generalization to unseen data. The code and the pretrained model will be made publicly available.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

30/03/2022

Multiple Future Prediction Leveraging Synthetic Trajectories

access here

Author(s)

Alberto Del Bimbo; Federico Becattini; Lorenzo Berlincioni; Lorenzo Seidenari;

Institution

University of Florence;

Abstract

Trajectory prediction is an important task, especially in autonomous driving. The ability to forecast the position of other moving agents can yield to an effective planning, ensuring safety for the autonomous vehicle as well for the observed entities. In this work we propose a data driven approach based on Markov Chains to generate synthetic trajectories, which are useful for training a multiple future trajectory predictor. The advantages are twofold: on the one hand synthetic samples can be used to augment existing datasets and train more effective predictors; on the other hand, it allows to generate samples with multiple ground truths, corresponding to diverse equally likely outcomes of the observed trajectory. We define a trajectory prediction model and a loss that explicitly address the multimodality of the problem and we show that combining synthetic and real data leads to prediction improvements, obtaining state of the art results.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

29/03/2022

VWFP: Virtual World Fallen People Dataset for Visual Fallen People Detection

access here

Author(s)

Claudio Gennaro; Fabio Carrara; Fabrizio Falchi; Lorenzo Pasco

Institution

ISTI-CNR; University of Pisa

Abstract

A synthetic dataset for visual fallen people detection comprising images extracted from the highly photo-realistic video game Grand Theft Auto V developed by Rockstar North. Each image is labeled by the game engine providing bounding boxes and statuses (fallen or non-fallen) of people present in the scene. The dataset comprises 6,071 synthetic images depicting 7,456 fallen and 26,125 non-fallen pedestrian instances in various looks, camera positions, background scenes, lightning, and occlusion conditions.

Access

Open Access

Type of Publication

Paper

Publisher

N/A

access here

23/03/2022

A Task Category Space for User-Centric Comparative Multimedia Search Evaluations

access here

Author(s)

Björn Þór Jónsson; Cathal Gurrin; Jakub Lokoč; Jiaxin Wu; Kai Uwe Barthel; Klaus Schoeffmann; Ladislav Peška; Luca Rossetto; Lucia Vadicamo; Silvan Heller; Stefanos Vrochidis; Werner Bailer;

Institution

CERTH - Center for Research and Technology Hellas Charles University; Dublin City University; HTW Berlin; Joanneum Research; University of Basel; University of Copenhagen; University of Hong Kong;

Abstract

In the last decade, user-centric video search competitions have facilitated the evolution of interactive video search systems. So far, these competitions focused on a small number of search task categories, with few attempts to change task category configurations. Based on our extensive experience with interactive video search contests, we have \mbox{analyzed} the spectrum of possible task categories and propose a list of individual axes that define a large space of possible task categories. Using this concept of category space, new user-centric video search competitions can be designed to benchmark video search systems from different perspectives. We further analyse the three task categories considered so far at the Video Browser Showdown and discuss possible (but sometimes challenging) shifts within the task category space.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

18/03/2022

LeQua@CLEF2022: Learning to Quantify

access here

Author(s)

Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani;

Institution

ISTI-CNR;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

17/03/2022

(Compress and Restore) N : a Robust Defense Against Adversarial Attacks on Image Classification

access here

Author(s)

Alberto Del Bimbo; Claudio Ferrari; Federico Becattini; Leonardo Galteri;

Institution

University of Florence; University of Parma

Abstract

Modern image classification approaches often rely on deep neural networks, which have shown pronounced weakness to adversarial examples: images corrupted with specifically designed yet imperceptible noise that causes the network to misclassify. In this paper, we propose a conceptually simple yet robust solution to tackle adversarial attacks on image classification. Our defense works by first applying a JPEG compression with a random quality factor; compression artifacts are subsequently removed by means of a generative model (AR-GAN). The process can be iterated ensuring the image is not degraded and hence the classification not compromised. We train different AR-GANs for different compression factors, so that we can change its parameters dynamically at each iteration depending on the current compression, making the gradient approximation difficult. We experiment our defense against three white-box and two black-box attacks, with a particular focus on the state-of-the-art BPDA attack. Our method does not require any adversarial training, and is independent of both the classifier and the attack. Experiments demonstrate that dynamically changing the AR-GAN parameters is of fundamental importance to obtain significant robustness.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

17/03/2022

Tweet Sentiment Quantification: An Experimental Re-Evaluation

access here

Author(s)

Alejandro Moreo; Fabrizio Sebastiani;

Institution

Italy National Research Council;

Abstract

Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called “prevalence”) of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts. This task is especially important when these texts are tweets, since the final goal of most sentiment classification efforts carried out on Twitter data is actually quantification (and not the classification of individual tweets). It is well-known that solving quantification by means of “classify and count” (i.e., by classifying all unlabelled items by means of a standard classifier and counting the items that have been assigned to a given class) is less than optimal in terms of accuracy, and that more accurate quantification methods exist. Gao and Sebastiani~\cite{Gao:2016uq carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimentation carried out in that work was weak, and that the reliability of the conclusions that were drawn from the results is thus questionable. We here re-evaluate those quantification methods (plus a few more modern ones) on exactly the same datasets, this time following a now consolidated and robust experimental protocol (which also involves simulating the presence, in the test data, of class prevalence values very different from those of the training set). This experimental protocol (even without counting the newly added methods) involves a number of experiments 5,775 times larger than that of the original study. Due to the above-mentioned presence, in the test data, of samples characterised by class prevalence values very different from those of the training set, the results of our experiments are dramatically different from those obtained by Gao and Sebastiani, and provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

17/03/2022

The ImageCLEFAware 2021 Dataset

access here

Author(s)

Adrian Popescu; Bogdan Ionescu; Jérôme Deshayes-Chossart;

Institution

Université Paris-Saclay; University Politehnica of Bucharest

Abstract

Images constitute a large part of the content shared on social networks. Their disclosure is often related to a particular context and users are often unaware of the fact that, depending on their privacy status, images can be accessible to third parties and be used for purposes which were initially unforeseen. For instance, it is common practice for employers to search information about their future employees online. Another example of usage is that of automatic credit scoring based on online data. Most existing approaches which propose feedback about shared data focus on inferring user characteristics and their practical utility is rather limited. We hypothesize that user feedback would be more efficient if conveyed through the real-life effects of data sharing. The objective of the task is to automatically score user photographic profiles in a series of situations with strong impact on her/his life. Four such situations were modeled this year and refer to searching for: (1) a bank loan, (2) an accommodation, (3) a job as waitress/waiter and (4) a job in IT. The inclusion of several situations is interesting in order to make it clear to the end users of the system that the same image will be interpreted differently depending on the context. The final objective of the task is to encourage the development of efficient user feedback, such as the YDSYO Android app.

Access

Open Access

Type of Publication

Paper

Publisher

N/A

access here

16/03/2022

Geometry-Contrastive Transformer for Generalized 3D Pose Transfer

access here

Author(s)

Guoying Zhao Hao Tang; Haoyou Chen Nicu Sebe; Zitong Yu

Institution

ETH Zurich; University of Oulu; University of Trento;

Abstract

We present a customized 3D mesh Transformer model for the pose transfer task. As the 3D pose transfer essentially is a deformation procedure dependent on the given meshes, the intuition of this work is to perceive the geometric inconsistency between the given meshes with the powerful self-attention mechanism. Specifically, we propose a novel geometry-contrastive Transformer that has an efficient 3D structured perceiving ability to the global geometric inconsistencies across the given meshes. Moreover, locally, a simple yet efficient central geodesic contrastive loss is further proposed to improve the regional geometric-inconsistency learning. At last, we present a latent isometric regularization module together with a novel semi-synthesized dataset for the cross-dataset 3D pose transfer task towards unknown spaces. The massive experimental results prove the efficacy of our approach by showing state-of-the-art quantitative performances on SMPL-NPT, FAUST and our new proposed dataset SMG- 3D datasets, as well as promising qualitative results on MGcloth and SMAL datasets. It’s demonstrated that our method can achieve robust 3D pose transfer and be generalized to challenging meshes from unknown spaces on cross-dataset tasks. The code and dataset are made available. Code is available: https://github.com/mikecheninoulu/CGT.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Artificial Intelligence

access here

14/03/2022

p2pGNN: A Decentralized Graph Neural Network for Node Classification in Peer-to-Peer Networks

access here

Author(s)

Emmanouil Krasanakis; Symeon Papadopoulos

Institution

CERTH - Center for Research and Technology Hellas

Abstract

In this work, we aim to classify nodes of unstructured peer-to-peer networks with communication uncertainty, such as users of decentralized social networks. Graph Neural Networks (GNNs) are known to improve the accuracy of simpler classifiers in centralized settings by leveraging naturally occurring network links, but graph convolutional layers are challenging to implement in decentralized settings when node neighbors are not constantly available.We address this problem by employing decoupled GNNs, where base classifier predictions and errors are diffused through graphs after training. For these, we deploy pre-trained and gossip-trained base classifiers and implement peer-to-peer graph diffusion under communication uncertainty. In particular, we develop an asynchronous decentralized formulation of diffusion that converges at centralized predictions in distribution and linearly with respect to communication rates. We experiment on three real-world graphs with node features and labels and simulate peer-to-peer networks with uniformly random communication frequencies; given a portion of known labels, our decentralized graph diffusion achieves comparable accuracy to centralized GNNs with minimal communication overhead (less than 3% of what gossip training already adds).

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Access

access here

06/03/2022

solo-learn: A Library of Self-supervised Methods for Visual Representation Learning

access here

Author(s)

Elisa Ricci; Enrico Fini; Moin Nabi; Nicu Sebe; Victor da Costa

Institution

SAP AI Research University of Trento;

Abstract

This paper presents solo-learn, a library of self-supervised methods for visual representation learning. Implemented in Python, using Pytorch and Pytorch lightning, the library ts both research and industry needs by featuring distributed training pipelines with mixed-precision, faster data loading via Nvidia DALI, online linear evaluation for better prototyping, and many additional training tricks. Our goal is to provide an easy-to- use library comprising a large amount of Self-supervised Learning (SSL) methods, that can be easily extended and ne-tuned by the community. solo-learn opens up avenues for exploiting large-budget SSL solutions on inexpensive smaller infrastructures and seeks to democratize SSL by making it accessible to all. The source code is available at https://github.com/vturrisi/solo-learn.

Access

Open Access

Type of Publication

Journal article

Publisher

Journal of Machine Learning Research

access here

02/03/2022

Efficient Training of Visual Transformers with Small Datasets

access here

Author(s)

Bruno Lepri; Enver Sangineto; Marco de Nadai; Nicu Sebe; Wei Bi; Yahui Liu

Institution

FBK; Tencent AI Lab University of Trento;

Abstract

Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs). Differently from CNNs, VTs can capture global relations between image elements and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data hungry than common CNNs. In fact, some local properties of the visual domain which are embedded in the CNN architectural design, in VTs should be learned from samples. In this paper, we empirically analyse different VTs, comparing their robustness in a small training set regime, and we show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different. Moreover, we propose an auxiliary selfsupervised task which can extract additional information from images with only a negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes the VT training much more robust when training data is scarce. Our task is used jointly with the standard (supervised) training and it does not depend on specific architectural choices, thus it can be easily plugged in the existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of the VTs. Our code is available at: https://github.com/ yhlleo/VTs-Drloc.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Neural Information Processing Systems

access here

22/02/2022

The Pixels and Sounds of Emotion: General-Purpose Representations of Arousal in Games

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Konstantinos Makantasis;

Institution

University of Malta

Abstract

What if emotion could be captured in a general and subject-agnostic fashion? Is it possible, for instance, to design general-purpose representations that detect affect solely from the pixels and audio of a human-computer interaction video? In this paper we address the above questions by evaluating the capacity of deep learned representations to predict affect by relying only on audiovisual information of videos. We assume that the pixels and audio of an interactive session embed the necessary information required to detect affect. We test our hypothesis in the domain of digital games and evaluate the degree to which deep classifiers and deep preference learning algorithms can learn to predict the arousal of players based only on the video footage of their gameplay. Our results from four dissimilar games suggest that general-purpose representations can be built across games as the arousal models obtain average accuracies as high as 85% using the challenging leave-one-video-out cross-validation scheme. The dissimilar audiovisual characteristics of the tested games showcase the strengths and limitations of the proposed method.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Affective Computing

access here

10/02/2022

Counting or Localizing? Evaluating Cell Counting and Detection in Microscopy Images

access here

Author(s)

Claudio Gennaro; Fabio Carrara; Giuseppe Amato; Luca Ciampi;

Institution

ISTI-CNR;

Abstract

Image-based automatic cell counting is an essential yet challenging task, crucial for the diagnosing of many diseases. Current solutions rely on Convolutional Neural Networks and provide astonishing results. However, their performance is often measured only considering counting errors, which can lead to masked mistaken estimations; a low counting error can be obtained with a high but equal number of false positives and false negatives. Consequently, it is hard to determine which solution truly performs best. In this work, we investigate three general counting approaches that have been successfully adopted in the literature for counting several different categories of objects. Through an experimental evaluation over three public collections of microscopy images containing marked cells, we assess not only their counting performance compared to several state-of-the-art methods but also their ability to correctly localize the counted cells. We show that commonly adopted counting metrics do not always agree with the localization performance of the tested models, and thus we suggest integrating the proposed evaluation protocol when developing novel cell counting solutions.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

02/02/2022

Evaluating Hebbian Learning in a Semi-supervised Setting

access here

Author(s)

Claudio Gennaro; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;

Institution

ISTI-CNR; University of Pisa

Abstract

We propose a semi-supervised learning strategy for deep Convolutional Neural Networks (CNNs) in which an unsupervised pre-training stage, performed using biologically inspired Hebbian learning algorithms, is followed by supervised end-to-end backprop fine-tuning. We explored two Hebbian learning rules for the unsupervised pre-training stage: soft-Winner-Takes-All (soft-WTA) and nonlinear Hebbian Principal Component Analysis (HPCA). Our approach was applied in sample efficiency scenarios, where the amount of available labeled training samples is very limited, and unsupervised pre-training is therefore beneficial. We performed experiments on CIFAR10, CIFAR100, and Tiny ImageNet datasets. Our results show that Hebbian outperforms Variational Auto-Encoder (VAE) pre-training in almost all the cases, with HPCA generally performing better than soft-WTA.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

01/02/2022

Reinforced Damage Minimization in Critical Events for Self-driving Vehicles

access here

Author(s)

Claudio Gennaro; Fabrizio Falchi; Francesco Merola; Marco Di Benedetto

Institution

ISTI-CNR;

Abstract

Self-driving systems have recently received massive attention in both academic and industrial contexts, leading to major improvements in standard navigation scenarios typically identified as well-maintained urban routes. Critical events like road accidents or unexpected obstacles, however, require the execution of specific emergency actions that deviate from the ordinary driving behavior and are therefore harder to incorporate in the system. In this context, we propose a system that is specifically built to take control of the vehicle and perform an emergency maneuver in case of a dangerous scenario. The presented architecture is based on a deep reinforcement learning algorithm, trained in a simulated environment and using raw sensory data as input. We evaluate the system’s performance on several typical pre-accident scenario and show promising results, with the vehicle being able to consistently perform an avoidance maneuver to nullify or minimize the incoming damage.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

01/02/2022

Training Convolutional Neural Networks with Competitive Hebbian Learning Approaches

access here

Author(s)

Claudio Gennaro; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;

Institution

ISTI-CNR; University of Pisa

Abstract

We explore competitive Hebbian learning strategies to train feature detectors in Convolutional Neural Networks (CNNs), without supervision. We consider variants of the Winner-Takes-All (WTA) strategy explored in previous works, i.e. k-WTA, e-soft-WTA and p-soft-WTA, performing experiments on different object recognition datasets. Results suggest that the Hebbian approaches are effective to train early feature extraction layers, or to re-train higher layers of a pre-trained network, with soft competition generally performing better than other Hebbian approaches explored in this work. Our findings encourage a path of cooperation between neuroscience and computer science towards a deeper investigation of biologically inspired learning principles.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

18/01/2022

Comparing the performance of Hebbian against backpropagation learning using convolutional neural networks

access here

Author(s)

Claudio Gennaro; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;

Institution

ISTI-CNR; University of Pisa

Abstract

In this paper, we investigate Hebbian learning strategies applied to Convolutional Neural Network (CNN) training. We consider two unsupervised learning approaches, Hebbian Winner-Takes-All (HWTA), and Hebbian Principal Component Analysis (HPCA). The Hebbian learning rules are used to train the layers of a CNN in order to extract features that are then used for classification, without requiring backpropagation (backprop). Experimental comparisons are made with state-of-the-art unsupervised (but backprop-based) Variational Auto-Encoder (VAE) training. For completeness,we consider two supervised Hebbian learning variants (Supervised Hebbian Classifiers—SHC, and Contrastive Hebbian Learning—CHL), for training the final classification layer, which are compared to Stochastic Gradient Descent training. We also investigate hybrid learning methodologies, where some network layers are trained following the Hebbian approach, and others are trained by backprop. We tested our approaches on MNIST, CIFAR10, and CIFAR100 datasets. Our results suggest that Hebbian learning is generally suitable for training early feature extraction layers, or to retrain higher network layers in fewer training epochs than backprop. Moreover, our experiments show that Hebbian learning outperforms VAE training, with HPCA performing generally better than HWTA.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

10/01/2022

AVAE: Adversarial Variational Auto Encoder

access here

Author(s)

Antoine Plumerault; Céline Hudelot; Hervé Le Borgne;

Institution

Université Paris-Saclay;

Abstract

Among the wide variety of image generative models, two models stand out: Variational Auto Encoders (VAE) and Generative Adversarial Networks (GAN). GANs can produce realistic images, but they suffer from mode collapse and do not provide simple ways to get the latent representation of an image. On the other hand, VAEs do not have these problems, but they often generate images less realistic than GANs. In this article, we explain that this lack of realism is partially due to a common underestimation of the natural image manifold dimensionality. To solve this issue we introduce a new framework that combines VAE and GAN in a novel and complementary way to produce an auto-encoding model that keeps VAEs properties while generating images of GAN-quality. We evaluate our approach both qualitatively and quantitatively on five image datasets.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

10/01/2022

AffectGAN: Affect-Based Generative Art Driven by Semantics

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Theodoros Galanos

Institution

University of Malta

Abstract

This paper introduces a novel method for generating artistic images that express particular affective states. Leveraging state-of-the-art deep learning methods for visual generation (through generative adversarial networks), semantic models from OpenAI, and the annotated dataset of the visual art encyclopedia WikiArt, our AffectGAN model is able to generate images based on specific or broad semantic prompts and intended affective outcomes. A small dataset of 32 images generated by AffectGAN is annotated by 50 participants in terms of the particular emotion they elicit, as well as their quality and novelty. Results show that for most instances the intended emotion used as a prompt for image generation matches the participants’ responses. This small-scale study brings forth a new vision towards blending affective computing with computational creativity, enabling generative systems with intentionality in terms of the emotions they wish their output to elicit.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Affective Computing and Intelligent Interaction Workshops

access here

10/01/2022

Go-Blend Behavior and Affect

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Matthew Barthet

Institution

University of Malta

Abstract

This paper proposes a paradigm shift for affective computing by viewing the affect modeling task as a reinforcement learning process. According to our proposed framework the context (environment) and the actions of an agent define the common representation that interweaves behavior and affect. To realise this framework we build on recent advances in reinforcement learning and use a modified version of the Go-Explore algorithm which has showcased supreme performance in hard exploration tasks. In this initial study, we test our framework in an arcade game by training Go-Explore agents to both play optimally and attempt to mimic human demonstrations of arousal. We vary the degree of importance between optimal play and arousal imitation and create agents that can effectively display a palette of affect and behavioral patterns. Our Go-Explore implementation not only introduces a new paradigm for affect modeling; it empowers believable AI-based game testing by providing agents that can blend and express a multitude of behavioral and affective patterns.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Affective Computing and Intelligent Interaction Workshops

access here

10/01/2022

Language Based Image Quality Assessment

access here

Author(s)

Alberto Del Bimbo; Leonardo Galteri; Lorenzo Seidenari; Marco Bertini; Pietro Bongini

Institution

University of Florence;

Abstract

Evaluation of generative models, in the visual domain, is often performed providing anecdotal results to the reader. In the case of image enhancement, reference images are usually available. Nonetheless, using signal based metrics often leads to counterintuitive results: highly natural crisp images may obtain worse scores than blurry ones. On the other hand, blind reference image assessment may rank images reconstructed with GANs higher than the original undistorted images. To avoid time consuming human based image assessment, semantic computer vision tasks may be exploited instead [9, 25, 33]. In this paper we advocate the use of language generation tasks to evaluate the quality of restored images. We show experimentally that image captioning, used as a downstream task, may serve as a method to score image quality. Captioning scores are better aligned with human rankings with respect to signal based metrics or no-reference image quality metrics. We show insights on how the corruption, by artifacts, of local image structure may steer image captions in the wrong direction.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

06/01/2022

Effective triplet mining improves training of multi-scale pooled CNN for image retrieval

access here

Author(s)

Alberto Del Bimbo; Federico Vaccaro; Marco Bertini; Tiberio Uricchio

Institution

University of Florence;

Abstract

In this paper, we address the problem of content-based image retrieval (CBIR) by learning images representations based on the activations of a Convolutional Neural Network. We propose an end-to-end trainable network architecture that exploits a novel multi-scale local pooling based on the trainable aggregation layer NetVLAD (Arandjelovic et al in Proceedings of the IEEE conference on computer vision and pattern recognition CVPR, NetVLAD, 2016) and bags of local features obtained by splitting the activations, allowing to reduce the dimensionality of the descriptor and to increase the performance of retrieval. Training is performed using an improved triplet mining procedure that selects samples based on their difficulty to obtain an effective image representation, reducing the risk of overfitting and loss of generalization. Extensive experiments show that our approach, that can be effectively used with different CNN architectures, obtains state-of-the-art results on standard and challenging CBIR datasets.

Access

Closed Access

Type of Publication

Journal article

Publisher

N/A

access here

04/01/2022

Face Verification with Challenging Imposters and Diversified Demographics

access here

Author(s)

Adrian Popescu; Bogdan Ionescu; Jérôme Deshayes-Chossart; Liviu-Daniel Stefan;

Institution

Université Paris-Saclay; University Politehnica of Bucharest

Abstract

Face verification aims to distinguish between genuine and imposter pairs of faces, which include the same or dif-
ferent identities, respectively. The performance reported in recent years gives the impression that the task is practically solved. Here, we revisit the problem and argue that existing evaluation datasets were built using two oversimplifying design choices. First, the usual identity selection to form imposter pairs is not challenging enough because, in practice, verification is needed to detect challenging imposters.
Second, the underlying demographics of existing datasets are often insufficient to account for the wide diversity of
facial characteristics of people from across the world. To mitigate these limitations, we introduce the F aV CI2D
dataset. Imposter pairs are challenging because they include visually similar faces selected from a large pool of
demographically diversified identities. The dataset also includes metadata related to gender, country and age to facilitate fine-grained analysis of results. F aV CI2D is generated from freely distributable resources. Experiments with state-of-the-art deep models that provide nearly 100% performance on existing datasets show a significant performance drop for FaVCI2D, confirming our starting hypothesis. Equally important, we analyze legal and ethical challenges which appeared in recent years and hindered the development of face analysis research. We intro-
duce a series of design choices which address these challenges and make the dataset constitution and usage more
sustainable and fairer.

Access

Open Access

Type of Publication

Conference paper

Publisher

Winter Conference on Applications of Computer Vision

access here

04/01/2022

Dataset Knowledge Transfer for Class-Incremental Learning without Memory

access here

Author(s)

Adrian Popescu; Darian Onchis; Eden Belouadah; Habib Slim

Institution

IMT Atlantic; Université Paris-Saclay; West University of Timisoara

Abstract

Incremental learning enables artificial agents to learn from sequential data. While important progress was made by exploiting deep neural networks, incremental learning remains very challenging. This is particularly the case when no memory of past data is allowed and catastrophic forgetting has a strong negative effect. We tackle class-incremental learning without memory by adapting prediction bias correction, a method which makes predictions of past and new classes more comparable. It was proposed when a memory is allowed and cannot be directly used without memory, since samples of past classes are required. We introduce a two-step learning process which allows the transfer of bias correction parameters between reference and target datasets. Bias correction is first optimized offline on reference datasets which have an associated validation memory. The obtained correction parameters are then transferred to target datasets, for which no memory is available. The second contribution is to introduce a finer modeling of bias correction by learning its parameters per incremental state instead of the usual past vs. new class modeling. The proposed dataset knowledge transfer is applicable to any incremental method which works without memory. We test its effectiveness by applying it to four existing methods. Evaluation with four target datasets and different configurations shows consistent improvement, with practically no computational and memory overhead.

Access

Open Access

Type of Publication

Conference paper

Publisher

Winter Conference on Applications of Computer Vision

access here

04/01/2022

Unveiling Real-Life Effects of Online Photo Sharing

access here

Author(s)

Adrian Popescu; Jérôme Deshayes-Chossart; Van-Khoa Nguyen

Institution

Université Paris-Saclay;

Abstract

Social networks give free access to their services in exchange for the right to exploit their users’ data. Data sharing is done in an initial context which is chosen by the users. However, data are used by social networks and third parties in different contexts which are often not transparent. In order to unveil such usages, we propose an approach which focuses on the effects of data sharing in impactful real-life situations. Focus is put on visual content because of its strong influence in shaping online user profiles. The approach relies on three components: (1) a set of visual objects with associated situation impact ratings obtained by crowdsourcing, (2) a corresponding set of object detectors for mining users’ photos and (3) a ground truth dataset made of 500 visual user profiles which are manually rated per situation. These components are combined in LERVUP, a method which learns to rate visual user profiles in each situation. LERVUP exploits a new image descriptor which aggregates object ratings and object detections at user level and an attention mechanism which boosts highly-rated objects to prevent them from being overwhelmed by low-rated ones. Performance is evaluated per situation by measuring the correlation between the automatic ranking of profile ratings and a manual ground truth. Results indicate that LERVUP is effective since a strong correlation of the two rankings is obtained. A practical implementation of the approach in a mobile app which raises user awareness about shared data usage is also discussed.

Access

Open Access

Type of Publication

Conference paper

Publisher

Winter Conference on Applications of Computer Vision

access here

22/12/2021

RosneT: A Block Tensor Algebra Library for Out-of-Core Quantum Computing Simulation

access here

Author(s)

Anna Queralt Artur Garcia-Saez; Francesc Lordan Javier Conejero Rosa M. Badia Sergio Sanchez-Ramirez Toni Cortes

Institution

Barcelona Supercomputing Center; Universitat Politecnica de Catalunya

Abstract

With the advent of more powerful Quantum Computers, the need for larger Quantum Simulations has boosted. As the amount of resources grows exponentially with size of the target system Tensor Networks emerge as an optimal framework with which we represent Quantum States in tensor factorizations. As the extent of a tensor network increases, so does the size of intermediate tensors requiring HPC tools for their manipulation. Simulations of medium-sized circuits cannot fit on local memory, and solutions for distributed contraction of tensors are scarce. In this work we present RosneT, a library for distributed, out-ofcore block tensor algebra. We use the PyCOMPSs programming model to transform tensor operations into a collection of tasks handled by the COMPSs runtime, targeting executions in existing and upcoming Exascale supercomputers. We report results validating our approach showing good scalability in simulations of Quantum circuits of up to 53 qubits.

Access

Open Access

Type of Publication

Publication

Publisher

IEEE/ACM Second International Workshop on Quantum Computing Software

access here

20/12/2021

Using Vision Transformers and Memorable Moments for the Prediction of Video Memorability

access here

Author(s)

Bogdan Ionescu; Mihai Gabriel Constantin

Institution

University Politehnica of Bucharest

Abstract

This paper describes the approach taken by the AI Multimedia Lab team for the MediaEval 2021 Predicting Media Memorability task. Our approach is based on a Vision Transformer-based learning method, which is optimized by filtering the training sets for the two proposed datasets.We attempt to train the methods we propose with video segments that are more representative for the videos they are part of. We test several types of filtering architectures, and submit and test the architectures that best performed in our preliminary studies.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

16/12/2021

Qibo: a framework for quantum simulation with hardware acceleration

access here

Author(s)

Adrián Pérez-Salinas Artur Garcia-Saez; Carlos Bravo-Prieto Diego Gárcia-Martín José I. Latorre Sergi Ramos-Calderer Stavros Efthymiou

Institution

Barcelona Supercomputing Center; Center fo Quantum Technologies Instituto de Física Teórica Qilimanjaro Quantum Tech Quantum Research Centre Universidad de Barcelona University of Milano- Bicocca

Abstract

We present Qibo, a new open-source software for fast evaluation of quantum circuits and adiabatic evolution which takes full advantage of hardware accelerators. The growing interest in quantum computing and the recent developments of quantum hardware devices motivates the development of new advanced computational tools focused on performance and usage simplicity. In this work we introduce a new quantum simulation framework that enables developers to delegate all complicated aspects of hardware or platform implementation to the library so they can focus on the problem and quantum algorithms at hand. This software is designed from scratch with simulation performance, code simplicity and user friendly interface as target goals. It takes advantage of hardware acceleration such as multi-threading CPU, single GPU and multi-GPU devices.

Access

Open Access

Type of Publication

Publication

Publisher

QST

access here

12/12/2021

Conditioned Image Retrieval for Fashion using Contrastive Learning and CLIP-based Features

access here

Author(s)

Alberto Baldrati; Alberto Del Bimbo; Marco Bertini; Tiberio Uricchio

Institution

University of Florence;

Abstract

Building on the recent advances in multimodal zero-shot represen-
tation learning, in this paper we explore the use of features obtained
from the recent CLIP model to perform conditioned image retrieval.
Starting from a reference image and an additive textual description
of what the user wants with respect to the reference image, we
learn a Combiner network that is able to understand the image
content, integrate the textual description and provide combined
feature used to perform the conditioned image retrieval. Starting
from the bare CLIP features and a simple baseline, we show that
a carefully crafted Combiner network, based on such multimodal
features, is extremely effective and outperforms more complex state
of the art approaches on the popular FashionIQ dataset.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

07/12/2021

Towards General Models of Player Experience: A Study Within Genres

access here

Author(s)

Antonios Liapis; David Melhart; Georgios N. Yannakakis;

Institution

University of Malta

Abstract

To which degree can abstract gameplay metrics capture the player experience in a general fashion within a game genre? In this comprehensive study we address this question across three different videogame genres: racing, shooter, and platformer games. Using high-level gameplay features that feed preference learning models we are able to predict arousal accurately across different games of the same genre in a large-scale dataset of over 1,000 arousal-annotated play sessions. Our genre models predict changes in arousal with up to 74% accuracy on average across all genres and 86% in the best cases. We also examine the feature importance during the modelling process and find that time-related features largely contribute to the performance of both game and genre models. The prominence of these game-agnostic features show the importance of the temporal dynamics of the play experience in modelling, but also highlight some of the challenges for the future of general affect modelling in games and beyond.

Access

Open Access

Type of Publication

Conference paper

Publisher

IEEE Conference on Games

access here

07/12/2021

Contrastive Learning of Generalized Game Representations

access here

Author(s)

Antonios Liapis; Chintan Triverdi; Georgios N. Yannakakis;

Institution

University of Malta

Abstract

Representing games through their pixels offers a promising approach for building general-purpose and versatile game models. While games are not merely images, neural network models trained on game pixels often capture differences of the visual style of the image rather than the content of the game. As a result, such models cannot generalize well even within similar games of the same genre. In this paper we build on recent advances in contrastive learning and showcase its benefits for representation learning in games. Learning to contrast images of games not only classifies games in a more efficient manner; it also yields models that separate games in a more meaningful fashion by ignoring the visual style and focusing, instead, on their content. Our results in a large dataset of sports video games containing 100k images across 175 games and 10 game genres suggest that contrastive learning is better suited for learning generalized game representations compared to conventional supervised learning. The findings of this study bring us closer to universal visual encoders for games that can be reused across previously unseen games without requiring retraining or fine-tuning.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

01/12/2021

Learning to quantify: LeQua 2022 datasets

access here

Author(s)

Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani;

Institution

ISTI-CNR;

Abstract

The aim of LeQua 2022 (the 1st edition of the CLEF “Learning to Quantify” lab) is to allow the comparative evaluation of methods for “learning to quantify” in textual datasets, i.e., methods for training predictors of the relative frequencies of the classes of interest in sets of unlabelled textual documents. These predictors (called “quantifiers”) will be required to issue predictions for several such sets, some of them characterized by class frequencies radically different from the ones of the training set.

Access

Open Access

Type of Publication

Paper

Publisher

N/A

access here

30/11/2021

Generation of Realistic Synthetic Financial Time-Series

access here

Author(s)

Bogdad Andrei Boteanu; Bogdan Ionescu; Bomi Kim; Claudiu Lamba; Liviu-Daniel Stefan; Mihai Dogariu

Institution

Hana Institute of Technology; University Politehnica of Bucharest

Abstract

Financial markets have always been a point of interest for automated systems. Due to their complex nature, financial algorithms and fintech frameworks require vast amounts of data to accurately respond to market fluctuations. This data availability is tied to the daily market evolution so it is impossible to accelerate its acquisition. In this paper, we discuss several solutions for augmenting financial datasets via synthesizing realistic time-series with the help of generative models. This problem is complex since financial time series present very specific properties, e.g., fat-tail distribution, cross-correlation between different stocks, specific autocorrelation, cluster volatility etc. In particular, we propose solutions for capturing cross-correlations between different stocks and for transitioning from fixed to variable length time-series without resorting to sequence modeling networks, and adapt various network architectures, e.g., fully connected and convolutional GANs, variational autoencoders, and generative moment matching networks. Finally, we tackle the problem of evaluating the quality of synthetic financial time-series. We introduce qualitative and quantitative metrics, along with a portfolio trend prediction framework which validates our generative models’ performance. We carry out experiments on real-world financial data extracted from the US stock market proving the benefits of these techniques.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

22/11/2021

Tensor Component Analysis for Interpreting the Latent Space of GANs

access here

Author(s)

Ioannis Patras; James Oldfield; Markos Georgopoulos; Mihalis Nicolaou; Yannis Panagakis

Institution

Cyprus Institute; Imperial College London; Queen Mary University of London; University of Athens

Abstract

This paper addresses the problem of finding interpretable directions in the latent space of pre-trained Generative Adversarial Networks (GANs) to facilitate controllable image synthesis. Such interpretable directions correspond to transformations that can affect both the style and geometry of the synthetic images. However, existing approaches
that utilise linear techniques to find these transformations often fail to provide an intuitive way to separate these two sources of variation. To address this, we propose to a) perform a multilinear decomposition of the tensor of intermediate representations, and b) use a tensor-based regression to map directions found using this decomposition to the latent space. Our scheme allows for both linear edits corresponding to the individual modes of the tensor, and non-linear ones that model the multiplicative interactions between them. We show experimentally that we can utilise the former to better separate style- from geometry-based transformations, and the latter to generate an extended set of possible transformations in comparison to prior works. We demonstrate our approach’s efficacy both quantitatively and qualitatively compared to the current state-of-the-art.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

22/11/2021

Procedural Terrain Generation Using Generative Adversarial Networks

access here

Author(s)

George Voulgaris; Ioannis Mademlis; Ioannis Pitas;

Institution

Aristotle University of Thessaloniki;

Abstract

Synthetic terrain realism is critical in VR applications based on computer graphics (e.g., games, simulations). Although fast procedural algorithms for automated terrain generation do exist, they still require human effort. This paper proposes a novel approach to procedural terrain generation, relying on Generative Adversarial Networks (GANs). The neural model is trained using terrestrial Points-of-Interest (PoIs, described by their geodesic coordinates/altitude) and publicly available corresponding satellite images. After training is complete, the GAN can be employed for deriving realistic terrain images on-the- fly, by merely forwarding through it a rough 2D scatter plot of desired PoIs in image form (so-called “altitude image”). We demonstrate that such a GAN is able to translate this rough, quickly produced sketch into an actual photorealistic terrain image. Additionally, we describe a strategy for enhancing the visual diversity of trained model synthetic output images, by tweaking input altitude image orientation during GAN training. Finally, we perform an objective and a subjective evaluation of the proposed method. Results validate the latter’s ability to
rapidly create life-like terrain images from minimal input data.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

15/11/2021

Locally Private Graph Neural Networks

access here

Author(s)

Daniel Gatica-Perez; Sina Sajadmanesh

Institution

Idiap Research Institute

Abstract

Graph Neural Networks (GNNs) have demonstrated superior performance in learning node representations for various graph inference tasks. However, learning over graph data can raise privacy concerns when nodes represent people or human-related variables that involve sensitive or personal information. While numerous techniques have been proposed for privacy-preserving deep learning over non-relational data, there is less work addressing the privacy issues pertained to applying deep learning algorithms on graphs. In this paper, we study the problem of node data privacy, where graph nodes have potentially sensitive data that is kept private, but they could be beneficial for a central server for training a GNN over the graph. To address this problem, we develop a privacy-preserving, architecture-agnostic GNN learning algorithm with formal privacy guarantees based on Local Differential Privacy (LDP). Specifically, we propose an LDP encoder and an unbiased rectifier, by which the server can communicate with the graph nodes to privately collect their data and approximate the GNN’s first layer. To further reduce the effect of the injected noise, we propose to prepend a simple graph convolution layer, called KProp, which is based on the multi-hop aggregation of the nodes’ features acting as a denoising mechanism. Finally, we propose a robust training framework, in which we benefit from KProp’s denoising capability to increase the accuracy of inference in the presence of noisy labels. Extensive experiments conducted over real-world datasets demonstrate that our method can maintain a satisfying level of accuracy with low privacy loss.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

15/11/2021

HeMoG: A White-Box Model to Unveil the Connection Between Saliency Information and Human Head Motion in Virtual Reality

access here

Author(s)

Dario Zanca; Lucile Sassatelli; Marco Goro; Miguel Rondon; Stefano Melacci

Institution

Friedrich-Alexander-Universität; Institut Universitaire de France; Université Côte d'Azur; University of Siena

Abstract

Immersive environments such as Virtual Reality (VR) are now a main area of interactive digital entertainment. The challenge to design personalized interactive VR systems is specifically to guide and adapt to the user’s attention. Understanding the connection between the visual content and the human attentional process is therefore key. In this article, we investigate this connection by first proposing a new head motion predictor named HeMoG. HeMoG is a white-box model built on physics of rotational motion and gravitation. Second, we compare HeMoG with existing reference Deep Learning models. We show that HeMoG can achieve similar or better performance and provides insights on the inner workings of these black-box models. Third, we study HeMoG parameters in terms of video categories and prediction horizons to gain knowledge on the connection between visual saliency and the head motion process.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Artificial Intelligence and Virtual Reality

access here

02/11/2021

QuaPy

access here

Author(s)

Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani;

Institution

ISTI-CNR;

Abstract

QuaPy is an open source framework for Quantification (a.k.a. Supervised Prevalence Estimation) written in Python.

QuaPy roots on the concept of data sample, and provides implementations of most important concepts in quantification literature, such as the most important quantification baselines, many advanced quantification methods, quantification-oriented model selection, many evaluation measures and protocols used for evaluating quantification methods. QuaPy also integrates commonly used datasets and offers visualization tools for facilitating the analysis and interpretation of results.

Access

Open Access

Type of Publication

Paper

Publisher

N/A

access here

02/11/2021

Video Summarization Using Deep Neural Networks: A Survey

access here

Author(s)

Alexandros Metsai; Eleni Adamantidou; Evlampios Apostolidis; Ioannis Patras;

Institution

CERTH - Center for Research and Technology Hellas QMUL

Abstract

Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades, and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature that shows the evolution of the deep-learning-based video summarization technologies and leads to suggestions for future developments. We then report on protocols for the objective evaluation of video summarization algorithms, and we compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the amount of annotated data and the suitability of evaluation protocols, we indicate potential future research directions.

Access

Open Access

Type of Publication

Journal article

Publisher

Proceedings of the IEEE

access here

01/11/2021

Learning to Quantify: Methods and Applications (LQ 2021)

access here

Author(s)

Alejandro Moreo; Fabrizio Sebastiani; Juan José del Coz; Pablo González

Institution

Consiglio Nazionale delle Ricerche; University of Oviedo

Abstract

Learning to Quantify (LQ) is the task of training class prevalence estimators via supervised learning. The task of these estimators is to estimate, given an unlabelled set of data items D and a set of classes C = {c1, . . . , c|C|}, the prevalence (i.e., relative frequency) of each class ci in D. LQ is interesting in all applications of classification in which the final goal is not determining which class (or classes) individual unlabelled data items belong to, but estimating the distribution of the unlabelled data items across the classes of interest. Example disciplines whose interest in labelling data items is at the aggregate level (rather than at the individual level) are the social sciences, political science, market research, ecological modelling, and epidemiology. While LQ may in principle be solved by classifying each data item in D and counting how many such items have been labelled with ci, it has been shown that this “classify and count” (CC) method yields suboptimal quantification accuracy. As a result, quantification is now no longer considered a mere byproduct of classification and has evolved as a task of its own. The goal of this workshop is bringing together all researchers interested in methods, algorithms, and evaluation measures and methodologies for LQ, as well as practitioners interested in their practical application to managing large quantities of data.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

27/10/2021

Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction

access here

Author(s)

Elisa Ricci; Guanglei Yang Hao Tang; Mingli Ding Niculae Sebe

Institution

ETH Zurich; Harbin Institute of Technology; University of Trento;

Abstract

While convolutional neural networks have shown a
tremendous impact on various computer vision tasks, they
generally demonstrate limitations in explicitly modeling
long-range dependencies due to the intrinsic locality of
the convolution operation. Initially designed for natural
language processing tasks, Transformers have emerged as
alternative architectures with innate global self-attention
mechanisms to capture long-range dependencies. In this
paper, we propose TransDepth, an architecture that benefits
from both convolutional neural networks and transformers.
To avoid the network losing its ability to capture locallevel
details due to the adoption of transformers, we propose
a novel decoder that employs attention mechanisms
based on gates. Notably, this is the first paper that applies
transformers to pixel-wise prediction problems involving
continuous labels (i.e., monocular depth prediction and
surface normal estimation). Extensive experiments demonstrate
that the proposed TransDepth achieves state-of-theart
performance on three challenging datasets. Our code is
available at: https://github.com/ygjwd12345/TransDepth.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Computer Vision

access here

27/10/2021

Intrinsic-Extrinsic Preserved GANs for Unsupervised 3D Pose Transfer

access here

Author(s)

Guoying Zhao Hao Tang; Haoyou Chen Henglin Shi; Nicu Sebe; Wei Peng

Institution

ETH Zurich; University of Oulu; University of Trento;

Abstract

With the strength of deep generative models, 3D pose transfer regains intensive research interests in recent years. Existing methods mainly rely on a variety of constraints to achieve the pose transfer over 3D meshes, e.g., the need for manually encoding for shape and pose disentanglement. In this paper, we present an unsupervised approach to conduct the pose transfer between any arbitrate given 3D meshes. Specifically, a novel Intrinsic-Extrinsic Preserved Generative Adversarial Network (IEP-GAN) is presented for both intrinsic (i.e., shape) and extrinsic (i.e., pose) information preservation. Extrinsically, we propose a cooccurrence discriminator to capture the structural/pose invariance from distinct Laplacians of the mesh. Meanwhile, intrinsically, a local intrinsic-preserved loss is introduced to preserve the geodesic priors while avoiding heavy computations. At last, we show the possibility of using IEP-GAN to manipulate 3D human meshes in various ways, including pose transfer, identity swapping and pose interpolation with latent code vector arithmetic. The extensive experiments on various 3D datasets of humans, animals and hands qualitatively and quantitatively demonstrate the generality of our approach. Our proposed model produces better results and is substantially more efficient compared to recent state-ofthe- art methods. Code is available: https://github.com/mikecheninoulu/Unsupervised_IEPGAN

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Computer Vision

access here

27/10/2021

Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling?

access here

Author(s)

Nicu Sebe; Wei Wang; Yue Song

Institution

University of Trento;

Abstract

Global Covariance Pooling (GCP) aims at exploiting
the second-order statistics of the convolutional feature. Its
effectiveness has been demonstrated in boosting the classification
performance of Convolutional Neural Networks
(CNNs). Singular Value Decomposition (SVD) is used in
GCP to compute the matrix square root. However, the
approximate matrix square root calculated using Newton-
Schulz iteration [14] outperforms the accurate one computed
via SVD [15]. We empirically analyze the reason
behind the performance gap from the perspectives of data
precision and gradient smoothness. Various remedies for
computing smooth SVD gradients are investigated. Based
on our observation and analyses, a hybrid training protocol
is proposed for SVD-based GCP meta-layers such that
competitive performances can be achieved against Newton-
Schulz iteration. Moreover, we propose a new GCP metalayer
that uses SVD in the forward pass, and Pad´e approximants
in the backward propagation to compute the gradients.
The proposed meta-layer has been integrated into
different CNN models and achieves state-of-the-art performances
on both large-scale and fine-grained datasets.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Computer Vision

access here

25/10/2021

Some like it tough: Improving model generalization via progressively increasing the training difficulty

access here

Author(s)

Hannes Fassold;

Institution

Joanneum Research;

Abstract

In this work, we propose to progressively increase the training difficulty during learning a neural network model via a novel strategy which we call mini-batch trimming. This strategy makes sure that the optimizer puts its focus in the later training stages on the more difficult samples, which we identify as the ones with the highest loss in the current mini-batch. The strategy is very easy to integrate into an existing training pipeline and does not necessitate a change of the network model. Experiments on several image classification problems show that mini-batch trimming is able to increase the generalization ability (measured via final test error) of the trained model.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Advances in Signal Processing and Artificial Intelligen

access here

25/10/2021

Detecting speaking persons in video

access here

Author(s)

Hannes Fassold;

Institution

Joanneum Research;

Abstract

We present a novel method for detecting speaking persons in video, by extracting facial landmarks with a neural network and analysing these landmarks statistically over time.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

25/10/2021

Cascaded Cross MLP-Mixer GANs for Cross-View Image Translation

access here

Author(s)

Bin Ren; Hao Tang; Niculae Sebe

Institution

ETH Zurich; University of Trento;

Abstract

It is hard to generate an image at target view well for previous cross-view image translation methods that directly adopt a simple encoder-decoder or U-Net structure, especially for drastically different views and severe deformation cases. To ease this problem, we propose a novel two-stage framework with a new Cascaded Cross MLPMixer (CrossMLP) sub-network in the first stage and one refined pixel-level loss in the second stage. In the first stage, the CrossMLP sub-network learns the latent transformation cues between image code and semantic map code via our novel CrossMLP blocks. Then the coarse results are generated progressively under the guidance of those cues. Moreover, in the second stage, we design a refined pixel-level loss that eases the noisy semantic label problem with more reasonable regularization in a more compact fashion for better optimization. Extensive experimental results on Dayton [40] and CVUSA [42] datasets show that our method can generate significantly better results than state-of-the-art methods. The source code and trained models are available at https://github.com/Amazingren/CrossMLP.

Access

Open Access

Type of Publication

Conference paper

Publisher

British Machine Vision Conference

access here

25/10/2021

AniFormer: Data-driven 3D Animation with Transformer

access here

Author(s)

Guoying Zhao Hao Tang; Haoyou Chen Niculae Sebe

Institution

ETH Zurich; University of Oulu; University of Trento;

Abstract

We present a novel task, i.e., animating a target 3D object through the motion of a raw driving sequence. In previous works, extra auxiliary correlations between source and target meshes or intermedia factors are inevitable to capture the motions in the driving sequences. Instead, we introduce AniFormer, a novel Transformer-based architecture, that generates animated 3D sequences by directly taking the raw driving sequences and arbitrary same-type target meshes as inputs. Specifically, we customize the Transformer architecture for 3D animation that generates mesh sequences by integrating styles from target meshes and motions from the driving meshes. Besides, instead of the conventional single regression head in the vanilla Transformer, AniFormer generates multiple frames as outputs to preserve the sequential consistency of the generated meshes. To achieve this, we carefully design a pair of regression constraints, i.e., motion and appearance constraints, that can provide strong regularization on the generated mesh sequences. Our AniFormer achieves high-fidelity, realistic, temporally coherent animated results and outperforms compared start-of-the-art methods on benchmarks of diverse categories. Code is available: https://github.com/mikecheninoulu/AniFormer.

Access

Open Access

Type of Publication

Conference paper

Publisher

British Machine Vision Conference

access here

22/10/2021

On Generalizing Permutation-Based Representations for Approximate Search

access here

Author(s)

Claudio Gennaro; Giuseppe Amato; Lucia Vadicamo;

Institution

ISTI-CNR;

Abstract

In the domain of approximate metric search, the Permutation-based Indexing (PBI) approaches have been proved to be particularly suitable for dealing with large data collections. These methods employ a permutation-based representation of the data, which can be efficiently indexed using data structures such as inverted files. In the literature, the definition of the permutation of a metric object was derived by reordering the distances of the object to a set of pivots. In this paper, we aim at generalizing this definition in order to enlarge the class of permutations that can be used by PBI approaches. As a practical outcome, we defined a new type of permutation that is calculated using distances from pairs of pivots. The proposed technique permits us to produce longer permutations than traditional ones for the same number of object-pivot distance calculations. The advantage is that the use of inverted files built on permutation prefixes leads to greater efficiency in the search phase when longer permutations are used.

Access

Open Access

Type of Publication

Preprint

Publisher

International Conference on Similarity Search and Applications

access here

11/10/2021

Partially Fake it Till you Make It: Mixing Real and Fake Thermal Images for Improved Object Detection

access here

Author(s)

Alberto Del Bimbo; Francesco Bongini; Lorenzo Berlincioni; Marco Bertini;

Institution

University of Florence;

Abstract

In this paper we propose a novel data augmentation approach for visual content domains that have scarce training datasets, composit- ing synthetic 3D objects within real scenes. We show the perfor- mance of the proposed system in the context of object detection in thermal videos, a domain where i) training datasets are very limited compared to visible spectrum datasets and ii) creating full realistic synthetic scenes is extremely cumbersome and expensive due to the difficulty in modeling the thermal properties of the materials of the scene. We compare different augmentation strategies, including state of the art approaches obtained through RL techniques, the injection of simulated data and the employment of a generative model, and study how to best combine our proposed augmentation with these other techniques.

Access

Open Access

Type of Publication

Conference paper

Publisher

ACM Multimedia Systems Conference

access here

05/10/2021

WarpedGANSpace: Finding non-linear RBF paths in GAN latent space

access here

Author(s)

Christos Tzelepis; Georgios Tzimiropoulos; Ioannis Patras;

Institution

Queen Mary University of London;

Abstract

This work addresses the problem of discovering, in an unsupervised manner, interpretable paths in the latent space of pretrained GANs, so as to provide an intuitive and easy way of controlling the underlying generative factors. In doing so, it addresses some of the limitations of the state-of-the-art works, namely, a) that they discover directions that are independent of the latent code, i.e., paths that are linear, and b) that their evaluation relies either on visual inspection or on laborious human labeling. More specifically, we propose to learn non-linear warpings on the latent space, each one parametrized by a set of RBF-based latent space warping functions, and where each warping gives rise to a family of non-linear paths via the gradient of the function. Building on the work of Voynov and Babenko, that discovers linear paths, we optimize the trainable parameters of the set of RBFs, so as that images that are generated by codes along different paths, are easily distinguishable by a discriminator network. This leads to easily distinguishable image transformations, such as pose and facial expressions in facial images. We show that linear paths can be derived as a special case of our method, and show experimentally that non-linear paths in the latent space lead to steeper, more disentangled and interpretable changes in the image space than in state-of-the art methods, both qualitatively and quantitatively. We make the code and the pretrained models publicly available at: https://github.com/chi0tzp/WarpedGANSpace.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Computer Vision

access here

01/10/2021

Leader and breakaway detection in racing sports videos

access here

Author(s)

Charalampos Symeonidis; Ioannis Pitas; Sotirios Papadopoulos

Institution

Aristotle University of Thessaloniki;

Abstract

This paper addresses the important problem of leader detection in racing sports videos (e.g., cycling, boating and car racing events), as his/her proper framing is a pivotal issue in racing sports cinematography, where the events have a linear spatial deployment. Over the last few years, as autonomous drone vision and cinematography emerged, new challenges appeared in drone vision. While, until recently, most computer vision methods typically addressed still camera AV footage, drone sports cinematography typically employs moving cameras. In this paper, we solve the problem of leader detection in a group of similarly moving targets in sports videos, e.g. the leader of a sports cyclist group and his/her breakaway during a cycling event. This is very useful in drone sports cinematography, as it is important that the drone camera automatically centers on such a leader. We demonstrate that the novel method described in this paper can effectively solve the problem of leader detection in sports videos.

Access

Open Access

Type of Publication

Conference paper

Publisher

IEEE International Workshop on Multimedia Signal Processing

access here

01/10/2021

Estimating continuous affect with label uncertainty

access here

Author(s)

Christos Tzelepis; Ioannis Patras; Niki Maria Foteinopoulou

Institution

Queen Mary University of London;

Abstract

Continuous affect estimation is a problem where there is an inherent uncertainty and subjectivity in the labels that accompany data samples — typically, datasets use the average of multiple annotations or self-reporting to obtain ground truth labels. In this work, we propose a method for uncertainty-aware continuous affect estimation, that models explicitly the uncertainty of the ground truth label as a uni-variate Gaussian with mean equal to the ground truth label, and unknown variance. For each sample, the proposed neural network estimates not only the value of the target label (valence and arousal in our case), but also the variance. The network is trained with a loss that is defined as the KL-divergence between the estimation (valence/arousal) and the Gaussian around the ground truth. We show that, in two affect recognition problems with real data, the estimated variances are correlated with measures of uncertainty/error in the labels that are extracted by considering multiple annotations of the data.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

01/10/2021

Learning and Reasoning for Cultural Metadata Quality

access here

Author(s)

Anna Bobasheva; Fabien Gandon; Frédéric Precioso;

Institution

Université Côte d'Azur;

Abstract

This work combines semantic reasoning and machine learning to create tools that allow curators of the visual art collections to identify and correct the annotations of the artwork as well as to improve the relevance of the content-based search results in these collections. The research is based on the Joconde database maintained by French Ministry of Culture that contains illustrated artwork records from main French public and private museums representing archeological objects, decorative arts, fine arts, historical and scientific documents, etc. The Joconde database includes semantic metadata that describes properties of the artworks and their content. The developed methods create a data pipeline that processes metadata, trains a Convolutional Neural Network image classification model, makes prediction for the entire collection and expands the metadata to be the base for the SPARQL search queries. We developed a set of such queries to identify noise and silence in the human annotations and to search image content with results ranked according to the relevance of the objects quantified by the prediction score provided by the deep learning model. We also developed methods to discover new contextual relationships between the concepts in the metadata by analyzing the contrast between the concepts similarities in the Joconde’s semantic model and other vocabularies and we tried to improve the model prediction scores based on the semantic relations. Our results show that cross-fertilization between symbolic AI and machine learning can indeed provide the tools to address the challenges of the museum curators work describing the artwork pieces and searching for the relevant images.

Access

Open Access

Type of Publication

Journal article

Publisher

ACM Journal on Computing and Cultural Heritage

access here

30/09/2021

Adversarial unsupervised video summarization augmented with dictionary loss

access here

Author(s)

Ioannis Mademlis; Ioannis Pitas; Michail Kaseris

Institution

Aristotle University of Thessaloniki;

Abstract

Automated unsupervised video summarization by key-frame extraction consists in identifying representative video frames, best abridging a complete input sequence, and temporally ordering them to form a video summary, without relying on manually constructed ground-truth key-frame sets. State-of-the-art unsupervised deep neural approaches consider the desired summary to be a subset of the original sequence, composed of video frames that are sufficient to visually reconstruct the entire input. They typically employ a pre-trained CNN for extracting a vector representation per RGB video frame and a baseline LSTM adversarial learning framework for identifying key-frames. In this paper, to better guide the network towards properly selecting video frames that can faithfully reconstruct the original video, we augment the baseline framework with an additional LSTM autoencoder, which learns in parallel a fixed-length representation of the entire original input sequence. This is exploited during training, where a novel loss term inspired by dictionary learning is added to the network optimization objectives, further biasing key-frame selection towards video frames which are collectively able to recreate the original video. Empirical evaluation on two common public relevant datasets indicates highly favourable results.

Access

Open Access

Type of Publication

Conference paper

Publisher

IEEE International Conference on Image Processing

access here

30/09/2021

Fast Video Visual Quality and Resolution Improvement using SR-UNet

access here

Author(s)

Alberto Del Bimbo; Federico Vaccaro; Marco Bertini; Tiberio Uricchio

Institution

University of Florence;

Abstract

In this paper, we address the problem of real-time video quality enhancement, considering both frame super-resolution and com- pression artifact-removal. The first operation increases the sam- pling resolution of video frames, the second removes visual artifacts such as blurriness, noise, aliasing, or blockiness introduced by lossy compression techniques, such as JPEG encoding for single-images, or H.264/H.265 for video data.

We propose to use SR-UNet, a novel network architecture based on UNet, that has been specialized for fast visual quality improve- ment (i.e. capable of operating in less than 40ms, to be able to operate on videos at 25FPS). We show how this network can be used in a streaming context where the content is generated live, e.g. in video calls, and how it can be optimized when video to be streamed are prepared in advance. The network can be used as a final post processing, to optimize the visual appearance of a frame before showing it to the end-user in a video player. Thus, it can be applied without any change to existing video coding and transmission pipelines.

Access

Open Access

Type of Publication

Paper

Publisher

ACM Multimedia Systems Conference

access here

30/09/2021

Fast Video Visual Quality and Resolution Improvement using SR-UNet

access here

Author(s)

Alberto Del Bimbo; Federico Vaccaro; Marco Bertini; Tiberio Uricchio

Institution

Università degli Studi di Firenze;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

ACM Multimedia Systems Conference

access here

22/09/2021

Layout-to-Image Translation With Double Pooling Generative Adversarial Networks

access here

Author(s)

Hao Tang; Nicu Sebe;

Institution

University of Trento;

Abstract

In this paper, we address the task of layout-to-image translation, which aims to translate an input semantic layout to a realistic image. One open challenge widely observed in existing methods is the lack of effective semantic constraints during the image translation process, leading to models that cannot preserve the semantic information and ignore the semantic dependencies within the same object. To address this issue, we propose a novel Double Pooing GAN (DPGAN) for generating photo-realistic and semantically-consistent results from the input layout. We also propose a novel Double Pooling Module (DPM), which consists of the Square-shape Pooling Module (SPM) and the Rectangle-shape Pooling Module (RPM). Specifically, SPM aims to capture short range semantic dependencies of the input layout with different spatial scales, while RPM aims to capture long-range semantic dependencies from both horizontal and vertical directions. We then effectively fuse both outputs of SPM and RPM to further enlarge the receptive field of our generator. Extensive experiments on five popular datasets show that the proposed DPGAN achieves better results than state-of-the-art methods. Finally, both SPM and SPM are general and can be seamlessly integrated into any GAN-based architectures to strengthen the feature representation. The code is available at https://github.com/Ha0Tang/DPGAN.

Access

Closed Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Image Processing

access here

13/09/2021

MEYE: Web-app for translational and real-time pupillometry

access here

Author(s)

Alessandro Benedetto; Aurelia Viglione; Giulia Ricci; Giulia Sagona; Giuseppe Amato; Leonardo Lupori; Luca Lo Verde; Raffaele Mazziotti; Tommaso Pizzorusso

Institution

IRCCS Stella Maris Foundation; ISTI-CNR; Italy National Research Council; University of Florence; University of Pisa

Abstract

Pupil dynamics alterations have been found in patients affected by a variety of neuropsychiatric conditions, including autism. Studies in mouse models have used pupillometry for phenotypic assessment and as a proxy for arousal. Both in mice and humans, pupillometry is non-invasive and allows for longitudinal experiments supporting temporal specificity, however, its measure requires dedicated setups. Here, we introduce a Convolutional Neural Network that performs online pupillometry in both mice and humans in a web app format. This solution dramatically simplifies the usage of the tool for the non-specialist and non-technical operators. Because a modern web browser is the only software requirement, this choice is of great interest given its easy deployment and set-up time reduction. The tested model performances indicate that the tool is sensitive enough to detect both locomotor-induced and stimulus-evoked pupillary changes, and its output is comparable with state-of-the-art commercial devices.

Access

Open Access

Type of Publication

Journal article

Publisher

eNeuro

access here

04/09/2021

Explainable deep learning for efficient and robust pattern recognition: A survey of recent developments

access here

Author(s)

Been Kim; Jingkuan Song; Niculae Sebe Qiang Liu; Xiang Wang; Xianglong Liu; Xiao Bai

Institution

Beihang University; Google USA; University of Texas; University of Trento;

Abstract

Deep learning has recently achieved great success in many visual recognition tasks. However, the deep neural networks (DNNs) are often perceived as black-boxes, making their decision less understandable to humans and prohibiting their usage in safety-critical applications. This guest editorial introduces the thirty papers accepted for the Special Issue on Explainable Deep Learning for Efficient and Robust Pat- tern Recognition. They are grouped into three main categories: explainable deep learning methods, effi- cient deep learning via model compression and acceleration, as well as robustness and stability in deep learning. For each of the three topics, a survey of the representative works and latest developments is presented, followed by the brief introduction of the accepted papers belonging to this topic. The special issue should be of high relevance to the reader interested in explainable deep learning methods for ef- ficient and robust pattern recognition applications and it helps promoting the future research directions in this field.

Access

Closed Access

Type of Publication

Journal article

Publisher

International Conference on Pattern Recognition

access here

24/08/2021

Inplace knowledge distillation with teacher assistant for improved training of flexible deep neural networks

access here

Author(s)

Alexey Ozerov; Ngoc Q. K. Duong

Institution

InterDigital

Abstract

Deep neural networks (DNNs) have achieved great success in various machine learning tasks. However, most existing powerful DNN models are computationally expensive and memory demanding, hindering their deployment in devices with low memory and computational resources or in applications with strict latency requirements. Thus, several resource-adaptable or flexible approaches were recently proposed that train at the same time a big model and several resource-specific sub-models. Inplace knowledge distillation (IPKD) became a popular method to train those models and consists in distilling the knowledge from a larger model (teacher) to all other sub-models (students). In this work a novel generic training method called IPKD with teacher assistant (IPKD-TA) is introduced, where sub-models themselves become teacher assistants teaching smaller sub-models. We evaluated the proposed IPKD-TA training method using two state-of-the-art flexible models (MSDNet and Slimmable MobileNet-V1) with two popular image classification benchmarks (CIFAR-10 and CIFAR-100). Our results demonstrate that the IPKD-TA is on par with the existing state of the art while improving it in most cases.

Access

Open Access

Type of Publication

Conference paper

Publisher

European Signal Processing Conference

access here

13/08/2021

Hebbian semi-supervised learning in a sample efficiency setting

access here

Author(s)

Claudio Gennaro; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;

Institution

ISTI-CNR;

Abstract

We propose to address the issue of sample efficiency, in Deep Convolutional Neural Networks (DCNN), with a semi-supervised training strategy that combines Hebbian learning with gradient descent: all internal layers (both convolutional and fully connected) are pre-trained using an unsupervised approach based on Hebbian learning, and the last fully connected layer (the classification layer) is trained using Stochastic Gradient Descent (SGD). In fact, as Hebbian learning is an unsupervised learning method, its potential lies in the possibility of training the internal layers of a DCNN without labels. Only the final fully connected layer has to be trained with labeled examples. We performed experiments on various object recognition datasets, in different regimes of sample efficiency, comparing our semi-supervised (Hebbian for internal layers + SGD for the final fully connected layer) approach with end-to-end supervised backprop training, and with semi-supervised learning based on Variational Auto-Encoder (VAE). The results show that, in regimes where the number of available labeled samples is low, our semi-supervised approach outperforms the other approaches in almost all the cases.

Access

Open Access

Type of Publication

Journal article

Publisher

Neural Networks

access here

05/08/2021

AIMH at SemEval-2021 – Task 6: multimodal classification using an ensemble of transformer models

access here

Author(s)

Claudio Gennaro; Fabrizio Falchi; Giuseppe Amato; Nicola Messina;

Institution

ISTI-CNR;

Abstract

This paper describes the system used by the AIMH Team to approach the SemEval Task 6. We propose an approach that relies on an architecture based on the transformer model to process multimodal content (text and images) in memes. Our architecture, called DVTT (Double Visual Textual Transformer), approaches Subtasks 1 and 3 of Task 6 as multi-label classification problems, where the text and/or images of the meme are processed, and the probabilities of the presence of each possible persuasion technique are returned as a result. DVTT uses two complete networks of transformers that work on text and images that are mutually conditioned. One of the two modalities acts as the main one and the second one intervenes to enrich the first one, thus obtaining two distinct ways of operation. The two transformers outputs are merged by averaging the inferred probabilities for each possible label, and the overall network is trained end-to-end with a binary cross-entropy loss.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Workshop on Semantic Evaluation

access here

28/07/2021

Curriculum self-paced learning for cross-domain object detection

access here

Author(s)

Niculae Sebe Paolo Rota; Petru Soviany; Radu Tudor Ionescu

Institution

University of Trento;

Abstract

Training (source) domain bias affects state-of-the-art object detectors, such as Faster R-CNN, when applied to new (target) domains. To alleviate this problem, researchers proposed various domain adaptation methods to improve object detection results in the cross-domain setting, e.g. by translating images with ground-truth labels from the source domain to the target domain using Cycle-GAN. On top of combining Cycle-GAN transformations and self-paced learning in a smart and efficient way, in this paper, we propose a novel self-paced algorithm that learns from easy to hard. Our method is simple and effective, without any overhead during inference. It uses only pseudo-labels for samples taken from the target domain, i.e. the domain adaptation is unsupervised. We conduct experiments on four cross-domain benchmarks, showing better results than the state of the art. We also perform an ablation study demonstrating the utility of each component in our framework. Additionally, we study the applicability of our framework to other object detectors. Furthermore, we compare our difficulty measure with other measures from the related literature, proving that it yields superior results and that it correlates well with the performance metric.

Access

Closed Access

Type of Publication

Journal article

Publisher

N/A

access here

26/07/2021

Hypespherical class prototypes for adversarial robustness

access here

Author(s)

Ioannis Pitas; Vasileios Mygdalis

Institution

Aristotle University of Thessaloniki;

Abstract

This work addresses the problem of adversarial robustness in deep neural network classification from an optimal class boundary estimation perspective. It is argued that increased model robustness to adversarial attacks can be achieved when the feature learning process is monitored by geometrically-inspired optimization criteria. To this end, we propose to learn hyperspherical class prototypes in the neural feature embedding space, along with training the network parameters. Three concurrent optimization functions for the intermediate hidden layer training data activations are devised, requiring items of the same class to be enclosed by the corresponding class prototype boundaries, to have minimum distance from their class prototype vector (i.e., hypersphere center) and to have maximum distance from the remainder hypersphere centers. Our experiments show that training standard classification model architectures with the proposed objectives, significantly increases their robustness to white-box adversarial attacks, without adverse (if not beneficial) effects to their classification accuracy.

Access

Open Access

Type of Publication

Report

Publisher

N/A

access here

26/07/2021

Introducing K-anonymity principles to adversarial attacks for privacy protection in image classification problems

access here

Author(s)

Anastasios Tefas; Ioannis Pitas; Vasileios Mygdalis

Institution

Aristotle University of Thessaloniki;

Abstract

The network output activation values for a given input can be employed to produce a sorted ranking. Adversarial attacks typically generate the least amount of perturbation required to change the classifier label. In that sense, generated adversarial attack perturbation only affects the output in the 1st sorted ranking position. We argue that meaningful information about the adversarial examples i.e., their original labels, is still encoded in the network output ranking and could potentially be extracted, using rule-based reasoning. To this end, we introduce a novel adversarial attack methodology inspired by the K-anonymity principles, that generates adversarial examples that are not only misclassified by the neural network classifier, but are uniformly spread along K different positions in the output sorted ranking. In order to regulate the introduced perturbation that arises from the strength of the proposed optimization objectives, an additional visual similarity-based loss function is introduced as well, guiding the adversarial examples towards directions maintaining visual similarity according the same objective metric, such as the CW-SSIM. Experimental results denote that the proposed approach achieves the optimization goals inspired by K-anonymity, while introducing reduced perturbation as well.

Access

Open Access

Type of Publication

Report

Publisher

N/A

access here

22/07/2021

Exploring Deep Fusion Ensembling for Automatic Visual Interestingness Prediction

access here

Author(s)

Bogdan Ionescu; Liviu-Daniel Stefan; Mihai Gabriel Constantin

Institution

University Politehnica of Bucharest

Abstract

In the context of the ever growing quantity of multimedia content from social, news and educational platforms, generating meaningful recommendations and ratings now requires a more advanced understanding of their impact on the user, such as their subjective perception. One of the important subjective concepts explored by researchers is visual interestingness. While several definitions of this concept are given in the current literature, in a broader sense, this property attempts to measure the ability of audio-visual data to capture and keep the viewer’s attention for longer periods of time. While many computer vision and machine learning methods have been tested for predicting media interestingness, overall, due to the heavily subjective nature of interestingness, the precision of the results is relatively low. In this chapter, we investigate several methods that address this problem from a different angle. We first review the literature on interestingness prediction and present an overview of the traditional fusion mechanisms, such as statistical fusion, weighted approaches, boosting, random forests or randomized trees. Further, we explore the possibility of employing a stronger, novel deep learning-based, system fusion for enhancing the performance. We investigate several types of deep networks for creating the fusion systems, including dense, attention, convolutional and cross-space-fusion networks, while also proposing some input decoration methods that help these networks achieve optimal performance.We present the results, as well as an analysis of the correlation between network structure and overall system performance. Experimental validation is carried out on a publicly available data set and on the systems benchmarked during the 2017 MediaEval Predicting Media Interestingness task.

Access

Open Access

Type of Publication

Section

Publisher

N/A

access here

18/07/2021

Whitening for Self-Supervised Representation Learning

access here

Author(s)

Alexandr Ermolov; Aliaksandr Siarohin; Enver Sangineto; Niculae Sebe

Institution

University of Trento;

Abstract

Most of the current self-supervised representation learning (SSL) methods are based on the contrastive
loss and the instance-discrimination task, where augmented versions of the same image instance (“positives”) are contrasted with instances extracted from other images (“negatives”). For the
learning to be effective, many negatives should be compared with a positive pair, which is computationally
demanding. In this paper, we propose a different direction and a new loss function for SSL, which is based on the whitening of the latentspace features. The whitening operation has a “scattering” effect on the batch samples, avoiding degenerate solutions where all the sample representations collapse to a single point. Our solution does not require asymmetric networks and it is conceptually simple. Moreover, since negatives are not needed, we can extract multiple positive pairs from the same image instance. The source code of the method and of all the experiments is available at: https://github.com/htdt/self-supervised.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on Machine Learning

access here

14/07/2021

Monte Carlo Elites: Quality-Diversity Selection as a Multi-Armed Bandit Problem

access here

Author(s)

Antonios Liapis; Georgios N. Yannakakis; Konstantinos Sfikas;

Institution

University of Malta

Abstract

A core challenge of evolutionary search is the need to balance between exploration of the search space and exploitation of highly fit regions. Quality-diversity search has explicitly walked this tightrope between a population’s diversity and its quality. This paper extends a popular quality-diversity search algorithm, MAP-Elites, by treating the selection of parents as a multi-armed bandit problem. Using variations of the upper-confidence bound to select parents from under-explored but potentially rewarding areas of the search space can accelerate the discovery of new regions as well as improve its archive’s total quality. The paper tests an indirect measure of quality for parent selection: the survival rate of a parent’s offspring. Results show that maintaining a balance between exploration and exploitation leads to the most diverse and high-quality set of solutions in three different testbeds.

Access

Open Access

Type of Publication

Conference paper

Publisher

Genetic and Evolutionary Computation Conference

access here

07/07/2021

The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval

access here

Author(s)

Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Franca Debole; Giuseppe Amato; Lucia Vadicamo; Paolo Bolettieri;

Institution

ISTI-CNR;

Abstract

This paper describes in detail VISIONE, a video search system that allows users to search for videos using textual keywords, the occurrence of objects and their spatial relationships, the occurrence of colors and their spatial relationships, and image similarity. These modalities can be combined together to express complex queries and meet users’ needs. The peculiarity of our approach is that we encode all information extracted from the keyframes, such as visual deep features, tags, color and object locations, using a convenient textual encoding that is indexed in a single text retrieval engine. This offers great flexibility when results corresponding to various parts of the query (visual, text and locations) need to be merged. In addition, we report an extensive analysis of the retrieval performance of the system, using the query logs generated during the Video Browser Showdown (VBS) 2019 competition. This allowed us to fine-tune the system by choosing the optimal parameters and strategies from those we tested.

Access

Open Access

Type of Publication

Journal article

Publisher

Journal of Imaging

access here

07/07/2021

VISIONE at Video Browser Showdown 2021

access here

Author(s)

Claudio Gennaro; Claudio Vairo; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;

Institution

ISTI-CNR;

Abstract

This paper presents the second release of VISIONE, a tool for effective video search on large-scale collections. It allows users to search for videos using textual descriptions, keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial re- lationships, and image similarity. One of the main features of our system is that it employs specially designed textual encodings for indexing and searching video content using the mature and scalable Apache Lucene full-text search engine.

Access

Open Access

Type of Publication

Conference paper

Publisher

MultiMedia Modeling

access here

02/07/2021

One qubit as a Universal Approximant

access here

Author(s)

Adrián Pérez-Salinas Artur Garcia-Saez David López-Núñez José I. Latorre P. Forn-Díaz

Institution

Barcelona Institute of Science and Technology Barcelona Supercomputing Center; Center fo Quantum Technologies Qilimanjaro Quantum Tech Quantum Research Centre Universidad de Barcelona

Abstract

A single-qubit circuit can approximate any bounded complex function stored in the degrees of freedom defining its quantum gates. The single-qubit approximant presented in this work is operated through a series of gates that take as their parameterization the independent variable of the target function and an additional set of adjustable parameters. The independent variable is re-uploaded in every gate while the parameters are optimized for each target function. The output state of this quantum circuit becomes more accurate as the number of re-uploadings of the independent variable increases, i. e., as more layers of gates parameterized with the independent variable are applied. In this work, we provide two different proofs of this claim related to both Fourier series and the Universal Approximation Theorem for Neural Networks, and we benchmark both methods against their classical counterparts. We further implement a single-qubit approximant in a real superconducting qubit device, demonstrating how the ability to describe a set of functions improves with the depth of the quantum circuit. This work shows the robustness of the re-uploading technique on Quantum Machine Learning.

Access

Open Access

Type of Publication

Publication

Publisher

N/A

access here

26/06/2021

Pairwise Ranking Network for Affect Recognition

access here

Author(s)

Georgios Zoumpourlis; Ioannis Patras;

Institution

Queen Mary University of London;

Abstract

In this work we study the problem of emotion recognition under the prism of preference learning. Affective datasets are typically annotated by assigning a single absolute label, i.e. a numerical value that describes the intensity of an emotional attribute, to each sample. Then, the majority of existing works on affect recognition employ sample-wise classification/regression methods to predict affective states, using those annotations. We take a different approach and use a deep network architecture that performs joint training on the tasks of classification/regression of samples and ordinal ranking between pairs of samples. By treating input samples in a pairwise manner, we leverage the auxiliary task of inferring the ordinal relation between their corresponding affective states. Incorporating the ranking objective allows capturing the inherently ordinal structure of emotions and learning the inter-sample relations, resulting in better generalization. Our method is incorporated into existing affect recognition architectures and evaluated on datasets of electroencephalograms (EEG) and images. We show that the approach proposed in this work leads to consistent performance gains when incorporated in classification/regression networks.

Access

Open Access

Type of Publication

Conference paper

Publisher

International Conference on ACII

access here

26/06/2021

Total Generate: Cycle in Cycle Generative Adversarial Networks for Generating Human Faces, Hands, Bodies, and Natural Scenes

access here

Author(s)

Hao Tang; Nicu Sebe;

Institution

University of Trento;

Abstract

We propose a novel and unified Cycle in Cycle Generative Adversarial Network (C2GAN) for generating human faces, hands, bodies, and natural scenes. Our proposed C2GAN is a cross-modal model exploring the joint exploitation of the input image data and guidance data in an interactive manner. C2GAN contains two different generators, i.e., an image-generation generator and a guidance-generation generator. Both generators are mutually connected and trained in an end-to-end fashion and explicitly form three cycled subnets, i.e., one image generation cycle and two guidance generation cycles. Each cycle aims at reconstructing the input domain and simultaneously produces a useful output involved in the generation of another cycle. In this way, the cycles constrain each other implicitly providing complementary information from both image and guidance modalities and bringing an extra supervision gradient across the cycles, facilitating a more robust optimization of the whole model. Extensive results on four guided image-to-image translation subtasks demonstrate that the proposed C2GAN is effective in generating more realistic images compared with state-of-the-art models.

Access

Closed Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Multimedia

access here

23/06/2021

Learning to Attack Real-World Models for Person Re-identification via Virtual-Guided Meta-Learning

access here

Author(s)

Fengxiang Yang; Hong Liu; Nicu Sebe; Shaozi Li; Shin'ichi Satoh; Zheng Wang; Zhiming Luo; Zhun Zhong

Institution

National Institute of Informatics of Tokyo; University of Tokyo; University of Trento; Xiamen University

Abstract

Recent advances in person re-identification (re-ID) have led to impressive retrieval accuracy. However, existing re-ID models are challenged by the adversarial examples crafted by adding quasi-imperceptible perturbations. Moreover, re-ID systems face the domain shift issue that training and testing domains are not consistent. In this study, we argue that learning powerful attackers with high universality that works well on unseen domains is an important step in promoting the robustness of re-ID systems. Therefore, we introduce a novel universal attack algorithm called “MetaAttack” for person re-ID. MetaAttack can mislead re-ID models on unseen domains by a universal adversarial perturbation. Specifically, to capture common patterns across different domains, we propose a meta-learning scheme to seek the universal perturbation via the gradient interaction between meta-train and meta-test formed by two datasets. We also take advantage of a virtual dataset (PersonX), instead of real ones, to conduct meta-test. This scheme not only enables us to learn with more comprehensive variation factors but also mitigates the negative effects caused by biased factors of real datasets. Experiments on three large-scale re-ID datasets demonstrate the effectiveness of our method in attacking re-ID models on unseen domains. Our final visualization results reveal some new properties of existing re-ID systems, which can guide us in designing a more robust re-ID model. Code and supplemental material are available at https://github.com/FlyingRoastDuck/MetaAttack AAAI21.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Artificial Intelligence

access here

22/06/2021

Deepfake Video Detection with Facial Features and Long-Short Term Memory Deep Networks

access here

Author(s)

Bogdan Ionescu; Cristian Stanciu;

Institution

University Politehnica of Bucharest

Abstract

Generative models have evolved immensely in the last few years. GAN-based video and image generation has become very accessible due to open source software available to anyone, and that may pose a threat to society. Deepfakes can be used to intimidate, blackmail certain public figures or to mislead the public. At the same time, with the rising popularity of deepfakes, detection algorithms have also evolved significantly. The majority of those algorithms focus on images rather than to explore the temporal evolution in the video. In this paper, we explore whether the temporal information of the video can be used to increase the performance of state-of-the-art deepfake detection algorithms. We also investigate whether certain facial regions contain more information about the authenticity of the video by using the entire aligned face as input for our model and by only selecting certain facial regions. We use late fusion to combine those results for increased performance. To validate our solution, we experiment on 2 state-of-the-art datasets, namely FaceForensics++ and CelebDF. The results show that using the temporal dimension can greatly enhance the performance of a deep learning model.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

22/06/2021

Smoothing the Disentangled Latent Style Space for Unsupervised Image-to-Image Translation

access here

Author(s)

Bruno Lepri; Enver Sangineto; Haoxian Zhang; Linchao Bao; Marco de Nadai; Nicu Sebe; Wei Wang; Yahui Liu Yajing Chen;

Institution

Fondazione Bruno Kessler; Tencent AI Lab University of Trento;

Abstract

Image-to-Image (I2I) multi-domain translation models are usually evaluated also using the quality of their semantic interpolation results. However, state-of-the-art models frequently show abrupt changes in the image appearance during interpolation, and usually perform poorly in interpolations across domains. In this paper, we propose a new training protocol based on three specific losses which help a translation network to learn a smooth and disentangled latent style space in which: 1) Both intra- and inter-domain interpolations correspond to gradual changes in the generated images and 2) The content of the source image is better preserved during the translation. Moreover, we propose a novel evaluation metric to properly measure the smoothness of latent style space of I2I translation models. The proposed method can be plugged in existing translation approaches, and our extensive experiments on different datasets show that it can significantly boost the quality of the generated images and the graduality of the interpolations.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

22/06/2021

Neighborhood Contrastive Learning for Novel Class Discovery

access here

Author(s)

Elisa Ricci; Enrico Fini; Nicu Sebe; Subhankar Roy; Zhiming Luo; Zhun Zhong

Institution

Fondazione Bruno Kessler; University of Trento; Xiamen University

Abstract

In this paper, we address Novel Class Discovery (NCD), the task of unveiling new classes in a set of unlabeled samples given a labeled dataset with known classes. We exploit the peculiarities of NCD to build a new framework, named Neighborhood Contrastive Learning (NCL), to learn discriminative representations that are important to clustering performance. Our contribution is twofold. First, we find that a feature extractor trained on the labeled set generates representations in which a generic query sample and
its neighbors are likely to share the same class. We exploit this observation to retrieve and aggregate pseudo-positive pairs with contrastive learning, thus encouraging the model to learn more discriminative representations. Second, we notice that most of the instances are easily discriminated by the network, contributing less to the contrastive loss. To overcome this issue, we propose to generate hard negatives by mixing labeled and unlabeled samples in the feature space. We experimentally demonstrate that these two ingredients significantly contribute to clustering performance and lead our model to outperform state-of-the-art methods by a large margin (e.g., clustering accuracy +13% on CIFAR-100 and +8% on ImageNet).

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

22/06/2021

OpenMix: Reviving Known Knowledge for Discovering Novel Visual Categories in an Open World

access here

Author(s)

Linchao Zhu; Nicu Sebe; Shaozi Li; Yi Yang; Zhiming Luo; Zhun Zhong

Institution

University of Technology Sydney; University of Trento; Xiamen University

Abstract

In this paper, we tackle the problem of discovering new classes in unlabeled visual data given labeled data from disjoint classes. Existing methods typically first pre-train a model with labeled data, and then identify new classes in unlabeled data via unsupervised clustering. However, the labeled data that provide essential knowledge are often underexplored in the second step. The challenge is that the labeled and unlabeled examples are from non-overlapping classes, which makes it difficult to build a learning relationship between them. In this work, we introduce OpenMix to mix the unlabeled examples from an open set and the labeled examples from known classes, where their nonoverlapping labels and pseudo-labels are simultaneously mixed into a joint label distribution. OpenMix dynamically compounds examples in two ways. First, we produce mixed training images by incorporating labeled examples with unlabeled examples. With the benefit of unique prior knowledge in novel class discovery, the generated pseudo-labels will be more credible than the original unlabeled predictions. As a result, OpenMix helps preventing the model from overfitting on unlabeled samples that may be assigned with wrong pseudo-labels. Second, the first way encourages the unlabeled examples with high class-probabilities to have considerable accuracy. We introduce these examples as reliable anchors and further integrate them with unlabeled samples. This enables us to generate more combinations in unlabeled examples and exploit finer object relations among the new classes. Experiments on three classification datasets demonstrate the effectiveness of the proposed OpenMix, which is superior to state-of-the-art methods in novel class discovery.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

22/06/2021

Joint Noise-Tolerant Learning and Meta Camera Shift Adaptation for Unsupervised Person Re-Identification

access here

Author(s)

Fengxiang Yang; Nicu Sebe; Shaozi Li; Yaojing Lin; Yuanzheng Cai; Zhiming Luo; Zhun Zhong

Institution

Minjaing University; Minnan Normal University; University of Trento; Xiamen University

Abstract

This paper considers the problem of unsupervised person re-identification (re-ID), which aims to learn discriminative models with unlabeled data. One popular method is to obtain pseudo-label by clustering and use them to optimize the model. Although this kind of approach has shown promising accuracy, it is hampered by 1) noisy labels produced by clustering and 2) feature variations caused by camera shift. The former will lead to incorrect optimization and thus hinders the model accuracy. The latter will result in assigning the intra-class samples of different cameras to different pseudo-label, making the model sensitive to camera variations. In this paper, we propose a unified framework to solve both problems. Concretely, we propose a Dynamic and Symmetric Cross Entropy loss (DSCE) to deal with noisy samples and a camera-aware meta-learning algorithm (MetaCam) to adapt camera shift. DSCE can alleviate the negative effects of noisy samples and accommodate the change of clusters after each clustering step. MetaCam simulates cross-camera constraint by splitting the training data into meta-train and meta-test based on camera IDs.
With the interacted gradient from meta-train and meta-test, the model is enforced to learn camera-invariant features. Extensive experiments on three re-ID benchmarks show the effectiveness and the complementary of the proposed DSCE and MetaCam. Our method outperforms the state-of-the-art methods on both fully unsupervised re-ID and unsupervised domain adaptive re-ID.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

22/06/2021

The AI4Media project: Use of Next-generation Artificial Intelligence Technologies for Media Sector Applications

access here

Author(s)

Alexey Ozerov; Antonios Liapis; Artur Garcia-Saez; Birgit Gray; Danae Tsabouraki; Daniele Gravina; Filareti Tsalakanidou; Francois Schnitzler; Fulvio Negro; Georgi Kostadinov; Georgios N. Yannakakis; Ioannis Kompatsiaris; Jesse de Vos; Maritini Kalogerini; Maurizio Montagnuolo; Philo van Kemenade; Rémi Mignot; Symeon Papadopoulos Vasileios Mezaris

Institution

Athens Technology Center; Barcelona Supercomputing Center; CERTH - Center for Research and Technology Hellas Deutsche Welle; Imagga Technologies Lda.; InterDigital IRCAM; Modl.ai; Netherlands Institute for Sound & Vision RAI; University of Malta

Abstract

Artificial Intelligence brings exciting innovations in all aspects of life and creates new opportunities across industry sectors. At the same time, it raises significant questions in terms of trust, ethics, and accountability. This paper offers an introduction to the AI4Media project, which aims to build on recent advances of AI in order to offer innovative tools to the media sector. AI4Media unifies the fragmented landscape of media-related AI technologies by investigating new learning paradigms and distributed AI, exploring issues of AI explainability, robustness and privacy, examining AI techniques for content analysis, and exploiting AI to address major societal challenges. In this paper, we focus on our vision of how such AI technologies can reshape the media sector, by discussing seven industrial use cases that range from combating disinformation in social media and supporting journalists for news story creation, to high quality video production, game design, and artistic co-creation. For each of these use cases, we highlight the present challenges and needs, and explain how they can be efficiently addressed by using innovative AI-driven solutions.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

21/06/2021

Hateful meme detection with multimodal deep neural networks

access here

Author(s)

Bogdan Ionescu; Cristian Stanciu; Dan-Ștefan Pârvu Denisa Ionaşcu; Mihai Gabriel Constantin

Institution

University Politehnica of Bucharest

Abstract

The modern advances of social media platforms and content sharing websites led to the popularization of Internet memes, and today’s Internet landscape contains websites that are predominantly dedicated to meme sharing. While at their inception memes were mostly humorous, this concept evolved and nowadays memes cover a wide variety of subjects, including political and social commentaries. Considering the widespread use of memes and their power of conveying distilled messages, they became an important method for spreading hate speech against individuals or targeted groups. Given the multimodal nature of Internet memes, our proposed approach is also a multimodal one, consisting of two parallel processing branches, one textual and one visual, that are joined in a final classification step, providing prediction results for the samples. We test our approach on the publicly available Memotion 7k dataset and compare our results with the baseline approach developed for the dataset.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

16/06/2021

ObjectGraphs: Using Objects and a Graph Convolutional Network for the Bottom-up Recognition and Explanation of Events in Video

access here

Author(s)

Andreas Goulas; Damianos Galanopoulos; Nikolaos Gkalelis; Vasileios Mezaris

Institution

CERTH - Center for Research and Technology Hellas

Abstract

In this paper a novel bottom-up video event recognition approach is proposed, ObjectGraphs, which utilizes a rich frame representation and the relations between objects within each frame. Following the application of an object detector (OD) on the frames, graphs are used to model the object relations and a graph convolutional network (GCN) is utilized to perform reasoning on the graphs. The resulting object-based frame-level features are then forwarded to a long short-term memory (LSTM) network for video event recognition. Moreover, the weighted in-degrees (WiDs) derived from the graph’s adjacency matrix at frame level are used for identifying the objects that were considered most (or least) salient for event recognition and contributed the most (or least) to the final event recognition decision, thus providing an explanation for the latter. The experimental results show that the proposed method achieves state-of-the-art performance on the publicly available FCVID and YLI-MED datasets. Source code for our ObjectGraphs method is made publicly available at: https://github.com/bmezaris/ObjectGraphs

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

05/06/2021

Playable Video Generation

access here

Author(s)

Aliaksandr Siarohin; Elisa Ricci; Sergey Tulyakov; Stéphane Lathuilière; Willi Menapace;

Institution

Fondazione Bruno Kessler; Institut Polytechnique de Paris; Snap Inc.; University of Trento;

Abstract

This paper introduces the unsupervised learning problem of playable video generation (PVG). In PVG, we aim at allowing a user to control the generated video by selecting a discrete action at every time step as when playing a video game. The difficulty of the task lies both in learning semantically consistent actions and in generating realistic videos conditioned on the user input. We propose a novel framework for PVG that is trained in a self-supervised manner on a large dataset of unlabelled videos. We employ an encoder-decoder architecture where the predicted action labels act as bottleneck. The network is constrained to learn a rich action space using, as main driving loss, a reconstruction loss on the generated video. We demonstrate the effectiveness of the proposed approach on several datasets with wide environment variety. Further details, code and examples are available on our project page: willimenapace.github.io/playable-video-generation-website

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

14/05/2021

Challenges for Automatic Detection of Fake News Related to Migration

access here

Author(s)

Dorothea Thomas-Aniola; Georg Thallinger; Gerhard Backfried; Werner Bailer;

Institution

HENSOLDT Analytics; Joanneum Research;

Abstract

Fake news and misinformation is a widespread phenomenon these days, affecting social media, alternative and traditional media. In a climate of increasing polarization and perceived societal injustice, the topic of migration is one domain that is frequently the target of fake news, addressing both migrants and citizens in host countries. The problem is inherently a multi-lingual and multi-modal one in that it involves information in an array of languages, material in textual, visual and auditory form and often involves communication in a language which may be unfamiliar to recipients or which these recipients only may have basic knowledge of. We argue that semi-automatic approaches, empowering users to gain a clearer picture and base their decisions on sound information, are needed to counter the problem of misinformation. In order to deal with the scale of the problem, such approaches involve a variety of technologies from the field of Artificial Intelligence (AI). In this paper we identify a number of challenges related to implementing approaches for the detection of fake news in the context of migration. These include collecting multi-lingual and multi-modal datasets related to the migration domain, providing explanations of AI tools used in verification to both media professionals and consumers. Further efforts in truly collaborative AI will be needed.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

13/05/2021

TweepFake: About detecting deepfake tweets

access here

Author(s)

Antonio Martella; Fabrizio Falchi; Margherita Gambini; Maurizio Tesconi; Tiziano Fagni;

Institution

ISTI-CNR; University of Trento;

Abstract

The recent advances in language modeling significantly improved the generative capabilities of deep neural models: in 2019 OpenAI released GPT-2, a pre-trained language model that can autonomously generate coherent, non-trivial and human-like text samples. Since then, ever more powerful text generative models have been developed. Adversaries can exploit these tremendous generative capabilities to enhance social bots that will have the ability to write plausible deepfake messages, hoping to contaminate public debate. To prevent this, it is crucial to develop deepfake social media messages detection systems. However, to the best of our knowledge no one has ever addressed the detection of machine-generated texts on social networks like Twitter or Facebook. With the aim of helping the research in this detection field, we collected the first dataset of real deepfake tweets, TweepFake. It is real in the sense that each deepfake tweet was actually posted on Twitter. We collected tweets from a total of 23 bots, imitating 17 human accounts. The bots are based on various generation techniques, i.e., Markov Chains, RNN, RNN+Markov, LSTM, GPT-2. We also randomly selected tweets from the humans imitated by the bots to have an overall balanced dataset of 25,572 tweets (half human and half bots generated). The dataset is publicly available on Kaggle. Lastly, we evaluated 13 deepfake text detection methods (based on various state-of-the-art approaches) to both demonstrate the challenges that Tweepfake poses and create a solid baseline of detection techniques. We hope that TweepFake can offer the opportunity to tackle the deepfake detection on social media messages as well.

Access

Open Access

Type of Publication

Journal article

Publisher

PLOS ONE

access here

07/05/2021

A network view on reliability: using machine learning to understand how we assess news websites

access here

Author(s)

Tobias Blanke; Tommaso Venturini;

Institution

Center for Internet and Society of Paris; University of Amsterdam;

Abstract

This article shows how a machine can employ a network view to reason about complex social relations of news reliability. Such a network view promises a topic-agnostic perspective that can be a useful hint on reliability trends and their heterogeneous assumptions. In our analysis, we depart from the ever-growing numbers of papers trying to find machine learning algorithms to predict the reliability of news and focus instead on using machine reasoning to understand the structure of news networks by comparing it with our human judgements. Understanding and representing news networks is not easy, not only because they can be extremely vast but also because they are shaped by several overlapping network dynamics. We present a machine learning approach to analyse what constitutes reliable news from the view of a network. Our aim is to machine-read a network’s understanding of news reliability. To analyse real-life news sites, we used the Décodex dataset to train machine learning models from the structure of the underlying network. We then employ the models to draw conclusions how the Décodex evaluators came to assess the reliability of news.

Access

Open Access

Type of Publication

Journal article

Publisher

Journal of Computational Social Science

access here

04/05/2021

Assessing pattern recognition performance of neuronal cultures through accurate simulation

access here

Author(s)

Claudio Gennaro; Fabrizio Falchi; Federico Cremisi; Gabriele Lagani; Giuseppe Amato; Marco Cicchini Guido; Raffaele Mazziotti; Tommaso Pizzorusso

Institution

ISTI-CNR;

Abstract

Previous work has shown that it is possible to train neuronal cultures on Multi-Electrode Arrays (MEAs), to recognize very simple patterns. However, this work was mainly focused to demonstrate that it is possible to induce plasticity in cultures, rather than performing a rigorous assessment of their pattern recognition performance. In this paper, we address this gap by developing a methodology that allows us to assess the performance of neuronal cultures on a learning task. Specifically, we propose a digital model of the real cultured neuronal networks; we identify biologically plausible simulation parameters that allow us to reliably reproduce the behavior of real cultures; we use the simulated culture to perform handwritten digit recognition and rigorously evaluate its performance; we also show that it is possible to find improved simulation parameters for the specific task, which can guide the creation of real cultures.

Access

Open Access

Type of Publication

Conference paper

Publisher

Conference on Neural Engineering

access here

03/05/2021

Few-Shot Bayesian Optimization with Deep Kernel Surrogates

access here

Author(s)

Josif Grabocka; Martin Wistuba;

Institution

IBM Research; University of Freiburg;

Abstract

Hyperparameter optimization (HPO) is a central pillar in the automation of machine learning solutions and is mainly performed via Bayesian optimization, where a parametric surrogate is learned to approximate the black box response function (e.g. validation error). Unfortunately, evaluating the response function is computationally intensive. As a remedy, earlier work emphasizes the need for transfer learning surrogates which learn to optimize hyperparameters for an algorithm from other tasks. In contrast to previous work, we propose to rethink HPO as a few-shot learning problem in which we train a shared deep surrogate model to quickly adapt (with few response evaluations) to the response function of a new task. We propose the use of a deep kernel network for a Gaussian process surrogate that is meta-learned in an end-to-end fashion in order to jointly approximate the response functions of a collection of training data sets. As a result, the novel few-shot optimization of our deep kernel surrogate leads to new state-of-the-art results at HPO compared to several recent methods on diverse metadata sets.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

03/05/2021

SkipW: Resource Adaptable RNN with Strict Upper Computational Limit

access here

Author(s)

Anne Lambert; Françoise Le Bolzer; Pascal Le Guyadee; Tsiry Mayet;

Institution

InterDigital

Abstract

We introduce Skip-Window, a method to allow recurrent neural networks (RNNs) to trade off accuracy for computational cost during the analysis of a sequence. Similarly to existing approaches, Skip-Window extends existing RNN cells by adding a mechanism to encourage the model to process fewer inputs. Unlike existing approaches, Skip-Window is able to respect a strict computational budget, making this model more suitable for limited hardware like edge devices. We evaluate this approach on four datasets: a human activity recognition task, sequential MNIST, IMDB and adding task. Our results show that Skip-Window is often able to exceed the accuracy of existing approaches for a lower computational cost while strictly limiting said cost.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

01/05/2021

When Dictionary Learning Meets Deep Learning: Deep Dictionary Learning and Coding Network for Image Recognition With Limited Data

access here

Author(s)

Hao Tang; Hong Liu; Nicu Sebe; Wei Xiao;

Institution

Peking University; University of Science and Technology of Shenzhen; University of Trento;

Abstract

We present a new deep dictionary learning and coding network (DDLCN) for image-recognition tasks with limited data. The proposed DDLCN has most of the standard deep learning layers (e.g., input/output, pooling, and fully connected), but the fundamental convolutional layers are replaced by our proposed compound dictionary learning and coding layers. The dictionary learning learns an overcomplete dictionary for input training data. At the deep coding layer, a locality constraint is added to guarantee that the activated dictionary bases are close to each other. Then, the activated dictionary atoms are assembled and passed to the compound dictionary learning and coding layers. In this way, the activated atoms in the first layer can be represented by the deeper atoms in the second dictionary. Intuitively, the second dictionary is designed to learn the fine-grained components shared among the input dictionary atoms; thus, a more informative and discriminative low-level representation of the dictionary atoms can be obtained. We empirically compare DDLCN with several leading dictionary learning methods and deep learning models. Experimental results on five popular data sets show that DDLCN achieves competitive results compared with state-of-the-art methods when the training data are limited. Code is available at https://github.com/Ha0Tang/DDLCN.

Access

Closed Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Neural Networks and Learning Systems

access here

05/04/2021

A New Method from a Re-examination of Deep Architectures for Head Motion Prediction in 360-degree Videos

access here

Author(s)

Frédéric Precioso; Lucile Sassatelli; Miguel Rondon; Ramon Aparicio-Pardo;

Institution

Université Côte d'Azur;

Abstract

We consider predicting the user’s head motion in 360° videos, with 2 modalities only: the past user’s positions and the video content (not knowing other users’ traces). We make two main contributions. First, we re-examine existing deep-learning approaches for this problem and identify hidden flaws from a thorough root-cause analysis. Second, from the results of this analysis, we design a new proposal establishing state-of-the-art performance.
First, re-assessing the existing methods that use both modalities, we obtain the surprising result that they all perform worse than baselines using the user’s trajectory only. A root-cause analysis of the metrics, datasets and neural architectures shows in particular that (i) the content can inform the prediction for horizons longer than 2 to 3 sec. (existing methods consider shorter horizons), and that (ii) to compete with the baselines, it is necessary to have a recurrent unit dedicated to process the positions, but this is not sufficient.
Second, from a re-examination of the problem supported with the concept of Structural-RNN, we design a new deep neural architecture, named TRACK. TRACK achieves state-of-the-art performance on all considered datasets and prediction horizons, outperforming competitors by up to 20% on focus-type videos and horizons 2-5 seconds.
The entire framework (codes and datasets) is online and received an ACM reproducibility badge https://gitlab.com/miguelfromeror/head-motion-prediction

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

30/03/2021

The 2021 ImageCLEF Benchmark: Multimedia Retrieval in Medical, Nature, Internet and Social Media Applications

access here

Author(s)

Adrian Clark; Adrian Popescu; Alba Seco de Herrera; Andrei Tauteanu; Antonio Campello; Asma Ben Abacha; Bogdan Ionescu; Christoph Griedrich; Dimitri Fichou; Dina Demner-Fushman; Hassan Moustahdif; Henning Müller; Janadhip Jacutprakart; Jérôme Deshayes-Chossart; Jon Chamberlain; Liviu-Daniel Stefan; Mihai Dogariu Mihai Gabriel Constantin Mourad Sarrouti; Obioma Pelka; Paul Brie; Raul Berari; Renaud Péteri; Sadid A. Hasan; Serge Kozlovski; Thomas Oliver; Vassili Kovalev; Vitali Lianchuk; Yashin Dicente Cid;

Institution

Abigail Schulz; Belarussian Academy of Sciences; CVS Health; La Rochelle University; National Library of Medicine; NOAA/US IOOS; teleportHQ; Unites Institute of Informatics Problems; Université Paris-Saclay; University of Applied Sciences and Art Dortmund; University of Applied Sciences of Western Switzerland; University of Essex; University of Warwick; University Politehnica of Bucharest Wellcome Trust;

Abstract

This paper presents the ideas for the 2021 ImageCLEF lab that will be organized as part of the Conference and Labs of the Evaluation Forum—CLEF Labs 2021 in Bucharest, Romania. ImageCLEF is an ongoing evaluation initiative (active since 2003) that promotes the evaluation of technologies for annotation, indexing and retrieval of visual data with the aim of providing information access to large collections of images in various usage scenarios and domains. In 2021, the 19th edition of ImageCLEF will organize four main tasks: (i) a Medical task addressing visual question answering, a concept annotation and a tuberculosis classification task, (ii) a Coral task addressing the annotation and localisation of substrates in coral reef images, (iii) a DrawnUI task addressing the creation of websites from either a drawing or a screenshot by detecting the different elements present on the design and a new (iv) Aware task addressing the prediction of real-life consequences of online photo sharing. The strong participation in 2020, despite the COVID pandemic, with over 115 research groups registering and 40 submitting over 295 runs for the tasks shows an important interest in this benchmarking campaign. We expect the new tasks to attract at least as many researchers for 2021.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

28/03/2021

Re-Assessing the “Classify and Count” Quantification Method

access here

Author(s)

Alejandro Moreo; Fabrizio Sebastiani;

Institution

ISTI-CNR;

Abstract

Learning to quantify (a.k.a. quantification) is a task concerned with training unbiased estimators of class prevalence via supervised learning. This task originated with the observation that “Classify and Count” (CC), the trivial method of obtaining class prevalence estimates, is often a biased estimator, and thus delivers suboptimal quantification accuracy. Fol- lowing this observation, several methods for learning to quantify have been proposed and have been shown to outperform CC. In this work we contend that previous works have failed to use properly optimised versions of CC. We thus reassess the real merits of CC and its variants, and argue that, while still inferior to some cutting-edge methods, they deliver near-state-of-the- art accuracy once (a) hyperparameter optimisation is performed, and (b) this optimisation is performed by using a truly quantification-oriented evaluation protocol. Experiments on three publicly available binary sentiment classification datasets support these conclusions.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

25/03/2021

Playing Against the Board: Rolling Horizon Evolutionary Algorithms Against Pandemic

access here

Author(s)

Antonios Liapis; Konstantinos Sfikas;

Institution

University of Malta

Abstract

Competitive board games have provided a rich and diverse testbed for artificial intelligence. This paper contends that collaborative board games pose a different challenge to artificial intelligence as it must balance short-term risk mitigation with long-term winning strategies. Collaborative board games task all players to coordinate their different powers or pool their resources to overcome an escalating challenge posed by the board and a stochastic ruleset. This paper focuses on the exemplary collaborative board game Pandemic and presents a rolling horizon evolutionary algorithm designed specifically for this game. The complex way in which the Pandemic game state changes in a stochastic but predictable way required a number of specially designed forward models, macro-action representations for decision-making, and repair functions for the genetic operations of the evolutionary algorithm. Variants of the algorithm which explore optimistic versus pessimistic game state evaluations, different mutation rates and event horizons are compared against a baseline hierarchical policy agent. Results show that an evolutionary approach via short-horizon rollouts can better account for the future dangers that the board may introduce, and guard against them. Results highlight the types of challenges that collaborative board games pose to artificial intelligence, especially for handling multi-player collaboration interactions.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Games

access here

22/03/2021

Heterogeneous Document Embeddings for Cross-Lingual Text Classification

access here

Author(s)

Alejandro Moreo; Andrea Pedrotti; Fabrizio Sebastiani;

Institution

ISTI-CNR;

Abstract

Funnelling (Fun) is a method for cross-lingual text classification (CLC) based on a two-tier ensemble for heterogeneous transfer learning. In Fun, 1st-tier classifiers, each working on a different, language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a meta- classifier that uses this vector as its input. The metaclassifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLC systems where these correlations cannot be leveraged.

We here describe Generalized Funnelling (gFun), a learning ensemble where the metaclassifier receives as input the above vector of calibrated posterior probabilities, concatenated with document embeddings (aligned across languages) that embody other types of correlations, such as word-class correlations (as encoded by Word-Class Embeddings) and word-word correlations (as encoded by Multilingual Unsupervised or Supervised Embeddings). We show that gFun improves on Fun by describing experiments on two large, standard multilingual datasets for multi-label text classification.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

17/03/2021

Optical Flow based CNN for detection of unlearnt deepfake manipulations

access here

Author(s)

Alberto Del Bimbo; Irene Amerini; Leonardo Galteri; Roberto Cardelli;

Institution

CNIT; University of Florence; University of Rome La Sapienza;

Abstract

A new phenomenon named Deepfakes constitutes a serious threat in video manipulation. AI-based technologies have provided easy-to-use methods to create extremely realistic videos. On the side of multimedia forensics, being able to individuate this kind of fake contents becomes ever more crucial. In this work, a new forensic technique able to detect fake and original video sequences is proposed; it is based on the use of CNNs trained to distinguish possible motion dissimilarities in the temporal structure of a video sequence by exploiting optical flow fields. The results obtained highlight comparable performances with the state-of-the-art methods which, in general, only resort to single video frames. Furthermore, the proposed optical flow based detection scheme also provides a superior robustness in the more realistic cross-forgery operative scenario and can even be combined with frame-based approaches to improve their global effectiveness.

Access

Closed Access

Type of Publication

Journal article

Publisher

N/A

access here

22/02/2021

Visual Interestingness Prediction: A Benchmark Framework and Literature Review

access here

Author(s)

Bogdan Ionescu; Claire-Hélène Demarty; Liviu-Daniel Stefan; Mats Sjöberg; Mihai Gabriel Constantin Ngoc Q. K. Duong

Institution

CSC - IT Center for Science; InterDigital University Politehnica of Bucharest

Abstract

In this paper, we report on the creation of a publicly available, common evaluation framework for image and video visual interestingness prediction. We propose a robust data set, the Interestingness10k, with 9831 images and more than 4 h of video, interestigness scores determined based on more than 1M pair-wise annotations of 800 trusted annotators, some pre-computed multi-modal descriptors, and 192 system output results as baselines. The data were validated extensively during the 2016–2017 MediaEval benchmark campaigns. We provide an in-depth analysis of the crucial components of visual interestingness prediction algorithms by reviewing the capabilities and the evolution of the MediaEval benchmark systems, as well as of prominent systems from the literature. We discuss overall trends, influence of the employed features and techniques, generalization capabilities and the reliability of results. We also discuss the possibility of going beyond state-of-the-art performance via an automatic, ad-hoc system fusion, and propose a deep MLP-based architecture that outperforms the current state-of-the-art systems by a large margin. Finally, we provide the most important lessons learned and insights gained.

Access

Open Access

Type of Publication

Journal article

Publisher

International Journal of Computer Vision

access here

19/02/2021

Regular Polytope Networks

access here

Author(s)

Alberto Del Bimbo; Claudio Baecchi; Federico Pernici; Matteo Bruni;

Institution

Università degli Studi di Firenze;

Abstract

Neural networks are widely used as a model for classification in a large variety of tasks. Typically, a learnable transformation (i.e., the classifier) is placed at the end of such models returning a value for each class used for classification. This transformation plays an important role in determining how the generated features change during the learning process. In this work, we argue that this transformation not only can be fixed (i.e., set as nontrainable) with no loss of accuracy and with a reduction in memory usage, but it can also be used to learn stationary and maximally separated embeddings. We show that the stationarity of the embedding and its maximal separated representation can be theoretically justified by setting the weights of the fixed classifier to values taken from the coordinate vertices of the three regular polytopes available in Rd, namely, the d-Simplex, the d-Cube, and the d-Orthoplex. These regular polytopes have the maximal amount of symmetry that can be exploited to generate stationary features angularly centered around their corresponding fixed weights. Our approach improves and broadens the concept of a fixed classifier, recently proposed by Hoffer et al., to a larger class of fixed classifier models. Experimental results confirm the theoretical analysis, the generalization capability, the faster convergence, and the improved performance of the proposed method. Code will be publicly available.

Access

Open Access

Type of Publication

Journal article

Publisher

IEEE Transactions on Neural Networks and Learning Systems

access here

17/02/2021

Evaluation and Comparison of CNN Visual Explanations for Histopathology

access here

Author(s)

Henning Müller; Mara Graziani; Thomas Lompech; Vincent Andrearczy;

Institution

INP-ENSEEIHT; University of Applied Sciences of Western Switzerland; University of Geneva;

Abstract

Visualization methods for Convolutional Neural Net-works (CNNs) are spreading within the medical com-munity to obtain explainable AI (XAI). The sole quali-tative assessment of the explanations is subject to a riskof confirmation bias. This paper proposes a methodol-ogy for the quantitative evaluation of common visual-ization approaches for histopathology images, i.e. ClassActivation Mapping and Local-Interpretable Model-Agnostic Explanations. In our evaluation, we proposeto assess four main points, namely the alignment withclinical factors, the agreement between XAI methods,the consistency and repeatability of the explanations. Todo so, we compare the intersection over union of multi-ple visualizations of the CNN attention with the seman-tic annotation of functionally different nuclei types. Theexperimental results do not show stronger attributions tothe multiple nuclei types than those of a randomly ini-tialized CNN. The visualizations hardly agree on salientareas and LIME outputs have particularly unstable re-peatability and consistency. The qualitative evaluationalone is thus not sufficient to establish the appropriate-ness and reliability of the visualization tools. The codeis available on GitHub atbit.ly/2K48HKz.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

16/02/2021

A comparative study of calibration methods for imbalanced class incremental learning

access here

Author(s)

Adrian Popescu; Céline Hudelot; Eden Belouadah; Umang Aggarwal

Institution

CEA; Université Paris-Saclay;

Abstract

Deep learning approaches are successful in a wide range of AI problems and in particular for visual recognition tasks. However, there are still open problems among which is the capacity to handle streams of visual information and the management of class imbalance in datasets. Existing research approaches these two problems separately while they co-occur in real world applications. Here, we study the problem of learning incrementally from imbalanced datasets. We focus on algorithms which have a constant deep model complexity and use a bounded memory to store exemplars of old classes across incremental states. Since memory is bounded, old classes are learned with fewer images than new classes and an imbalance due to incremental learning is added to the initial dataset imbalance. A score prediction bias in favor of new classes appears and we evaluate a comprehensive set of score calibration methods to reduce it. Evaluation is carried with three datasets, using two dataset imbalance configurations and three bounded memory sizes. Results show that most calibration methods have beneficial effect and that they are most useful for lower bounded memory sizes, which are most interesting in practice. As a secondary contribution, we remove the usual distillation component from the loss function of incremental learning algorithms. We show that simpler vanilla fine tuning is a stronger backbone for imbalanced incremental learning algorithms.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

21/01/2021

DeepFusion: Deep Ensembles for Domain Independent System Fusion

access here

Author(s)

Bogdan Ionescu; Liviu-Daniel Stefan; Mihai Gabriel Constantin

Institution

University Politehnica of Bucharest

Abstract

While ensemble systems and late fusion mechanisms have proven their effectiveness by achieving state-of-the-art results in various computer vision tasks, current approaches are not exploiting the power of deep neural networks as their primary ensembling algorithm, but only as inducers, i.e., systems that are used as inputs for the primary ensembling algorithm. In this paper, we propose several deep neural network architectures as ensembling algorithms with various network configurations that use dense and attention layers, an input pre-processing algorithm, and a new type of deep neural network layer denoted the Cross-Space-Fusion layer, that further improves the overall results. Experimental validation is carried out on several data sets from various domains (emotional content classification, medical data captioning) and under various evaluation conditions (two-class regression, binary classification, and multi-label classification), proving the efficiency of DeepFusion.

Access

Open Access

Type of Publication

Conference paper

Publisher

MultiMedia Modeling

access here

10/01/2021

Defending Neural ODE Image Classifiers from Adversarial Attacks with Tolerance Randomization

access here

Author(s)

Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Roberto Cardelli;

Institution

CNIT Florence ISTI-CNR;

Abstract

Deep learned models are now largely adopted in different fields, and they generally provide superior performances with respect to classical signal-based approaches. Notwithstanding this, their actual reliability when working in an unprotected environment is far enough to be proven. In this work, we consider a novel deep neural network architecture, named Neural Ordinary Differential Equations (N-ODE), that is getting particular attention due to an attractive property–a test-time tunable trade-off between accuracy and efficiency. This paper analyzes the robustness of N-ODE image classifiers when faced against a strong adversarial attack and how its effectiveness changes when varying such a tunable trade-off. We show that adversarial robustness is increased when the networks operate in different tolerance regimes during test time and training time. On this basis, we propose a novel adversarial detection strategy for N-ODE nets based on the randomization of the adaptive ODE solver tolerance. Our evaluation performed on standard image classification benchmarks shows that our detection technique provides high rejection of adversarial examples while maintaining most of the original samples under white-box attacks and zero-knowledge adversaries.

Access

Open Access

Type of Publication

Conference paper

Publisher

Multimedia FORensics in the WILD

access here

31/12/2020

Word-Class Embeddings for Multiclass Text Classification

access here

Author(s)

Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani;

Institution

ISTI-CNR;

Abstract

Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using six popular neural architectures and six widely used and publicly available datasets for multi- class text classification. One further advantage of this method is that it is conceptually simple and straightforward to implement. Our code that implements WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings.

Access

Open Access

Type of Publication

Journal article

Publisher

Data Mining and Knowledge Discovery

access here

20/12/2020

A Critical Reassessment of the Saerens-Latinne-Decaestecker Algorithm for Posterior Probability Adjustment

access here

Author(s)

Andrea Esuli; Fabrizio Sebastiani;

Institution

ISTI-CNR;

Abstract

We critically re-examine the Saerens-Latinne-Decaestecker (SLD) algorithm, a well-known method for estimating class prior probabilities (“priors”) and adjusting posterior probabilities (“posteriors”) in scenarios characterized by distribution shift, i.e., difference in the distribution of the priors between the training and the unlabelled documents. Given a machine learned classifier and a set of unlabelled documents for which the classifier has returned posterior probabilities and estimates of the prior probabilities, SLD updates them both in an iterative, mutually recursive way, with the goal of making both more accurate; this is of key importance in downstream tasks such as single-label multiclass classification and cost-sensitive text classification. Since its publication, SLD has become the standard algorithm for improving the quality of the posteriors in the presence of distribution shift, and SLD is still considered a top contender when we need to estimate the priors (a task that has become known as “quantification”). However, its real effectiveness in improving the quality of the posteriors has been questioned. We here present the results of systematic experiments conducted on a large, publicly available dataset, across multiple amounts of distribution shift and multiple learners. Our experiments show that SLD improves the quality of the posterior probabilities and of the estimates of the prior probabilities, but only when the number of classes in the classification scheme is very small and the classifier is calibrated. As the number of classes grows, or as we use non-calibrated classifiers, SLD converges more slowly (and often does not converge at all), performance degrades rapidly, and the impact of SLD on the quality of the prior estimates and of the posteriors becomes negative rather than positive.

Access

Open Access

Type of Publication

Journal article

Publisher

ACM Transactions on Information Systems

access here

08/12/2020

A Comprehensive Study of Class Incremental Learning Algorithms for Visual Tasks

access here

Author(s)

Adrian Popescu; Eden Belouadah; Ioannis Kanellos

Institution

IMT Atlantic; Université Paris-Saclay;

Abstract

The ability of artificial agents to increment their capabilities when confronted with new data is an open challenge in artificial intelligence. The main challenge faced in such cases is catastrophic forgetting, i.e., the tendency of neural networks to underfit past data when new ones are ingested. A first group of approaches tackles forgetting by increasing deep model capacity to accommodate new knowledge. A second type of approaches fix the deep model size and introduce a mechanism whose objective is to ensure a good compromise between stability and plasticity of the model. While the first type of algorithms were compared thoroughly, this is not the case for methods which exploit a fixed size model. Here, we focus on the latter, place them in a common conceptual and experimental framework and propose the following contributions: (1) define six desirable properties of incremental learning algorithms and analyze them according to these properties, (2) introduce a unified formalization of the class-incremental learning problem, (3) propose a common evaluation framework which is more thorough than existing ones in terms of number of datasets, size of datasets, size of bounded memory and number of incremental states, (4) investigate the usefulness of herding for past exemplars selection, (5) provide experimental evidence that it is possible to obtain competitive performance without the use of knowledge distillation to tackle catastrophic forgetting and (6) facilitate reproducibility by integrating all tested methods in a common open-source repository. The main experimental finding is that none of the existing algorithms achieves the best results in all evaluated settings. Important differences arise notably if a bounded memory of past classes is allowed or not.

Access

Open Access

Type of Publication

Journal article

Publisher

Neural Networks

access here

09/10/2020

Deep Learning for Procedural Content Generation

access here

Author(s)

Ahmed Khalifa; Georgios N. Yannakakis; Jialin Liu; Julian Togelius; Sam Snodgrass; Sebastian Risi;

Institution

IT University of Copenhagen; Modl.ai; New York University; University of Malta University of Science and Technology of Shenzhen;

Abstract

Procedural content generation in video games has a long history. Existing procedural content generation methods, such as search-based, solver-based, rule-based and grammar-based methods have been applied to various content types such as levels, maps, character models, and textures. A research field centered on content generation in games has existed for more than a decade. More recently, deep learning has powered a remarkable range of inventions in content production, which are applicable to games. While some cutting-edge deep learning methods are applied on their own, others are applied in combination with more traditional methods, or in an interactive setting. This article surveys the various deep learning methods that have been applied to generate game content directly or indirectly, discusses deep learning methods that could be used for content generation purposes but are rarely used today, and envisages some limitations and potential future directions of deep learning for procedural content generation.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders

access here

Author(s)

Andrea Esuli; Claudio Gennaro; Fabrizio Falchi; Giuseppe Amato; Nicola Messina; Stéphane Marchand-Maillet

Institution

ISTI-CNR; University of Geneva;

Abstract

Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task.

Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links invalidate any chance to separately extract visual and textual features needed for the online search and the offline indexing steps in large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way toward the research for effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% respectively on the image and the sentence retrieval tasks on the Recall@1 metric. The code used for the experiments is publicly available on GitHub at https://github.com/mesnico/TERAN.

Access

Open Access

Type of Publication

Journal article

Publisher

N/A

access here

Overview of The MediaEval 2021 Predicting Media Memorability Task

access here

Author(s)

Alba Seco de Herrera; Ana Matran-Fernandez; Bogdan Ionescu; Camilo Fosco; Claire-Hélène Demarty; Graham Healy; lan Smeaton; Lorin Sweeney; Mihai Gabriel Constantin Rukiye Savran Kiziltepe; Sebastian Halder

Institution

Dublin City University; Massachusetts Institute of Technology Cambridge; University of Essex; University Politehnica of Bucharest

Abstract

This paper describes the MediaEval 2021 Predicting Media Memorability task, which is in its 4th edition this year, as the prediction of short-term and long-term video memorability remains a challenging task. In 2021, two datasets of videos are used: first, a subset of the TRECVid 2019 Video-to-Text dataset; second, the Memento10K dataset in order to provide opportunities to explore cross-dataset generalisation. In addition, an Electroencephalography (EEG)-based prediction pilot subtask is introduced. In this paper, we outline the main aspects of the task and describe the datasets, evaluation metrics, and requirements for participants’ submissions

Access

Open Access

Type of Publication

Conference paper

Publisher

MediaEval

access here

Learning to Generalize Unseen Domains via Memory-based Multi-Source Meta-Learning for Person Re-Identification

access here

Author(s)

Fengxiang Yang; Nicu Sebe; Yaojing Lin; Yuyang Zhao; Zhiming Luo; Zhun Zhong

Institution

Minnan Normal University; University of Trento; Xiamen University

Abstract

Recent advances in person re-identification (ReID) obtain impressive accuracy in the supervised and unsupervised learning settings. However, most of the existing methods need to train a new model for a new domain by accessing data. Due to public privacy, the new domain data are not always accessible, leading to a limited applicability of these methods. In this paper, we study the problem of multisource domain generalization in ReID, which aims to learn a model that can perform well on unseen domains with only several labeled source domains. To address this problem, we propose the Memory-based Multi-Source Meta-Learning (M3L) framework to train a generalizable model for unseen domains. Specifically, a meta-learning strategy is introduced to simulate the train-test process of domain generalization for learning more generalizable models. To overcome the unstable meta-optimization caused by the parametric classifier, we propose a memory-based identification loss that is non-parametric and harmonizes with meta-learning. We also present a meta batch normalization layer (MetaBN) to diversify meta-test features, further establishing the advantage of meta-learning. Experiments demonstrate that our M3L can effectively enhance the generalization ability of the model for unseen domains and can outperform the state-of-the-art methods on four large-scale ReID datasets.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

Domain Adaptation for Traffic Density Estimation

access here

Author(s)

Carlos Santiago; Claudio Gennaro; Giuseppe Amato; João Paulo Costeira; Luca Ciampi;

Institution

Instituto Superior Técnico; ISTI-CNR;

Abstract

Convolutional Neural Networks have produced state-of-the-art results for a multitude of computer vision tasks under supervised learning. However, the crux of these methods is the need for a massive amount of labeled data to guarantee that they generalize well to diverse testing scenarios. In many real-world applications, there is indeed a large domain shift between the distributions of the train (source) and test (target) domains, leading to a significant drop in performance at inference time. Unsupervised Domain Adaptation (UDA) is a class of techniques that aims to mitigate this drawback without the need for labeled data in the target domain. This makes it particularly useful for the tasks in which acquiring new labeled data is very expensive, such as for semantic and instance segmentation. In this work, we propose an end-to-end CNN-based UDA algorithm for traffic density estimation and counting, based on adversarial learning in the output space. The density estimation is one of those tasks requiring per-pixel annotated labels and, therefore, needs a lot of human effort. We conduct experiments considering different types of domain shifts, and we make publicly available two new datasets for the vehicle counting task that were also used for our tests. One of them, the Grand Traffic Auto dataset, is a synthetic collection of images, obtained using the graphical engine of the Grand Theft Auto video game, automatically annotated with precise per-pixel labels. Experiments show a significant improvement using our UDA algorithm compared to the model’s performance without domain adaptation.

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

Multi-label quantification

access here

Author(s)

Alejandro Moreo; Fabrizio Sebastiani; Manuel Francisco

Institution

ISTI-CNR; University of Granada

Abstract

Quantification, variously called supervised prevalence estimation or learning to quantify, is the supervised learning task of generating predictors of the relative frequencies (a.k.a. prevalence values) of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and, to a lesser extent, single-label multiclass problems, the multi-label setting (i.e., the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored. A straightforward solution to the multi-label quantification problem could simply consist of recasting the problem as a set of independent binary quantification problems. Such a solution is simple but naïve, since the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing the relative frequency of one class could be of help in determining the prevalence of other related classes. We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label solutions outperform the naïve approaches by a large margin. The code to reproduce all our experiments is available online.

Access

Open Access

Type of Publication

Journal article

Publisher

ACM Transactions on Knowledge Discovery from Data

access here

Self-Supervised Representation Learning with Cross-Context Learning between Global and Hypercolumn Features

access here

Author(s)

Chen Feng; Ioannis Patras; Zheng Gao;

Institution

Queen Mary University of London;

Abstract

Access

Open Access

Type of Publication

Conference paper

Publisher

N/A

access here

A Spatio-Temporal Attentive Network for Video-Based Crowd Counting

access here

Author(s)

Claudio Gennaro; Fabrizio Falchi; Luca Ciampi; Marco Avvenuti; Marco Bongiovanni; Nicola Messina;

Institution

ISTI-CNR; University of Pisa

Abstract

Automatic people counting from images has recently drawn attention for urban monitoring in modern Smart Cities due to the ubiquity of surveillance camera networks. Current computer vision techniques rely on deep learning-based algorithms that estimate pedestrian densities in still, individual images. Only a bunch of works take advantage of temporal consistency in video sequences. In this work, we propose a spatio-temporal attentive neural network to estimate the number of pedestrians from surveillance videos. By taking advantage of the temporal correlation between consecutive frames, we lowered state-of-the-art count error by 5% and localization error by 7.5% on the widely-used FDST benchmark.

Access

Open Access

Type of Publication

Conference paper

Publisher

IEEE Symposium on Computers and Communications

access here

Cookie Settings

AI4Media may use cookies to store your login data, collect statistics to optimize the website’s functionality and to perform marketing actions based on your interests. You can personalize your cookies in .

Scientific papers

Find all the scientific publications produced by AI4Media partners, presenting the latest scientific findings of the project.