Alan F. Smeaton; Alba Seco de Herrera; Bogdan Ionescu; Claire-Hélène Demarty; Faiyaz Doctor; Graham Healy; Lorin Sweeney; Mihai Gabriel Constantin; Rukiye Savran Kiziltepe
Dublin City University; InterDigital Paris; University of Essex; University Politehnica of Bucharest
Using a collection of publicly available links to short-form video clips of an average duration of 6 seconds each, 1,275 users manually annotated each video multiple times to indicate both long-term and short-term memorability of the videos. The annotations were gathered as part of an online memory game and measured a participant’s ability to recall having seen the video previously when shown a collection of videos. The recognition tasks were performed on videos seen within the previous few minutes for short-term memorability and within the previous 24 to 72 hours for long-term memorability. Data includes the reaction times for each recognition of each video. Associated with each video are text descriptions (captions) as well as a collection of image-level features applied to 3 frames extracted from each video (start, middle and end). Video-level features are also provided. The dataset was used in the Video Memorability task as part of the MediaEval benchmark in 2020.
Open Access
Journal article
Data in Brief
Daniel Gatica-Perez; Mario Parra
Idiap Research Institute
In this study, we evaluated the feasibility of using zero-shot classification models for activity recognition in a Digital Sommelier. Our experiment involved preprocessing video data by extracting frames and categorizing user activities related to a wine-tasting scenario. Image classification models demonstrated high accuracy, nearing 90%, in distinguishing between “engaged” and “disengaged” states. However, video classification models presented lower performance in classifying user activities such as “observing wine”, “smelling wine” and “sipping wine”, with an average accuracy of around 50% due to the interdependent nature of activities. Despite these challenges, our findings highlight the potential of zero-shot classification models in enhancing virtual assistants’ ability to recognize and respond to user activities.
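As an illustration of the zero-shot frame-classification setup described above (not code from the paper), the following sketch uses Hugging Face's CLIP; the checkpoint name, the example frame path and the activity prompts are assumptions chosen for the example:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the study does not specify which zero-shot model was used.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate activity labels phrased as natural-language prompts (illustrative only).
prompts = [
    "a person observing a glass of wine",
    "a person smelling a glass of wine",
    "a person sipping wine",
    "a person looking away, disengaged from the tasting",
]

frame = Image.open("frame_0001.jpg")  # a frame extracted from the video beforehand
inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image      # similarity of the frame to each prompt
probs = logits.softmax(dim=-1).squeeze(0)
print(prompts[probs.argmax().item()], probs.max().item())
```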
Open Access
Publication
N/A
Mathias-Felipe de Lima-Santos; Wilson Ceron
Universidade Federal de São Paulo; University of Amsterdam;
The information landscape has undergone significant transformations with the widespread adoption of the internet and online social networks. This has led to both positive and negative consequences. On the positive side, information can now spread quickly and reach a vast audience. Social media platforms have played a crucial role in fostering a culture of participation by motivating people to actively create and share content. However, there are also drawbacks. Social media platforms employ algorithms that restrict the diversity of content users are exposed to, leading to the reinforcement of pre-existing beliefs, commonly referred to as “echo chambers”.
Open Access
Book section
Mapping Lies in the Global Media Sphere - Routledge
Alex Gomez-Villa; Bartłomiej Twardowski; Joost van de Weijer; Marco Buzzelli; Simone Zini
Autonomous University of Barcelona; University of Florence; University of Milano-Bicocca
Several recent works on self-supervised learning are trained by mapping different augmentations of the same image to the same feature representation. The data augmentations used are of crucial importance to the quality of learned feature representations. In this paper, we analyze how the color jitter traditionally used in data augmentation negatively impacts the quality of the color features in learned feature representations. To address this problem, we propose a more realistic, physics-based color data augmentation – which we call Planckian Jitter – that creates realistic variations in chromaticity and produces a model robust to illumination changes that can be commonly observed in real life, while maintaining the ability to discriminate image content based on color information. Experiments confirm that such a representation is complementary to the representations learned with the currently-used color jitter augmentation and that a simple concatenation leads to significant performance gains on a wide range of downstream datasets.
In addition, we present a color sensitivity analysis that documents the impact of different training methods on model neurons and shows that the performance of the learned features is robust with respect to illuminant variations. Official code available at: https://github.com/TheZino/PlanckianJitter
Open Access
Conference paper
N/A
Hanna Lukashevich; Jakob Abeßer; Joachim Bös; Sascha Grollmisch; Sebastian Stober
Fraunhofer IDMT; Otto-von-Guericke University Magdeburg
Music classification algorithms use signal processing and machine learning approaches to extract and enrich metadata for audio recordings in music archives. Common tasks include music genre classification, where each song is assigned a single label (such as Rock, Pop, or Jazz), and musical instrument classification. Since music metadata can be ambiguous, classification algorithms cannot always achieve fully accurate predictions. Therefore, our focus extends beyond the correctly estimated class labels to include realistic confidence values for each potential genre or instrument label. In practice, many state-of-the-art classification algorithms based on deep neural networks exhibit overconfident predictions, complicating the interpretation of the final output values. In this work, we examine whether the issue of overconfident predictions and, consequently, non-representative confidence values is also relevant to music genre classification and musical instrument classification.
Moreover, we describe techniques to mitigate this behavior and assess the impact of deep ensembles and temperature scaling in generating more realistic confidence outputs, which can be directly employed in real-world music tagging applications.
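As a generic illustration of one of the calibration techniques named above (not the authors' implementation), a minimal temperature-scaling sketch in PyTorch; the held-out logits/labels and the optimizer settings are assumptions:

```python
import torch

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Learn a single temperature T that rescales logits to minimise NLL on held-out data."""
    log_t = torch.zeros(1, requires_grad=True)          # optimise log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# At inference time, calibrated confidences are softmax(logits / T) instead of softmax(logits).
```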
Open Access
Conference paper Publication
Audio Mostly Conference
Alejandro Moreo; Berta Chulvi; Paolo Rosso; Silvia Corbara
ISTI-CNR; Scuola Normale Superiore; Universitat Politècnica de València
Among the many tasks of the authorship field, Authorship Identification aims at uncovering the author of a document, while Author Profiling focuses on the analysis of personal characteristics of the author(s), such as gender, age, etc. Methods devised for such tasks typically focus on the style of the writing, and are expected not to make inferences grounded on the topics that certain authors tend to write about. In this paper, we present a series of experiments evaluating the use of topic-agnostic feature sets for Authorship Identification and Author Profiling tasks in Spanish political language. In particular, we propose to employ features based on rhythmic and psycholinguistic patterns, obtained via different approaches of text masking that we use to actively mask the underlying topic. We feed these feature sets to an SVM learner, and show that they lead to results that are comparable to those obtained by a BETO transformer, when the latter is trained on the original text, i.e., potentially learning from topical information. Moreover, we further investigate the results for the different authors, showing that variations in performance are partially explainable.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Andrea Ciamarra; Federico Becattini; Lorenzo Seidenari
University of Florence;
Forecasting motion and spatial positions of objects is of fundamental importance, especially in safety-critical settings such as autonomous driving. In this work, we address the issue by forecasting two different modalities that carry complementary information, namely optical flow and depth. To this end we propose FLODCAST, a flow and depth forecasting model that leverages a multitask recurrent architecture, trained to jointly forecast both modalities at once. We stress the importance of training using flows and depth maps together, demonstrating that both tasks improve when the model is informed of the other modality. We train the proposed model to also perform predictions for several timesteps in the future. This provides better supervision and leads to more precise predictions, retaining the capability of the model to yield outputs autoregressively for any future time horizon. We test our model on the challenging Cityscapes dataset, obtaining state-of-the-art results for both flow and depth forecasting. Thanks to the high quality of the generated flows, we also report benefits on the downstream task of segmentation forecasting, injecting our predictions in a flow-based mask-warping framework.
Open Access
Journal article
Pattern Recognition Letters
Christos Koutlis; Giorgios Kordopatis-Zilos; Ioannis Kompatsiaris; Ioannis Sarridis; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
In this paper, we introduce InDistill, a method that serves as a warmup stage for enhancing Knowledge Distillation (KD) effectiveness. InDistill focuses on transferring critical information flow paths from a heavyweight teacher to a lightweight student. This is achieved through a curriculum learning-based training scheme that considers the distillation difficulty of each layer and the critical learning periods when the information flow paths are established. This procedure can lead to a student model that is better prepared to learn from the teacher. To ensure the applicability of InDistill across a wide range of teacher-student pairs, we also incorporate a pruning operation when there is a discrepancy in the width of the teacher and student layers. This pruning operation reduces the width of the teacher’s intermediate layers to match those of the student, allowing direct distillation without the need for an encoding stage. The proposed method is extensively evaluated using various pairs of teacher-student architectures on the CIFAR-10, CIFAR-100, and ImageNet datasets, showcasing that preserving the information flow paths consistently increases the performance of the baseline KD approaches in both classification and retrieval settings.
Open Access
Conference paper
N/A
Christos Papaioannidis; Ioanna Valsamara; Ioannis Pitas
Aristotle University of Thessaloniki;
Recently, multi-agent systems that facilitate knowledge sharing among Deep Neural Network (DNN) agents, have gained increasing attention. This paper explores the dynamics of multi-agent systems that support Teacher-Student DNN interactions, where knowledge is distilled from Teachers to Students. Within such systems, selecting the most compatible Teacher for a given task is far from trivial and can lead to low-quality decisions. Hence, the need arises for accurate domain knowledge evaluation. In that context, we propose including an OOD detection module in each DNN agent to enable effective agent expertise evaluation and precise identification of suitable Teachers. This setup allows Student agents to distill knowledge from the most knowledgeable Teachers within a specific domain, ensuring optimal system performance. To effectively utilize OOD detection in this context, we address key challenges such as determining the minimum data cardinality required to ensure optimal performance and reliable inferences of the OOD detectors.
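The paper does not prescribe a particular detector, but a common OOD score such a module could compute is the maximum softmax probability of the agent's own DNN; a minimal sketch, with the gating threshold left as an assumption:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ood_scores(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Higher score = more likely in-distribution (maximum softmax probability)."""
    return F.softmax(model(x), dim=-1).max(dim=-1).values

def is_in_domain(model: torch.nn.Module, x: torch.Tensor, threshold: float = 0.7) -> bool:
    """Illustrative gating rule: a Teacher whose detector marks the query batch as
    in-distribution is considered competent for that domain and eligible to teach."""
    return ood_scores(model, x).mean().item() >= threshold
```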
Open Access
Paper Publication Research article
N/A
Christos Papaioannidis; Ioanna Valsamara; Ioannis Pitas
Aristotle University of Thessaloniki;
In today’s data-driven world, the exponential growth of data across various sectors presents unique opportunities and challenges. In this paper, we propose a novel method tailored to enhance the efficiency of Deep Neural Networks (DNNs) in managing these vast data amounts. The primary challenge addressed is the ability of DNNs to provide inferences on the minimal amount of data without sacrificing their quality, a significant concern given the vast scales involved in big data analytics. Our approach emphasizes DNN inference efficiency and reliability, enabling DNNs to deliver accurate inferences while substantially reducing computational complexity. This study explores the increasingly attractive deployment of DNNs for complex tasks, focusing on determining the minimal amount of data necessary to ensure optimal network performance and reliable inference outputs, improving the applicability of DNNs across various big data environments.
Open Access
Paper Publication Research article
N/A
Dimitrios Papaioannou; Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki; University of Antwerp
In the realm of machine learning systems, achieving consensus among networking nodes is a fundamental yet challenging task. This paper presents Proof of Quality Inference (PoQI), a novel consensus protocol designed to integrate deep learning inference under the basic format of the Practical Byzantine Fault Tolerant (P-BFT) algorithm. PoQI is applied to Deep Neural Networks (DNNs) to infer the quality and authenticity of produced estimations by evaluating the trustworthiness of the DNN node’s decisions. In this manner, PoQI enables DNN inference nodes to reach a consensus on a common DNN inference history in a fully decentralized fashion, rather than relying on a centralized inference decision-making process. Through P-BFT adoption, our method ensures Byzantine fault tolerance, permitting DNN nodes to reach an agreement on inference validity swiftly and efficiently. We demonstrate the efficacy of PoQI through theoretical analysis and empirical evaluations, highlighting its potential to forge trust among unreliable DNN nodes.
Open Access
Paper Preprint Publication
N/A
Antidio Viguria; Francisco Pérez-Grau; Ioannis Pitas; Marco Montes-Grova; Vasileios Mygdalis
Aristotle University of Thessaloniki;
Novel view synthesis is the task of generating new images that render an object or scene from a different viewpoint than the one given. It aims to create new views of a specific subject starting from a number of pictures taken from known points of view. The novel view synthesis problem can be approached in two different ways: as a problem of interpolation of images between two known images or extrapolation of images from one or a subset of images. In this work, the extrapolation problem is addressed, taking advantage of the fact that the trajectories we want the capturing camera to execute can be pre-calculated from a series of known shot types. Based on this and on autoregressive Transformers, we present an end-to-end tool for novel-view synthesis from previously unvisited points of view for aerial cinematography robots.
Open Access
Paper Preprint Publication
N/A
Anestis Christidis; Christos Papaioannidis; Ioannis Mademlis; Ioannis Pitas
Aristotle University of Thessaloniki;
Human gesture recognition is a very important tool in human-computer or human-robot interaction. In many cases, such algorithms may need to be executed on systems with limited computational capabilities, due to size or weight constraints, introducing restrictions that can impede gesture recognition performance. This paper proposes a gesture recognition method that is based on a very simple and lightweight Deep Neural Network (DNN) architecture, suitable for embedded execution. In order to achieve increased accuracy without a large computational/memory overhead, the proposed method utilizes as input both full 2D human body skeletons and image patches extracted from regions of interest (e.g., around human arms) in each video frame. These two input types are processed in parallel by separate modules and the corresponding features are fused before being exploited for gesture recognition. Reliance on 2D skeleton sequences allows the utilization of a lightweight DNN architecture, while the image patches convey rich semantic information that enhances gesture recognition performance. This approach is unlike existing similar methods, which only exploit skeleton sequences. Experimental evaluation indeed shows increased recognition accuracy, indicating that the proposed method offers a reliable solution for human gesture recognition on embedded systems.
Open Access
Paper Preprint Publication
N/A
Claudio Gennaro; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;
ISTI-CNR;
Recent work on sample efficient training of Deep Neural Networks (DNNs) proposed a semi-supervised methodology based on biologically inspired Hebbian learning, combined with traditional backprop-based training. Promising results were achieved on various computer vision benchmarks, in scenarios of scarce labeled data availability. However, current Hebbian learning solutions can hardly address large-scale scenarios due to their demanding computational cost. In order to tackle this limitation, in this contribution, we investigate a novel solution, named FastHebb (FH), based on the reformulation of Hebbian learning rules in terms of matrix multiplications, which can be executed more efficiently on GPU. Starting from Soft-Winner-Takes-All (SWTA) and Hebbian Principal Component Analysis (HPCA) learning rules, we formulate their improved FH versions: SWTA-FH and HPCA-FH. We experimentally show that the proposed approach accelerates training speed up to 70 times, allowing us to gracefully scale Hebbian learning experiments on large datasets and network architectures such as ImageNet and VGG.
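To illustrate the kind of reformulation involved (a sketch under our own assumptions, not the FastHebb code), a soft-winner-takes-all Hebbian update for one layer can be expressed over a whole mini-batch with two matrix multiplications, which is what makes it GPU-friendly:

```python
import torch

def swta_hebbian_step(W, X, lr=0.01, temperature=0.1):
    """One batched soft-WTA Hebbian update.
    W: (units, features) weights; X: (batch, features) inputs."""
    similarities = X @ W.t()                              # (batch, units)
    Y = torch.softmax(similarities / temperature, dim=1)  # soft competition between units
    # delta_w_i = sum_b y_bi * (x_b - w_i), written in matrix form:
    delta = Y.t() @ X - Y.sum(dim=0).unsqueeze(1) * W
    return W + lr * delta
```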
Open Access
Journal article
Carlos Santiago; Claudio Gennaro; Fabrizio Falchi; Giuseppe Amato; Luca Ciampi;
Institute of Information Science and Technologies; Instituto Superior Técnico
This work addresses the challenge of video violence detection in data-scarce scenarios, focusing on bridging the domain gap that often hinders the performance of deep learning models when applied to unseen domains. We present a novel unsupervised domain adaptation (UDA) scheme designed to effectively mitigate this gap by combining supervised learning in the train (source) domain with unlabeled test (target) data. We employ single-image classification and multiple instance learning (MIL) to select frames with the highest classification scores, and, upon this, we exploit UDA techniques to adapt the model to unlabeled target domains. We perform an extensive experimental evaluation, using general-context data as the source domain and target domain datasets collected in specific environments, such as violent/non-violent actions in hockey matches and public transport. The results demonstrate that our UDA pipeline substantially enhances model performances, improving their generalization capabilities in novel scenarios without requiring additional labeled data.
Open Access
Journal article
SN Computer Science
Alberto Del Bimbo; Federico Becattini; Francesco Marchetti; Lorenzo Seidenari; Lucile Sassatelli; Quentin Guimard
Institut Universitaire de France; Université Côte d'Azur; University of Florence;
Prediction of head movements in immersive media is key to designing efficient streaming systems able to focus the bandwidth budget on visible areas of the content. However, most of the numerous proposals made to predict user head motion in 360° images and videos do not explicitly consider a prominent characteristic of the head motion data: its intrinsic uncertainty. In this article, we present an approach to generate multiple plausible futures of head motion in 360° videos, given a common past trajectory. To our knowledge, this is the first work that considers the problem of multiple head motion prediction for 360° video streaming. We introduce our discrete variational multiple sequence (DVMS) learning framework, which builds on deep latent variable models. We design a training procedure to obtain a flexible, lightweight stochastic prediction model compatible with sequence-to-sequence neural architectures. Experimental results on 4 different datasets show that our method DVMS outperforms competitors adapted from the self-driving domain by up to 41% on prediction horizons up to 5 sec., at lower computational and memory costs. To understand how the learned features account for the motion uncertainty, we analyze the structure of the learned latent space and connect it with the physical properties of the trajectories. We also introduce a method to estimate the likelihood of each generated trajectory, enabling the integration of DVMS in a streaming system. We hence deploy an extensive evaluation of the interest of our DVMS proposal for a streaming system. To do so, we first introduce a new Python-based 360° streaming simulator that we make available to the community. On real-world user, video, and networking data, we show that predicting multiple trajectories yields higher fairness between the traces, the gains for 20 to 30% of the users reaching up to 10% in visual quality for the best number K of trajectories to generate.
Open Access
Journal article
ACM Transactions on Multimedia Computing, Communications, and Applications
Claudio Gennaro; Giuseppe Amato; Lucia Vadicamo;
ISTI-CNR;
Permutation-based Indexing (PBI) approaches have been proven to be particularly effective for conducting large-scale approximate metric searching. These methods rely on the idea of transforming the original metric objects into permutation representations, which can be efficiently indexed using data structures such as inverted files.
The standard conceptualization of permutation associated with a metric object involves only the use of object distances and their relative orders from a set of anchors called pivots. In this paper, we generalized this definition in order to enlarge the class of permutation representations that can be used by PBI approaches. In particular, we introduced the concept of permutation induced by a space transformation and a sorting function, and we investigated which properties these transformations should possess to produce permutations that are effective for metric search. Furthermore, as a practical outcome, we defined a new type of permutation representation that is calculated using distances from pairs of pivots. This proposed technique allowed us to produce longer permutations than traditional ones for the same number of object pivot distance calculations. The advantage lies in the fact that when longer permutations are employed, the use of inverted files built on permutation prefixes leads to greater efficiency in the search phase.
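For readers unfamiliar with PBI, the classical object-to-permutation mapping recalled above can be sketched in a few lines (an illustration of the standard definition, not the paper's generalized pivot-pair variant):

```python
import numpy as np

def permutation_representation(obj, pivots, prefix_len=None):
    """Classical permutation of an object: pivot identifiers ordered by increasing
    distance from the object; truncating to a prefix gives the representation
    typically stored in inverted files."""
    dists = np.linalg.norm(pivots - obj, axis=1)   # distances to each pivot
    perm = np.argsort(dists)                       # pivot ids, closest first
    return perm if prefix_len is None else perm[:prefix_len]

# Two objects are then compared through their permutations (e.g., with the Spearman
# footrule distance) instead of through the original, more expensive metric.
```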
Open Access
Journal article
N/A
Georgios Tzimiropoulos; Ioannis Maniadis Metaxas; Ioannis Patras;
Queen Mary University of London;
Self-supervised learning has recently emerged as the preeminent pretraining paradigm across and between modalities, with remarkable results. In the image domain specifically, group (or cluster) discrimination has been one of the most successful methods. However, such frameworks need to guard against heavily imbalanced cluster assignments to prevent collapse to trivial solutions. Existing works typically solve this by reweighing cluster assignments to promote balance, or with offline operations (e.g. regular re-clustering) that prevent collapse. However, the former typically requires large batch sizes, which leads to increased resource requirements, and the latter introduces scalability issues with regard to large datasets. In this work, we propose ExCB, a framework that tackles this problem with a novel cluster balancing method. ExCB estimates the relative size of the clusters across batches and balances them by adjusting cluster assignments, proportionately to their relative size and in an online manner. Thereby, it overcomes previous methods’ dependence on large batch sizes and is fully online, and therefore scalable to any dataset. We conduct extensive experiments to evaluate our approach and demonstrate that ExCB: a) achieves state-of-the-art results with significantly reduced resource requirements compared to previous works, b) is fully online, and therefore scalable to large datasets, and c) is stable and effective even with very small batch sizes.
Open Access
Conference paper
European Conference on Computer Vision
Chen Feng; Georgios Tzimiropoulos; Ioannis Patras;
Queen Mary University of London;
Despite the large progress in supervised learning with neural networks, there are significant challenges in obtaining high-quality, large-scale and accurately labelled datasets. In such contexts, how to learn in the presence of noisy labels has received more and more attention. Addressing this relatively intricate problem to attain competitive results predominantly involves designing mechanisms that select samples that are expected to have reliable annotations. However, these methods typically involve multiple off-the-shelf techniques, resulting in intricate structures. Furthermore, they frequently make implicit or explicit assumptions about the noise modes/ratios within the dataset. Such assumptions can compromise model robustness and limit its performance under varying noise conditions. Unlike these methods, in this work, we propose an efficient and effective framework with minimal hyperparameters that achieves SOTA results in various benchmarks. Specifically, we design an efficient and concise training framework, called NoiseBox, consisting of a subset expansion module responsible for exploring non-selected samples and a model training module to further reduce the impact of noise. Moreover, diverging from common sample selection methods based on the “small loss” mechanism, we introduce a novel sample selection method based on the neighbouring relationships and label consistency in the feature space. Without bells and whistles, such as model co-training, self-supervised pre-training and semi-supervised learning, and with robustness concerning the settings of its few hyperparameters, our method significantly surpasses previous methods on both CIFAR10/CIFAR100 with synthetic noise and real-world noisy datasets such as Red Mini-ImageNet, WebVision, Clothing1M and ANIMAL-10N.
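The neighbourhood-based selection idea can be illustrated with a simple sketch (our own simplification, not the NoiseBox implementation): keep a sample when its given label agrees with the labels of most of its nearest neighbours in feature space.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_clean_candidates(features, labels, k=10, agreement=0.6):
    """features: (n, d) embeddings; labels: (n,) possibly noisy labels.
    Returns a boolean mask of samples whose label matches at least `agreement`
    of their k nearest neighbours (both parameters are illustrative)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, idx = nn.kneighbors(features)           # idx[:, 0] is the sample itself
    neighbour_labels = labels[idx[:, 1:]]      # (n, k)
    agree = (neighbour_labels == labels[:, None]).mean(axis=1)
    return agree >= agreement
```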
Open Access
Journal article
N/A
Andrea Esuli; Claudio Gennaro; Davide Alessandro Coccomini; Fabrizio Falchi; Giuseppe Amato;
ISTI-CNR;
Open Access
Publication
N/A
Jean De Meyere; Noémie Krack
KU Leuven
Recently, the British police launched its first investigation into a case of virtual “rape” in the metaverse. This paper delves into the complex considerations that user safety and content moderation could pose through the prism of the recently adopted Digital Services Act (DSA). We first explore the current state of platforms operating metaverses. Metaverses are similar to current online platforms yet are differentiated by the use of XR technologies. Despite the low number of users on such platforms, specific issues related to the metaverse, such as the rise of disinformation or virtual sex crimes, have already been reported. This paper considers the following research questions: What legal challenges do specific metaverse platforms present in terms of user safety, and how does the DSA address these challenges? Attention will be brought to the impact of relevant obligations for user safety in metaverses. We continue our analysis by addressing the lack of risk assessment obligations for platforms operating metaverses, as they currently do not meet the threshold to be bound by these obligations under the DSA. We conclude with recommendations for policymakers on how to tackle the challenges posed by increased risks in the metaverse.
Open Access
Conference paper
International Congress towards a responsible development of Metaverse
Christos Koutlis; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
The recently developed and publicly available synthetic image generation methods and services make it possible to create extremely realistic imagery on demand, raising great risks for the integrity and safety of online information. State-of-the-art Synthetic Image Detection (SID) research has led to strong evidence on the advantages of feature extraction from foundation models. However, such extracted features mostly encapsulate high-level visual semantics instead of fine-grained details, which are more important for the SID task. On the contrary, shallow layers encode low-level visual information. In this work, we leverage the image representations extracted by intermediate Transformer blocks of CLIP’s image-encoder via a lightweight network that maps them to a learnable forgery-aware vector space capable of generalizing exceptionally well. We also employ a trainable module to incorporate the importance of each Transformer block to the final prediction. Our method is compared against the state-of-the-art by evaluating it on 20 test datasets and exhibits an average +10.6% absolute performance improvement. Notably, the best performing models require just a single epoch for training (~8 minutes). Code available at https://github.com/mever-team/rine.
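A hedged sketch of the feature-extraction side of such an approach, using Hugging Face's CLIP vision encoder to expose intermediate Transformer blocks; the checkpoint, the pooling and the downstream head are assumptions for illustration, not the released implementation (see the linked repository for that):

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def intermediate_cls_tokens(image):
    """Return the CLS token from every Transformer block of CLIP's image encoder."""
    pixels = processor(images=image, return_tensors="pt").pixel_values
    hidden = encoder(pixel_values=pixels, output_hidden_states=True).hidden_states
    # hidden: one (1, tokens, dim) tensor per block (plus the patch embeddings);
    # stacking the CLS tokens gives a (blocks, dim) matrix for a lightweight head.
    return torch.stack([h[:, 0, :] for h in hidden[1:]], dim=0).squeeze(1)

# A small trainable module would then weight these per-block features and map them
# to a forgery-aware vector used for the real-vs-synthetic decision.
```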
Open Access
Conference paper
European Conference on Computer Vision
Ioannis Kompatsiaris; John Violos; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
This paper discusses four facets of the Knowledge Distillation (KD) process for Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) architectures, particularly when executed on edge devices with constrained processing capabilities. First, we conduct a comparative analysis of the KD process between CNNs and ViT architectures, aiming to elucidate the feasibility and efficacy of employing different architectural configurations for the teacher and student, while assessing their performance and efficiency. Second, we explore the impact of varying the size of the student model on accuracy and inference speed, while maintaining a constant KD duration. Third, we examine the effects of employing higher resolution images on the accuracy, memory footprint and computational workload. Last, we examine the performance improvements obtained by fine-tuning the student model after KD to specific downstream tasks. Through empirical evaluations and analyses, this research provides AI practitioners with insights into optimal strategies for maximizing the effectiveness of the KD process on edge devices.
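For context (not taken from the paper), the response-based KD objective that such teacher-student setups build on combines a softened teacher-matching term with the usual cross-entropy; a minimal PyTorch sketch with illustrative hyperparameters:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style knowledge distillation loss.
    T softens the distributions; alpha balances soft-target and hard-label terms."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                              # rescales gradients w.r.t. the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```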
Open Access
Conference paper
Signal Processing and Communication
Hannes Fassold;
Joanneum Research;
The detection of shot boundaries (hardcuts and short dissolves), sampling structure (progressive / interlaced / pulldown) and dynamic keyframes in a video are fundamental video analysis tasks which have to be done before any further high-level analysis tasks. We present a novel algorithm which does all these analysis tasks in an unified way, by utilizing a combination of inter-frame and intra-frame measures derived from the motion field and normalized cross correlation. The algorithm runs four times faster than real-time due to sparse and selective calculation of these measures.
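To give a flavour of one inter-frame measure mentioned above (a simplified sketch, not the paper's combined algorithm), normalized cross correlation between consecutive grayscale frames can flag hardcut candidates:

```python
import numpy as np

def ncc(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized cross correlation of two grayscale frames, in [-1, 1]."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def hardcut_candidates(frames, threshold=0.5):
    """Indices where similarity to the previous frame drops sharply.
    The threshold is illustrative; the actual algorithm also uses motion-field measures."""
    return [i for i in range(1, len(frames)) if ncc(frames[i - 1], frames[i]) < threshold]
```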
Open Access
Publication
Conference on Imaging, Signal Processing and Communication
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
VISIONE is a versatile video retrieval system supporting diverse search functionalities, including free-text, similarity, and temporal searches. Its recent success in securing first place in the 2024 Video Browser Showdown (VBS) highlights its effectiveness.
Originally designed for analyzing, indexing, and searching diverse video content, VISIONE can also be adapted to images from lifelog cameras thanks to its reliance on frame-based representations and retrieval mechanisms.
In this paper, we present an overview of VISIONE’s core characteristics and the adjustments made to accommodate lifelog images. These adjustments primarily focus on enhancing result visualization within the GUI, such as grouping images by date or hour to align with lifelog dataset imagery. It’s important to note that while the GUI has been updated, the core search engine and visual content analysis components remain unchanged from the version presented at VBS 2024. Specifically, metadata such as local time, GPS coordinates, and concepts associated with images are not indexed or utilized in the system. Instead, the system relies solely on the visual content of the images, with date and time information extracted from their filenames, which are utilized exclusively within the GUI for visualization purposes.
Our objective is to evaluate the system’s performance within the Lifelog Search Challenge, emphasizing reliance on visual content analysis without additional metadata.
Open Access
Conference paper
Bergman Clement; Frédéric Precioso; Julie Tores; Léa Andolfi; Lucile Sassatelli; Magali Guaresi; Sarah Lecossais; Thierry Devars; Victor Ecrement; Virginie Julliard; Wu Hui-Yin
Inria; Institut Universitaire de France; Sorbonne Université; Université Côte d'Azur; Université Sorbonne Paris Nord
In film gender studies the concept of “male gaze” refers to the way the characters are portrayed on-screen as objects of desire rather than subjects. In this article we introduce a novel video-interpretation task to detect character objectification in films. The purpose is to reveal and quantify the usage of complex temporal patterns operated in cinema to produce the cognitive perception of objectification. We introduce the ObyGaze12 dataset, made of 1914 movie clips densely annotated by experts for objectification concepts identified in film studies and psychology. We evaluate recent vision models, show the feasibility of the task and where the challenges remain with concept bottleneck models. Our new dataset and code are made available to the community.
Open Access
Conference paper Publication
IEEE Conference on Computer Vision and Pattern Recognition
Tobias Blanke;
University of Amsterdam;
Archives have long been a key concern of academic debates about truth, memory, recording and power and are important sites for social sciences and humanities research. This has been the case for traditional archives, but these debates have accelerated with the digital transformation of archives. The proliferation of digital tools and the fast-growing increase in digital materials have created very large digitised and born-digital archives. This article investigates how new digital archives continue existing archival practices while at the same time discontinuing them. We present novel methodologies and tools for changing memory and power relations in digital archives through new ways of reassembling marginalised, non-canonical entities in digital archives. Reassembling digital archives can take advantage of the materiality and the algorithmic processuality of digital collections and reshape them to inscribe lost voices and previously ignored differences. Digital archives are not fixed and are changed with new research and political questions and are only identified through new questions. The article presents six distinct techniques and strategies to reassemble digital archives and renders these according to three different types of new digital archives. We consider both the extension of archives towards evidence that is otherwise thrown away as well as the provision of new intensive, non-discriminatory viewpoints on existing collections.
Open Access
Journal article
N/A
Adrian Popescu; Bogdan Ionescu; Cristian Stanciu; Giorgios Kordopatis-Zilos; Luca Cuccovillo; Roberto Caldelli; Symeon Papadopoulos
CEA; CERTH - Center for Research and Technology Hellas; Czech Technical University in Prague; Fraunhofer IDMT; Mercatorum University; University Politehnica of Bucharest
Front matter of the proceedings of the 3rd ACM International Workshop on Multimedia AI against Disinformation, held in Phuket (Thailand) on June 10th, 2024.
The full proceedings are available online at https://dl.acm.org/doi/proceedings/10.1145/3643491.
Open Access
Book section
ACM Association for Computing Machinery
Adrian Popescu; Bogdan Ionescu; Cristian Stanciu; Giorgios Kordopatis-Zilos; Luca Cuccovillo; Roberto Caldelli; Symeon Papadopoulos
CEA; CERTH - Center for Research and Technology Hellas; Czech Technical University in Prague; Fraunhofer IDMT; Mercatorum University; University Politehnica of Bucharest
Synthetic media generation and manipulation have seen rapid advancements in recent years, making it increasingly easy to create multimedia content that is indistinguishable to the human observer. Moreover, generated content can be used maliciously by individuals and organizations in order to spread disinformation, posing a significant threat to society and democracy. Hence, there is an urgent need for AI tools geared towards facilitating a timely and effective media verification process. The MAD’24 workshop seeks to bring together people with diverse backgrounds who are dedicated to combating disinformation in multimedia through the means of AI, by fostering an environment for exploring innovative ideas and sharing experiences. The research areas of interest encompass the identification of manipulated or generated content, along with the investigation of the dissemination of disinformation and its societal repercussions. Recognizing the significance of multimedia, the workshop emphasizes the joint analysis of various modalities within content, as verification can be improved by aggregating multiple forms of content.
Open Access
Conference paper
ACM on Multimedia Retrieval
Evlampios Apostolidis; Konstantinos Tsigos; Spyridon Baxevanakis; Symeon Papadopoulos; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas
In this paper we propose a new framework for evaluating the performance of explanation methods on the decisions of a deepfake detector. This framework assesses the ability of an explanation method to spot the regions of a fake image with the biggest influence on the decision of the deepfake detector, by examining the extent to which these regions can be modified through a set of adversarial attacks, in order to flip the detector’s prediction or reduce its initial prediction; we anticipate a larger drop in deepfake detection accuracy and prediction, for methods that spot these regions more accurately. Based on this framework, we conduct a comparative study using a state-of-the-art model for deepfake detection that has been trained on the FaceForensics++ dataset, and five explanation methods from the literature. The findings of our quantitative and qualitative evaluations document the advanced performance of the LIME explanation method against the other compared ones, and indicate this method as the most appropriate for explaining the decisions of the utilized deepfake detector.
Open Access
Conference paper
ACM on Multimedia Retrieval
Evlampios Apostolidis; Ioannis Kontostathis; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas
In this paper we introduce a new dataset for 360-degree video summarization: the transformation of 360-degree video content to concise 2D-video summaries that can be consumed via traditional devices, such as TV sets and smartphones. The dataset includes ground-truth human-generated summaries, that can be used for training and objectively evaluating 360-degree video summarization methods. Using this dataset, we train and assess two state-of-the-art summarization methods that were originally proposed for 2D-video summarization, to serve as a baseline for future comparisons with summarization methods that are specifically tailored to 360-degree video. Finally, we present an interactive tool that was developed to facilitate the data annotation process and can assist other annotation activities that rely on video fragment selection.
Open Access
Conference paper
Davide Alessandro Coccomini; Fabrizio Falchi; Giorgios Kordopatis-Zilos; Giuseppe Amato; Roberto Caldelli; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas; Czech Technical University in Prague; ISTI-CNR
In this paper, we present MINTIME, a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., averaging or max operations, or only focus on the largest face in the video, our proposed method learns to accurately detect spatio-temporal inconsistencies across multiple identities in a video through a Spatio-Temporal Transformer combined with a Convolutional Neural Network backbone. This is achieved through an Identity-aware Attention mechanism that applies a masking operation on the face sequence to process each identity independently, which enables effective video-level aggregation. Furthermore, our system incorporates two novel embedding schemes: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of the face sequences of each identity, and (ii) the Size Embedding, which captures the relative sizes of the faces to the video frames. MINTIME achieves state-of-the-art performance on the ForgeryNet dataset, with a remarkable improvement of up to 14% AUC in videos containing multiple people. Moreover, it demonstrates very robust generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at: https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection
Open Access
Journal article
N/A
Cathal Gurrin; Fabio Carrara; Florian Spiess; Jakub Lokoč; Klaus Schoeffmann; Ladislav Peška; Loris Sauter; Luca Rossetto; Lucia Vadicamo; Minh-Triet Tran; Nico Hezel; Nicola Messina; Rahel Arnold; Sebastian Lubos; Stefanos Vrochidis; Thao-Nhu Nguyen; Werner Bailer; Xingham Li; Zhixin Ma
CERTH - Center for Research and Technology Hellas; Charles University; Dublin City University; HTW Berlin; Institute of Information Science and Technologies; ISTI-CNR; Joanneum Research; University of Basel; University of Klagenfurt; University of Zurich; Vietnam National University; Wuhan University
This paper conducts a thorough examination of the 12th Video Browser Showdown (VBS) competition, a well-established international benchmarking campaign for interactive video search systems.
The annual VBS competition has witnessed a steep rise in the popularity of multimodal embedding-based approaches in interactive video retrieval. Most of the thirteen systems participating in VBS 2023 utilized a CLIP-based cross-modal search model, allowing the specification of free-form text queries to search visual content. This shared emphasis on joint embedding models contributed to balanced performance across various teams. However, the distinguishing factors of the top-performing teams included the adept combination of multiple models and search modes, along with the capabilities of interactive interfaces to facilitate and refine the search process.
Our work provides an overview of the state-of-the-art approaches employed by the participating systems and conducts a thorough analysis of their search logs, which record user interactions and results of their queries for each task. Our comprehensive examination of the VBS competition offers assessments of the effectiveness of the retrieval models, browsing efficiency, and user query patterns. Additionally, it provides valuable insights into the evolving landscape of interactive video retrieval and its future challenges.
Open Access
Journal article
IEEE Access
Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
Being able to express broad families of equivariant or invariant attributed graph functions is a popular measuring stick of whether graph neural networks should be employed in practical applications. However, it is equally important to find deep local minima of losses (i.e., produce outputs with much smaller loss values compared to other minima), even when architectures cannot express global minima. In this work we introduce the architectural property of attracting optimization trajectories to local minima as a means of achieving smaller loss values. We take first steps in satisfying this property for losses defined over attributed undirected unweighted graphs with an architecture called universal local attractor (ULA). This refines each dimension of end-to-end-trained node feature embeddings based on graph structure to track the optimization trajectories of losses satisfying some mild conditions. The refined dimensions are then linearly pooled to create predictions. We experiment on 11 tasks, from node classification to clique detection, on which ULA is comparable with or outperforms popular alternatives of similar or greater theoretical expressive power.
Open Access
Publication
N/A
Albin Soutif-Cormerais; Andrew Bagdanov; Joost van de Weijer; Simone Magistri; Tommaso Trinci
Computer Vision Center; University of Florence
Exemplar-Free Class Incremental Learning (EFCIL) aims to learn from a sequence of tasks without having access to previous task data. In this paper, we consider the challenging Cold Start scenario in which insufficient data is available in the first task to learn a high-quality backbone. This is especially challenging for EFCIL since it requires high plasticity, which results in feature drift which is difficult to compensate for in the exemplar-free setting. To address this problem, we propose a simple and effective approach that consolidates feature representations by regularizing drift in directions highly relevant to previous tasks and employs prototypes to reduce task-recency bias. Our method, called Elastic Feature Consolidation (EFC), exploits a tractable second-order approximation of feature drift based on an Empirical Feature Matrix (EFM). The EFM induces a pseudo-metric in feature space which we use to regularize feature drift in important directions and to update Gaussian prototypes used in a novel asymmetric cross entropy loss which effectively balances prototype rehearsal with data from new tasks. Experimental results on CIFAR-100, Tiny-ImageNet, ImageNet-Subset and ImageNet-1K demonstrate that Elastic Feature Consolidation is better able to learn new tasks by maintaining model plasticity and significantly outperform the state-of-the-art.
Open Access
Conference paper
N/A
Antonios Liapis; Georgios N. Yannakakis; Marvin Zammit;
University of Malta
The recent advances in language-based generative models have paved the way for the orchestration of multiple generators of different artefact types (text, image, audio, etc.) into one system. Presently, many open-source pre-trained models combine text with other modalities, thus enabling shared vector embeddings to be compared across different generators. Within this context we propose a novel approach to handle multimodal creative tasks using Quality Diversity evolution. Our contribution is a variation of the MAP-Elites algorithm, MAP-Elites with Transverse Assessment (MEliTA), which is tailored for multimodal creative tasks and leverages deep learned models that assess coherence across modalities. MEliTA decouples the artefacts’ modalities and promotes cross-pollination between elites. As a test bed for this algorithm, we generate text descriptions and cover images for a hypothetical video game and assign each artefact a unique modality-specific behavioural characteristic. Results indicate that MEliTA can improve text-to-image mappings within the solution space, compared to a baseline MAP-Elites algorithm that strictly treats each image-text pair as one solution. Our approach represents a significant step forward in multimodal bottom-up orchestration and lays the groundwork for more complex systems coordinating multimodal creative agents in the future.
Open Access
Conference paper
N/A
Hannes Fassold;
Joanneum Research;
Deploying Large Language Models (LLMs) on mobile devices makes all the capabilities of natural language processing available on the device. An important use case of LLMs is question answering, which can provide accurate and contextually relevant answers to a wide array of user queries. We describe how we managed to port state-of-the-art LLMs to mobile devices, enabling them to operate natively on the device. We employ the llama.cpp framework, a flexible and self-contained C++ framework for LLM inference. We selected a 6-bit quantized version of the Orca-Mini-3B model with 3 billion parameters and present the correct prompt format for this model. Experimental results show that LLM inference runs at interactive speed on a Galaxy S21 smartphone and that the model delivers high-quality answers to user queries related to questions from different subjects like politics, geography or history.
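For readers who want to try the setup on a desktop before porting it to a phone, a minimal sketch using the llama.cpp Python bindings with a quantized GGUF model; the file name and the Orca-Mini instruction template below are assumptions based on the model card, not taken from the paper:

```python
from llama_cpp import Llama

# Assumed local path to a 6-bit quantized Orca-Mini-3B model in GGUF format.
llm = Llama(model_path="orca-mini-3b.q6_K.gguf", n_ctx=2048)

def ask(question: str) -> str:
    # Orca-Mini-style prompt template (assumed; check the model card for the exact format).
    prompt = (
        "### System:\nYou are a helpful assistant that answers questions accurately.\n\n"
        f"### User:\n{question}\n\n### Response:\n"
    )
    out = llm(prompt, max_tokens=256, stop=["### User:"])
    return out["choices"][0]["text"].strip()

print(ask("Which river flows through Vienna?"))
```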
Open Access
Conference paper
N/A
Connor Richard; Lucia Vadicamo
ISTI-CNR; University of St. Andrews
Dimensionality reduction techniques map values from a high dimensional space to one with a lower dimension. The result is a space which requires less physical memory and has a faster distance calculation. These techniques are widely used where required properties of the reduced-dimension space give an acceptable accuracy with respect to the original space. Many such transforms have been described. They have been classified in two main groups: linear and topological. Linear methods such as Principal Component Analysis (PCA) and Random Projection (RP) define matrix-based transforms into a lower dimension of Euclidean space. Topological methods such as Multidimensional Scaling (MDS) attempt to preserve higher-level aspects such as the nearest-neighbour relation, and some may be applied to non-Euclidean spaces. Here, we introduce nSimplex Zen, a novel topological method of reducing dimensionality. Like MDS, it relies only upon pairwise distances measured in the original space. The use of distances, rather than coordinates, allows the technique to be applied to both Euclidean and other Hilbert spaces, including those governed by Cosine, Jensen–Shannon and Quadratic Form distances. We show that in almost all cases, due to geometric properties of high-dimensional spaces, our new technique gives better properties than others, especially with reduction to very low dimensions.
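nSimplex Zen itself does not fit in a short snippet, but the linear baselines it is compared against are easy to illustrate; a hedged sketch of Gaussian random projection with scikit-learn, on synthetic data and with illustrative dimensions:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))            # 1000 vectors in a 512-dimensional space

rp = GaussianRandomProjection(n_components=32, random_state=0)
X_low = rp.fit_transform(X)                 # matrix-based transform to 32 dimensions

# Distances in the reduced space approximate the original Euclidean distances,
# which is what makes the reduced representation usable for faster search.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_low[0] - X_low[1]))
```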
Open Access
Journal article
ACM Transactions on Knowledge Discovery from Data
Bogdan Ionescu; Mihai Gabriel Constantin
University Politehnica of Bucharest
Video memorability is one of the vital aspects of subjective multimedia perception and, as such, is closely and thoroughly studied in the computer vision literature. This paper presents the methods proposed by AIMultimediaLab for the generalization subtask of the 2023 edition of the Predicting Video Memorability task. We explore several methods for augmenting the training process for a video Vision Transformer network, aiming to increase the number of hard-to-predict samples in the training set in order to increase the robustness of the targeted AI model. Starting from our previous works, we analyze several visual features that define “hard-to-predict” samples, and based on these features, we augment the training data of our models to target those specific videos that pose problems for memorability prediction.
Open Access
Conference paper
MediaEval
Claudio Vairo; Fabio Carrara; Jakub Lokoč; Kai Uwe Barthel; Klaus Schoeffmann; Konstantin Schall; Ladislav Peška; Lucia Vadicamo; Werner Bailer;
HTW Berlin; ISTI-CNR; Joanneum Research; University of Klagenfurt
CLIP-based text-to-image retrieval has proven to be very effective at the interactive video retrieval competition Video Browser Showdown 2022, where all three top-scoring teams had implemented a variant of a CLIP model in their system. Since the performance of these three systems was quite close, this post-evaluation was designed to get better insights on the differences of the systems and compare the CLIP-based text-query retrieval engines by introducing slight modifications to the original competition settings. An extended analysis of the overall results and the retrieval performance of all systems’ functionalities shows that a strong text retrieval model certainly helps, but has to be coupled with extensive browsing capabilities and other query-modalities to consistently solve known-item-search tasks in a large scale video database.
Open Access
Journal article
International Journal of Multimedia Information Retrieval
Alejandro Moreo; Fabrizio Sebastiani; Pablo González
ISTI-CNR; University of Oviedo
Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data, and is of special interest when the labelled data on which the predictor has been trained and the unlabelled data are not IID, i.e., suffer from dataset shift. To date, quantification methods have mostly been tested only on a special case of dataset shift, i.e., prior probability shift; the relationship between quantification and other types of dataset shift remains, by and large, unexplored. In this work we carry out an experimental analysis of how current quantification algorithms behave under different types of dataset shift, in order to identify limitations of current approaches and hopefully pave the way for the development of more broadly applicable methods. We do this by proposing a fine-grained taxonomy of types of dataset shift, by establishing protocols for the generation of datasets affected by these types of shift, and by testing existing quantification methods on the datasets thus generated. One finding that results from this investigation is that many existing quantification methods that had been found robust to prior probability shift are not necessarily robust to other types of dataset shift. A second finding is that no existing quantification method seems to be robust enough to dealing with all the types of dataset shift we simulate in our experiments. The code needed to reproduce all our experiments is publicly available at https://github.com/pglez82/quant_datasetshift
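As background (not the paper's evaluation protocols), the simplest quantification baselines that such studies include are classify-and-count and its adjusted variant; a short sketch with assumed inputs:

```python
import numpy as np

def classify_and_count(pred_labels: np.ndarray) -> float:
    """Naive prevalence estimate for the positive class: fraction of positive predictions."""
    return float(pred_labels.mean())

def adjusted_classify_and_count(pred_labels: np.ndarray, tpr: float, fpr: float) -> float:
    """ACC corrects CC using the classifier's true/false positive rates,
    estimated beforehand on labelled validation data."""
    cc = classify_and_count(pred_labels)
    if tpr == fpr:                       # degenerate classifier, correction undefined
        return cc
    return float(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))
```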
Open Access
Journal article
Data Mining and Knowledge Discovery
Christos Tzelepis; Georgios Tzimiropoulos; Ioannis Patras; Stella Bounareli; Vasileios Argyriou;
Kingston University London; Queen Mary University of London; University of Nottingham
In this paper, we present our framework for neural face/head reenactment whose goal is to transfer the 3D head orientation and expression of a target face to a source face. Previous methods focus on learning embedding networks for identity and head pose/expression disentanglement which proves to be a rather hard task, degrading the quality of the generated images. We take a different approach, bypassing the training of such networks, by using (fine-tuned) pre-trained GANs which have been shown capable of producing high-quality facial images. Because GANs are characterized by weak controllability, the core of our approach is a method to discover which directions in latent GAN space are responsible for controlling head pose and expression variations. We present a simple pipeline to learn such directions with the aid of a 3D shape model which, by construction, inherently captures disentangled directions for head pose, identity, and expression. Moreover, we show that by embedding real images in the GAN latent space, our method can be successfully used for the reenactment of real-world faces. Our method features several favorable properties including using a single source image (one-shot) and enabling cross-person reenactment. Extensive qualitative and quantitative results show that our approach typically produces reenacted faces of notably higher quality than those produced by state-of-the-art methods for the standard benchmarks of VoxCeleb1 & 2.
Open Access
Journal article
Ioannis Patras; Zheng Gao;
Queen Mary University of London;
Self-supervised pre-training has been proved to be effective in learning transferable representations that benefit various visual tasks. This paper asks this question: can self-supervised pre-training learn general facial representations for various facial analysis tasks? Recent efforts toward this goal are limited to treating each face image as a whole, i.e., learning consistent facial representations at the image-level, which overlooks the consistency of local facial representations (i.e., facial regions like eyes, nose, etc). In this work, we make a first attempt to propose a novel self-supervised facial representation learning framework to learn consistent global and local facial representations, Facial Region Awareness (FRA). Specifically, we explicitly enforce the consistency of facial regions by matching the local facial representations across views, which are extracted with learned heatmaps highlighting the facial regions. Inspired by the mask prediction in supervised semantic segmentation, we obtain the heatmaps via cosine similarity between the per-pixel projection of feature maps and facial mask embeddings computed from learnable positional embeddings, which leverage the attention mechanism to globally look up the facial image for facial regions. To learn such heatmaps, we formulate the learning of facial mask embeddings as a deep clustering problem by assigning the pixel features from the feature maps to them. The transfer learning results on facial classification and regression tasks show that our FRA outperforms previous pre-trained models and more importantly, using ResNet as the unified backbone for various tasks, our FRA achieves comparable or even better performance compared with SOTA methods in facial analysis tasks.
Open Access
Conference paper
IEEE Conference on Computer Vision and Pattern Recognition
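To make the region-heatmap mechanism in the preceding abstract more concrete, here is a minimal sketch that computes soft heatmaps as cosine similarities between per-pixel feature projections and a set of learnable mask embeddings. Shapes and the softmax normalization are assumptions for illustration; this is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def region_heatmaps(feat_map: torch.Tensor, mask_emb: torch.Tensor) -> torch.Tensor:
    """feat_map: (B, C, H, W) per-pixel feature projections; mask_emb: (K, C) facial mask
    embeddings. Returns soft region heatmaps of shape (B, K, H, W)."""
    B, C, H, W = feat_map.shape
    pix = F.normalize(feat_map.flatten(2).transpose(1, 2), dim=-1)    # (B, H*W, C)
    emb = F.normalize(mask_emb, dim=-1)                               # (K, C)
    sim = pix @ emb.t()                                               # cosine similarity per pixel/region
    return sim.softmax(dim=-1).transpose(1, 2).reshape(B, -1, H, W)   # one soft map per region

heat = region_heatmaps(torch.randn(2, 128, 14, 14), torch.randn(8, 128))
```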
Aleksandra Kuczerawy; Lidia Dutkiewicz; Noémie Krack; Peggy Valcke
KU Leuven
This chapter discusses how AI technologies permeate the media sector. It sketches opportunities and benefits of the use of AI in media content gathering and production, in media content distribution, in fact-checking and content moderation. The chapter then zooms in on ethical and legal risks raised by AI-driven media applications: lack of data availability, poor data quality and bias in training datasets, lack of transparency, risks for the right to freedom of expression, threats to media freedom and pluralism online, and threats to media independence. Finally, the chapter introduces the relevant elements of the EU legal framework which aim to mitigate these risks, such as the Digital Services Act, the European Media Freedom Act proposal and the AI Act proposal.
Open Access
Book section
Cambridge handbook on the law, ethics and policy of Artificial Intelligence
Konstantinos Gkrispanis; Nikolaos Gkalelis; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas
Face detectors are becoming a crucial component of many applications, including surveillance, that often have to run on edge devices with limited processing power and memory. Therefore, there’s a pressing demand for compact face detection models that can function efficiently across resource-constrained devices. Over recent years, network pruning techniques have attracted a lot of attention from researchers. These methods haven’t been well examined in the context of face detectors, despite their expanding popularity. In this paper, we implement filter pruning on two already small and compact face detectors, named EXTD (Extremely Tiny Face Detector) and EResFD (Efficient ResNet Face Detector). The main pruning algorithm that we utilize is Filter Pruning via Geometric Median (FPGM), combined with the Soft Filter Pruning (SFP) iterative procedure. We also apply L1 Norm pruning, as a baseline to compare with the proposed approach. The experimental evaluation on the WIDER FACE dataset indicates that the proposed approach has the potential to further reduce the model size of already lightweight face detectors, with limited accuracy loss, or even with small accuracy gain for low pruning rates.
Open Access
Conference paper
Winter Conference on Applications of Computer Vision
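The FPGM criterion used in the preceding entry can be summarized in a few lines: filters whose total distance to all other filters in the layer is smallest lie closest to the geometric median and are considered redundant. The sketch below is a simplified, single-layer illustration of that ranking, not the authors' pruning code.

```python
import torch

def fpgm_prune_mask(conv_weight: torch.Tensor, prune_ratio: float) -> torch.Tensor:
    """conv_weight: (out_channels, in_channels, kH, kW). Returns a boolean keep-mask over filters."""
    n = conv_weight.shape[0]
    flat = conv_weight.reshape(n, -1)              # one row per filter
    dists = torch.cdist(flat, flat, p=2)           # pairwise Euclidean distances between filters
    total_dist = dists.sum(dim=1)                  # small total distance = close to the geometric median
    prune_idx = torch.argsort(total_dist)[: int(prune_ratio * n)]
    keep = torch.ones(n, dtype=torch.bool)
    keep[prune_idx] = False
    return keep

mask = fpgm_prune_mask(torch.randn(64, 32, 3, 3), prune_ratio=0.3)   # keep-mask at a 30% pruning rate
```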
Evlampios Apostolidis; Konstantinos Apostolidis; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas
This paper presents a web-based tool that facilitates the production of tailored summaries for online sharing on social media. Through an interactive user interface, it supports a “one-click” video summarization process. Based on the integrated AI models for video summarization and aspect ratio transformation, it facilitates the generation of multiple summaries of a full-length video according to the needs of target platforms with regard to the video’s length and aspect ratio.
Open Access
Conference paper
Conference on Multimedia Modeling
Evlampios Apostolidis; Ioannis Kontostathis; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas
In this work, we present an integrated system for spatiotemporal summarization of 360-degree videos. The video summary production mainly involves the detection of salient events and their synopsis into a concise summary. The analysis relies on state-of-the-art methods for saliency detection in 360-degree video (ATSal and SST-Sal) and video summarization (CA-SUM). It also contains a mechanism that classifies a 360-degree video based on whether a static or moving camera was used during recording and decides which saliency detection method will be used, as well as a 2D video production component that is responsible for creating a conventional 2D video containing the salient events of the 360-degree video. Quantitative evaluations using two datasets for 360-degree video saliency detection (VR-EyeTracking, Sports-360) show the accuracy and positive impact of the developed decision mechanism, and justify our choice to use two different methods for detecting the salient events. A qualitative analysis using content from these datasets gives further insights about the functionality of the decision mechanism, shows the pros and cons of each saliency detection method and demonstrates the advanced performance of the trained summarization method against a more conventional approach.
Open Access
Conference paper
Conference on Multimedia Modeling
Bogdan Ionescu; Hannes Fassold; Mihai Dogariu; Werner Bailer;
Joanneum Research; University Politehnica of Bucharest
Open Access
Conference paper
Multimedia Modeling Conference
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
In this paper, we introduce the fifth release of VISIONE, an advanced video retrieval system offering diverse search functionalities. The user can search for a target video using textual prompts, drawing objects and colors appearing in the target scenes in a canvas, or images as query examples to search for video keyframes with similar content.
Compared to the previous version of our system, which was runner-up at VBS 2023, the forthcoming release, set to participate in VBS 2024, showcases a refined user interface that enhances its usability and updated AI models for more effective video content analysis.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Federico Becattini; Francesco Marchetti; Lorenzo Seidenari;
Università degli Studi di Firenze; University of Florence;
Effective modeling of human interactions is of utmost importance when forecasting behaviors such as future trajectories. Each individual, through its motion, influences surrounding agents, since everyone obeys unwritten social rules such as collision avoidance or group following. In this paper we model such interactions, which constantly evolve through time, by looking at the problem from an algorithmic point of view, i.e., as a data manipulation task. We present a neural network based on an end-to-end trainable working memory, which acts as an external storage where information about each agent can be continuously written, updated and recalled. We show that our method is capable of learning explainable cause-effect relationships between motions of different agents, obtaining state-of-the-art results on multiple trajectory forecasting datasets.
Open Access
Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence
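A speculative miniature of the external working memory described in the preceding abstract is given below: per-agent states are written to memory slots and recalled via soft attention over the stored keys. Slot count, dimensions, and the write policy are placeholders.

```python
import torch
import torch.nn.functional as F

class AgentMemory:
    """Toy key-value memory: write agent states, read them back with attention."""
    def __init__(self, slots: int, dim: int):
        self.keys = torch.zeros(slots, dim)
        self.values = torch.zeros(slots, dim)
        self.cursor = 0

    def write(self, key: torch.Tensor, value: torch.Tensor) -> None:
        slot = self.cursor % self.keys.shape[0]      # overwrite the oldest slot
        self.keys[slot], self.values[slot] = key, value
        self.cursor += 1

    def read(self, query: torch.Tensor) -> torch.Tensor:
        attn = F.softmax(self.keys @ query, dim=0)   # soft attention over stored agent states
        return attn @ self.values

mem = AgentMemory(slots=16, dim=32)
mem.write(torch.randn(32), torch.randn(32))
recalled = mem.read(torch.randn(32))
```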
Andrea Esuli; Davide Alessandro Coccomini; Fabrizio Falchi; Nicola Messina;
Institute of Information Science and Technologies
With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage.
In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by the recent Transformer networks able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while keeping the overall computational complexity bounded at inference time.
With respect to other approaches in the challenge leaderboard, we can achieve remarkable improvements over the previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources.
Open Access
Journal article
Multimedia Tools and Applications
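The two-model cascade mentioned in the preceding abstract follows a common retrieval pattern: a cheap dot-product stage prunes the caption pool, and a heavier scorer re-ranks the survivors. The sketch below illustrates that pattern only; cross_scorer is a hypothetical stand-in for the second, more expensive model.

```python
import torch

@torch.no_grad()
def cascade_rank(image_emb: torch.Tensor, caption_embs: torch.Tensor, cross_scorer, keep: int = 100):
    """image_emb: (D,) query image embedding; caption_embs: (N, D) caption embeddings.
    cross_scorer(i) -> float is assumed to run the expensive model on caption i."""
    coarse = caption_embs @ image_emb                                   # stage 1: fast similarity
    shortlist = torch.topk(coarse, k=min(keep, caption_embs.shape[0])).indices
    fine = torch.tensor([cross_scorer(int(i)) for i in shortlist])      # stage 2: precise re-scoring
    return shortlist[torch.argsort(fine, descending=True)].tolist()     # captions ranked by relevance

ranking = cascade_rank(torch.randn(256), torch.randn(1000, 256), cross_scorer=lambda i: float(-i))
```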
Adrian Popescu; Bertrand Delezoide; Céline Hudelot; David Picard; Eva Feillet; Grégoire Petit; Michael Soumm
CEA; Université Gustave Eiffel; Université Paris-Saclay;
Class-Incremental Learning (CIL) aims to build classification models from data streams. At each step of the CIL process, new classes must be integrated into the model. Due to catastrophic forgetting, CIL is particularly challenging when examples from past classes cannot be stored, the case on which we focus here. To date, most approaches are based exclusively on the target dataset of the CIL process. However, the use of models pre-trained in a self-supervised way on large amounts of data has recently gained momentum.
The initial model of the CIL process may only use the first batch of the target dataset, or also use pre-trained weights obtained on an auxiliary dataset. The choice between these two initial learning strategies can significantly influence the performance of the incremental learning model, but has not yet been studied in depth. Performance is also influenced by the choice of the CIL algorithm, the neural architecture, the nature of the target task, the distribution of classes in the stream and the number of examples available for learning.
We conduct a comprehensive experimental study to assess the roles of these factors. We present a statistical analysis framework that quantifies the relative contribution of each factor to incremental performance. Our main finding is that the initial training strategy is the dominant factor influencing the average incremental accuracy, but that the choice of CIL algorithm is more important in preventing forgetting.
Based on this analysis, we propose practical recommendations for choosing the right initial training strategy for a given incremental learning use case. These recommendations are intended to facilitate the practical deployment of incremental learning.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Davide Pucci; Federico Becattini;
Università degli Studi di Firenze; University of Florence;
Action understanding is a fundamental computer vision branch for several applications, ranging from surveillance to robotics. Most works deal with localizing and recognizing the action in both time and space, without providing a characterization of its evolution. Recent works have addressed the prediction of action progress, which is an estimate of how far the action has advanced as it is performed. In this paper, we propose to predict action progress using a different modality compared to previous methods: body joints. Human body joints carry very precise information about human poses, which we believe are a much more lightweight and effective way of characterizing actions and therefore their execution. Action progress can in fact be estimated based on an understanding of how key poses follow each other during the development of an activity. We show how an action progress prediction model can exploit body joints and integrate it with modules providing keypoint and action information in order to be run directly from raw pixels. The proposed method is experimentally validated on the Penn Action Dataset.
Open Access
Journal article
MDPI
Alberto Messina; Angelo Bruccoleri; Fulvio Negro; Maurizio Montagnuolo; Roberto Iacoviello;
RAI;
Knowledge about the presence of people in a video is a valuable source of information in many applications, such as video annotation, retrieval and summarisation. The contribution of this paper goes in the direction of demonstrating how AI-based face processing technologies can be profitably used to perform video annotation of television content. To validate our vision, we developed the Face Management Framework (FMF), which implements an end-to-end pipeline for face analysis and content annotation based on few-shot or zero-shot face embedding extraction models. The results of the test campaign of the system show that the key performance indicators that we defined were exceeded by a wide margin, demonstrating how media workflows could greatly benefit from the tool and the efficiency improvements it brings.
Open Access
Conference paper
International Conference on Big Data
Alberto Del Bimbo; Andrea Ciamarra; Federico Becattini; Lorenzo Seidenari; Roberto Caldelli;
Mercatorum University; University of Florence;
The ever-increasing use of synthetically generated content in different sectors of our everyday life, not least media information, poses a strong need for deepfake detection tools in order to avoid the proliferation of altered messages. The process of identifying manipulated content, in particular images and videos, is basically performed by looking for the presence of inconsistencies and/or anomalies specifically due to the fake generation process. Different techniques exist in the scientific literature that exploit diverse ad-hoc features in order to highlight possible modifications. In this paper, we propose to investigate how deepfake creation can impact the characteristics that the whole scene had at the time of acquisition. In particular, when an image (video) is captured, the overall geometry of the scene (e.g. surfaces) and the acquisition process (e.g. illumination) determine a univocal environment that is directly represented by the image pixel values; all these intrinsic relations are possibly changed by the deepfake generation process. By resorting to the analysis of the characteristics of the surfaces depicted in the image, it is possible to obtain a descriptor usable to train a CNN for deepfake detection: we refer to such an approach as SurFake. Experimental results carried out on the FaceForensics++ (FF++) dataset for different kinds of deepfake forgeries and diverse deep learning models confirm that such a feature can be adopted to discriminate between pristine and altered images; furthermore, experiments show that it can also be combined with visual data to provide a certain improvement in terms of detection accuracy.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Federico Becattini; Lorenzo Seidenari; Luca Cultrera; Pietro Pala
Università degli Studi di Firenze; University of Florence;
Conditional Imitation learning is a common and effective approach to train autonomous driving agents. However, two issues limit the full potential of this approach: (i) the inertia problem, a special case of causal confusion where the agent mistakenly correlates low speed with no acceleration, and (ii) low correlation between offline and online performance due to the accumulation of small errors that brings the agent in a previously unseen state. Both issues are critical for state-aware models, yet informing the driving agent of its internal state as well as the state of the environment is of crucial importance. In this article we propose a multi-task learning agent based on a multi-stage vision transformer with state token propagation. We feed the state of the vehicle along with the representation of the environment as a special token of the transformer and propagate it throughout the network. This allows us to tackle the aforementioned issues from different angles: guiding the driving policy with learned stop/go information, performing data augmentation directly on the state of the vehicle and visually explaining the model’s decisions. We report a drastic decrease in inertia and a high correlation between offline and online metrics.
Open Access
Journal article
IEEE Transactions on Intelligent Vehicles
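As a hedged sketch of the state-token idea discussed in the preceding abstract, the snippet below embeds the vehicle state, prepends it to the visual token sequence, and lets a transformer encoder propagate it through its layers. Dimensions and the state contents (e.g. speed, steering) are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class StateTokenEncoder(nn.Module):
    def __init__(self, dim: int = 256, state_dim: int = 4, layers: int = 4, heads: int = 8):
        super().__init__()
        self.state_proj = nn.Linear(state_dim, dim)                      # embed the vehicle state
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)

    def forward(self, visual_tokens: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        """visual_tokens: (B, N, dim); state: (B, state_dim). Returns (B, N+1, dim)."""
        state_token = self.state_proj(state).unsqueeze(1)                # (B, 1, dim)
        return self.encoder(torch.cat([state_token, visual_tokens], dim=1))

out = StateTokenEncoder()(torch.randn(2, 49, 256), torch.randn(2, 4))
```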
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
This paper presents a revised version of the VISIONE video retrieval system, which offers a wide range of search functionalities, including free text search, spatial color and object search, visual and semantic similarity search, and temporal search. The system is designed to ensure scalability using advanced indexing techniques and effectiveness using cutting-edge Artificial Intelligence technology for visual content analysis. VISIONE was the runner-up in the 2023 Video Browser Showdown competition, demonstrating its comprehensive video retrieval capabilities. In this paper, we detail the improvements made to the search and browsing interface to enhance its usability for non-expert users.
A demonstration video of our system with the restyled interface, showcasing its capabilities on over 2,300 hours of diverse video content, is available online at https://youtu.be/srD3TCUkMSg.
Open Access
Conference paper
Conference on Content-based Multimedia Indexing
Andrei Cosmin Jitaru; Bogdan Ionescu; Mihai Dogariu; Mihai Gabriel Constantin
University Politehnica of Bucharest
Memorability is a critical aspect of human cognition that has been studied extensively in various fields, including psychology, education, and computer vision. The ability to remember information and experiences over time is essential for learning, decision-making, and creating lasting impressions. While the number of computer vision works that attempt to predict the memorability score of videos has recently seen a significant boost, thanks to several benchmarking tasks and datasets, some questions related to the performance of automated systems on certain types of videos are still largely unexplored. Given this, we are interested in discerning what makes a video sample easy or hard to classify or predict from a memorability standpoint. In this paper, we use a large set of runs, created and submitted by the participants to the MediaEval Predicting Video Memorability task, and, using their results and a set of visual, object, and annotator-based features and analyses, we attempt to find and define common traits that make the memorability scores of videos hard or easy to predict.
Open Access
Conference paper
Conference on Content-based Multimedia Indexing
Ambrish Rawat; Anisa Halimi; Nathalie Baracaldo; Swanand Kadhe;
IBM Research;
Training large language models (LLMs) is a costly endeavour in terms of time and computational resources. The large amount of training data used during the unsupervised pre-training phase makes it difficult to verify all data and, unfortunately, undesirable data may be ingested during training. Re-training from scratch is impractical and has led to the creation of the unlearning discipline, where models are modified to "unlearn" undesirable information without retraining. However, any modification can alter the behaviour of LLMs, especially on key dimensions such as fairness. This is the first work that examines this interplay between unlearning and fairness for LLMs. In particular, we focus on a popular unlearning framework known as SISA [Bourtoule et al., 2021], which creates an ensemble of models trained on disjoint shards. We evaluate the performance-fairness trade-off for SISA, and empirically demonstrate that SISA can indeed reduce fairness in LLMs. To remedy this, we propose post-processing bias mitigation techniques for ensemble models produced by SISA. We adapt the post-processing fairness improvement technique from [Hardt et al., 2016] to design three methods that can handle model ensembles, and prove that one of the methods is an optimal fair predictor for an ensemble of models. Through experimental results, we demonstrate the efficacy of our post-processing framework, called FairSISA.
Open Access
Conference paper
Socially Responsible Language Modelling Research
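For readers unfamiliar with the SISA setup that the preceding entry builds on, the toy sketch below shows its core structure: disjoint shards, one model per shard, and aggregated predictions, so that unlearning a sample only requires retraining the shard that contained it. The fairness post-processing proposed in the paper is not reproduced here; all interfaces are placeholders.

```python
from typing import Callable, List, Sequence

def train_sisa(shards: Sequence, train_fn: Callable) -> List:
    """shards: disjoint subsets of the training data; train_fn: shard -> fitted model."""
    return [train_fn(shard) for shard in shards]

def sisa_predict(models: List, x, predict_fn: Callable) -> float:
    """Average the per-shard scores; a fairness-aware post-processing step (as proposed in
    the paper) could then adjust decision thresholds per demographic group."""
    scores = [predict_fn(model, x) for model in models]
    return sum(scores) / len(scores)

def unlearn(shards, models, shard_idx: int, train_fn: Callable):
    """Retrain only the model of the shard whose data changed (e.g. after deleting a sample)."""
    models[shard_idx] = train_fn(shards[shard_idx])
    return models
```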
Elena Cabrio; Mariana Chaves; Pierpaolo Goffredo; Serena Villata
CNRS; Inria; Université Côte d'Azur;
Fallacies are arguments that employ faulty reasoning. Given their persuasive and seemingly valid nature, fallacious arguments are often used in political debates. Employing these misleading arguments in politics can have detrimental consequences for society, since they can lead the public and policymakers to inaccurate conclusions and invalid inferences. Automatically detecting and classifying fallacious arguments therefore represents a crucial challenge to limit the spread of misleading or manipulative claims and promote a more informed and healthier political discourse. Our contribution to addressing this challenging task is twofold. First, we extend the ElecDeb60To16 dataset of U.S. presidential debates annotated with fallacious arguments, by incorporating the most recent Trump-Biden presidential debate. We include updated token-level annotations, incorporating argumentative components (i.e., claims and premises), the relations between these components (i.e., support and attack), and six categories of fallacious arguments (i.e., Ad Hominem, Appeal to Authority, Appeal to Emotion, False Cause, Slippery Slope, and Slogans). Second, we perform the twofold task of fallacious argument detection and classification by defining neural network architectures based on Transformer models, combining text, argumentative features, and engineered features. Our results show the advantages of complementing transformer-generated text representations with non-textual features.
Open Access
Conference paper
Association for Computational Linguistics Empirical Methods in Natural Language Processing
Luca Cuccovillo; Milica Gerhardt; Patrick Aichroth;
Fraunhofer IDMT;
In this study we propose a novel approach to audio phylogeny, i.e. the detection of relationships and transformations within a set of near-duplicate audio items, by leveraging a deep neural network for efficiency and extensibility. Unlike existing methods, our approach detects transformations between nodes in one step, and the transformation set can be expanded by retraining the neural network without excessive computational costs. We evaluated our method against the state of the art using a self-created and publicly released dataset, observing a superior performance in reconstructing phylogenetic trees and heightened transformation detection accuracy. Moreover, the ability to detect a wide range of transformations and to extend the transformation set make the approach suitable for various applications.
Open Access
Conference paper
IEEE International Workshop of Information Forensics and Security
Christina Katsini; George E. Raptis; Vasilis Theodorou;
Human Opsis
The News and Media landscape has undergone significant transformations in recent years, driven by the rise of new technologies and the widespread use of social media. This evolution introduces unique challenges for professionals working within this environment (e.g., journalists, content creators, and news authors), with a major one being the efficient sourcing of images that complement article content. In response to this challenge, we developed VIREO, a tool that recommends images based on textual content. In this paper, we take a step towards assessing the practical effectiveness of VIREO's core models in recommending images for real-world articles, with a specific focus on image recommendation efficiency. Our results indicate that VIREO offers a promising solution for professionals seeking to meet the evolving demands of the News and Media landscape while maintaining content quality and engagement.
Open Access
Conference paper
International Conference on Computer and Applications
Angelo Canale; Fabrizio Falchi; Giovanni Benelli; Giuseppe Amato; Luca Ciampi; Luca Incrocci; Stefano Chessa; Valeria Zeni;
ISTI-CNR; University of Pisa
Integrated Pest Management (IPM) is an essential approach used in smart agriculture to manage pest populations and sustainably optimize crop production. One of the cornerstones underlying IPM solutions is pest monitoring, a practice often performed by farm owners by using chromotropic sticky traps placed on insect hot spots to gauge pest population densities. In this paper, we propose a modular, model-agnostic, deep learning-based counting pipeline for estimating the number of insects present in pictures of chromotropic sticky traps, thus reducing the need for manual trap inspections and minimizing human effort. Additionally, our solution generates a set of raw positions of the counted insects and confidence scores expressing their reliability, allowing practitioners to filter out unreliable predictions. We train and assess our technique by exploiting PST – Pest Sticky Traps, a new collection of dot-annotated images we created on purpose and publicly release, suitable for counting whiteflies. Experimental evaluation shows that our proposed counting strategy can be a valuable Artificial Intelligence-based tool to help farm owners control pest outbreaks and prevent crop damage effectively. Specifically, our solution achieves an average counting error of approximately 9% with respect to human counts while requiring only a matter of seconds, a large improvement over the time-intensive process of manual inspection, which often takes hours or even days.
Open Access
Journal article
Ecological Informatics
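The confidence-based filtering mentioned in the preceding abstract amounts to a one-line rule; the toy function below (with a placeholder threshold and made-up data) counts only detections whose score exceeds a user-chosen confidence.

```python
def filter_and_count(detections: list, min_conf: float = 0.5) -> int:
    """detections: (x, y, confidence) tuples, one per counted insect."""
    return sum(1 for (_x, _y, conf) in detections if conf >= min_conf)

count = filter_and_count([(10.2, 33.1, 0.91), (54.0, 12.7, 0.34), (77.5, 60.2, 0.77)])  # -> 2
```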
Hervé Le Borgne; Michel Crucianu; Nicolas Audebert; Perla Doubinsky
CEA; Conservatoire National des Arts et Métiers;
With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performance. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires generating images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images, thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent and well-performing few-shot counting models on FSC147 and CARPK.
Open Access
Conference paper
Winter Conference on Applications of Computer Vision
Patrick Aichroth; Thomas Köllmer; Zühal Kurt
Atilim University; Fraunhofer IDMT;
The paper outlines an explainable knowledge graph-based recommendation system that aims to provide personalized news recommendations and to explain why an item is recommended to a particular user. The system leverages a knowledge graph (KG) that models the relationships between items and users' preferences, as well as external knowledge sources such as item features and user profiles. The main objectives of this study are to train a recommendation model that can predict whether a user will click on a news article or not, and then to obtain explainable recommendations for those predictions. This is achieved in three steps: first, a KG of the MIND dataset is generated based on the users' history and click information and on the category and subcategory of the news. Then, path reasoning approaches are utilized to obtain explainable paths for the recommended news items. Third, the proposed KG-based model is evaluated using the MIND news datasets. Experiments have been conducted using the MIND-demo and MIND-small datasets, which are open-source English news datasets for public research. Experimental results indicate that the proposed approach performs better in terms of recommendation explainability, making it a promising basis for developing transparent and interpretable recommendation systems.
Open Access
Conference paper
Conference on Knowledge Discovery
Alberto Messina; Stefano Scotta;
RAI;
In this work, we present an example of how a relatively small Large Language Model (LLM) fine-tuned to perform a simple and well-defined task (assigning titles to news articles) can perform similarly to or even better than huge LLMs which are created to respond to any question. This approach of specializing smaller LLMs on simpler tasks is also interesting because it goes in the direction of making this technology more sustainable and available to a larger number of entities that usually could not use these expensive models, both for economic and data-policy reasons. We also present a couple of examples of how the performance of LLMs can be evaluated when the task is specified as in the example presented in this work.
Open Access
Conference paper
International Conference of the Italian Association for Artificial Intelligence
Albert Gatt; Andrea Pedrotti; Anette Frank; Aykut Erdem; Emre Can Acikgoz; Erkut Erdem; Iacer Calixto; Ilker Kesen; Leticia Parcalabescu; Michele Cafagna; Mustafa Dogan
Hacettepe University; Heidelberg University; Institute of Information Science and Technologies; Koç University; University of Amsterdam; University of Malta; University of Pisa; Utrecht University;
With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs’ grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.
Open Access
Conference paper
Alberto Del Bimbo; Hondamunige Prasanna Silva; Lorenzo Seidenari;
University of Florence;
This paper presents a novel reconstruction method that leverages Diffusion Models to protect machine learning classifiers against adversarial attacks, all without requiring any modifications to the classifiers themselves. The susceptibility of machine learning models to minor input perturbations renders them vulnerable to adversarial attacks. While diffusion-based methods are typically disregarded for adversarial defense due to their slow reverse process, this paper demonstrates that our proposed method offers robustness against adversarial threats while preserving clean accuracy, speed, and plug-and-play compatibility. Code at: https://github.com/HondamunigePrasannaSilva/DiffDefence.
Open Access
Conference paper
N/A
Marius Gavrilescu;
Technical University of Iasi
The identification of important structures from volume data is a challenging problem in information visualization due to the complexity and amount of detail found in volume data sets. In particular, medical imaging devices generate scans which contain a significant amount of important anatomical structures, some of which are hidden, occluded or otherwise difficult to highlight. Conventional density and gradient-based classification methods fail to uncover such structures, thereby creating the necessity for more elaborate visualization methods and the involvement of multiple visual criteria in order to generate quality representations of the volume data. We propose a volume visualization approach which extends the conventional rendering pipeline by incorporating visibility-based quality criteria into the color and opacity mapping process. Our method consists in using two stacked transfer functions which handle visual mappings: one based on the density domain of the data set, and the other on a custom metric which quantifies the visibility of volumetric structures. We show that this arrangement allows the generation of improved representations of meaningful hidden structures from medical CT data, while constituting a reliable means of identifying volumetric details not representable using traditional approaches.
Open Access
Conference paper
E-Health and Bioengineering Conference 2023
Evlampios Apostolidis; Ioannis Patras; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas; Queen Mary University of London;
In this paper we present our study on the use of attention for explaining video summarization. We build on a recent work that formulates the task, called XAI-SUM, and we extend it by: a) taking into account two additional network architectures and b) introducing two novel explanation signals that relate to the entropy and diversity of attention weights. In total, we examine the effectiveness of seven types of explanation, using three state-of-the-art attention-based network architectures (CA-SUM, VASNet, SUM-GDA) and two datasets (SumMe, TVSum) for video summarization. The conducted evaluations show that the inherent attention weights are more suitable for explaining network architectures which integrate mechanisms for estimating attentive diversity (SUM-GDA) and uniqueness (CA-SUM). The explanation of simpler architectures (VASNet) can benefit from taking into account estimates about the strength of the input vectors, while another option is to consider the entropy of attention weights.
Open Access
Conference paper
ACM Multimedia
Alberto Del Bimbo; Lorenzo Berlincioni; Marco Bertini; Stefano Berretti
Università degli Studi di Firenze; University of Florence;
Time-varying sequences of 3D point clouds, or 4D point clouds, are now being acquired at an increasing pace in several applications (personal avatar representation, LiDAR in autonomous or assisted driving). In many cases, such volume of data is transmitted, thus requiring that proper compression tools are applied to either reduce the resolution or the bandwidth. In this paper, we propose a new solution for upscaling and restoration of time-varying 3D video point clouds after they have been heavily compressed. Our model consists of a specifically designed Graph Convolutional Network that combines Dynamic Edge Convolution and Graph Attention Networks for feature aggregation in a Generative Adversarial setting. We present a different way to sample dense point clouds with the intent to make these modules work in synergy to provide each node with enough features about its neighbourhood in order to later on generate new vertices. Compared to other solutions in the literature that address the same task, our proposed model is capable of obtaining comparable results in terms of quality of the reconstruction, while using a substantially lower number of parameters (approximately 300 KB), making our solution deployable in edge computing devices.
Open Access
Conference paper
N/A
Artem Yaroshchuk; Christoforos Papastergiopoulos; Dimitrios Tzovaras; Konstantinos Votis; Luca Cuccovillo; Patrick Aichroth;
CERTH - Center for Research and Technology Hellas; Fraunhofer IDMT;
This paper introduces a multilingual, multispeaker dataset composed of synthetic and natural speech, designed to foster research and benchmarking in synthetic speech detection. The dataset encompasses 18,993 audio utterances synthesized from text, alongside their corresponding natural equivalents, representing approximately 17 hours of synthetic audio data. The dataset features synthetic speech generated by 156 voices spanning three languages, namely English, German, and Spanish, with a balanced gender representation. It targets state-of-the-art synthesis methods, and has been released with a license allowing seamless extension and redistribution by the research community.
Open Access
Conference paper
IEEE International Workshop of Information Forensics and Security
Claudio Gennaro; Fabio Carrara; Giuseppe Amato; Lucia Vadicamo;
ISTI-CNR;
The rapid development of deep learning and artificial intelligence has transformed our approach to solving scientific problems across various domains, including computer vision, natural language processing, and automatic content generation. Information retrieval (IR) has also experienced significant advancements, with natural language understanding and multimodal content analysis enabling accurate information retrieval. However, the widespread adoption of neural networks has also influenced the focus of IR problem-solving, which nowadays predominantly relies on evaluating the similarity of dense vectors derived from the latent spaces of deep neural networks. Nevertheless, the challenges of conducting similarity searches on large-scale databases with billions of vectors persist. Traditional IR approaches use inverted indices and vector space models, which work well with sparse vectors. In this paper, we propose Vec2Doc, a novel method that converts dense vectors into sparse integer vectors, allowing for the use of inverted indices. Preliminary experimental evaluation shows a promising solution for large-scale vector-based IR problems.
Open Access
Conference paper
International Conference on Similarity Search and Applications
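The exact Vec2Doc transformation is not reproduced here, but the family of tricks it belongs to can be illustrated as follows: map a dense vector to a sparse bag of integer-weighted terms so that a classical inverted index can store and search it. The term naming and quantization below are illustrative assumptions only, not the paper's algorithm.

```python
import numpy as np

def dense_to_sparse_terms(vec: np.ndarray, top_k: int = 64, scale: int = 10) -> dict:
    """Keep the top_k strongest components and quantize their magnitudes to integer
    term frequencies; the component index plus its sign acts as a vocabulary term."""
    idx = np.argsort(-np.abs(vec))[:top_k]
    terms = {}
    for i in idx:
        tf = int(round(abs(vec[i]) * scale))
        if tf > 0:
            terms[f"d{i}{'p' if vec[i] > 0 else 'n'}"] = tf
    return terms

sparse_doc = dense_to_sparse_terms(np.random.randn(768))   # could be indexed by a text search engine
```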
Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas; Queen Mary University of London;
This chapter focuses on explainable video summarization, a technology that could significantly advance the content production workflow of Media organizations. It starts by presenting the current state of the art in the fields of deep-learning-based video summarization and explainable video analysis and understanding. Following, it focuses on video summarization methods that rely on the use of attention mechanisms and reports on previous works that investigated the use of attention for explaining the outcomes of deep neural networks. Subsequently, it briefly describes a state-of-the-art attention-based architecture for unsupervised video summarization and discusses a recent work that examines the use of various attention-based signals for explaining the outcomes of video summarization. Finally, it provides recommendations about future research directions.
Open Access
Book section
Encyclopedia of Information Science and Technology
Daniel Gatica-Perez; Sina Sajadmanesh
Idiap Research Institute
Graph Neural Networks (GNNs) have become a popular tool for learning on graphs, but their widespread use has raised privacy concerns, as graph data can contain personal or sensitive information. Differentially private GNN models have been recently proposed to preserve privacy while still allowing for effective learning over graph-structured datasets. However, achieving an ideal balance between accuracy and privacy in GNNs remains challenging due to the intrinsic structural connectivity of graphs. In this paper, we propose a new differentially private GNN called ProGAP that uses a progressive training scheme to improve such accuracy-privacy trade-offs. Combined with the aggregation perturbation technique to ensure differential privacy, ProGAP splits a GNN into a sequence of overlapping submodels that are trained progressively, expanding from the first submodel to the complete model. Specifically, each submodel is trained over the privately aggregated node embeddings learned and cached by the previous submodels, leading to an increased expressive power compared to previous approaches while limiting the incurred privacy costs. We formally prove that ProGAP ensures edge-level and node-level privacy guarantees for both training and inference stages, and evaluate its performance on benchmark graph datasets. Experimental results demonstrate that ProGAP can achieve up to 5-10% higher accuracy than existing state-of-the-art differentially private GNNs. Our code is available at https://github.com/sisaman/ProGAP.
Open Access
Publication
N/A
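The aggregation perturbation step that ProGAP builds on can be sketched in a few lines: bound each node's contribution by normalizing its embedding, sum neighbour embeddings, and add Gaussian noise calibrated to that bound. This is a simplified illustration, not the released ProGAP code.

```python
import torch
import torch.nn.functional as F

def private_aggregate(x: torch.Tensor, adj: torch.Tensor, noise_std: float) -> torch.Tensor:
    """x: (N, D) node embeddings; adj: (N, N) dense 0/1 adjacency matrix.
    Row-normalizing x bounds per-node sensitivity, so adding Gaussian noise to the
    neighbour sum yields a differentially private aggregate."""
    x = F.normalize(x, dim=-1)                      # bound each node's contribution to norm 1
    agg = adj @ x                                   # sum of neighbour embeddings
    return agg + noise_std * torch.randn_like(agg)  # Gaussian mechanism

agg = private_aggregate(torch.randn(100, 16), (torch.rand(100, 100) < 0.05).float(), noise_std=1.0)
```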
Alejandro Moreo; Fabrizio Sebastiani; Martin Senz; Mirko Bunse;
ISTI-CNR; University of Dortmund;
Quantification, i.e., the task of training predictors of the class prevalence values in sets of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multiclass problems in which the classes are not ordered. Here, we study the ordinal case, i.e., the case in which a total order is defined on the set of n > 2 classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each other's developments. Third, we propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments. The key to this gain in performance is that our regularization prevents ordinally implausible estimates, assuming that ordinal distributions tend to be smooth in practice. We informally verify this assumption for several real-world applications.
Open Access
Publication
Data Mining and Knowledge Discovery
Fabrizio Falchi; Jan Sedmidubsky; Nicola Messina; Tomás Rebok;
ISTI-CNR; Masaryk University;
Open Access
Conference paper
N/A
Florin Leon; Marius Gavrilescu; Sabina-Adriana Floria;
Technical University of Iasi
Representing relevant information from volume data sets is a problem often faced in visualization. Generating meaningful images from highly-complex volume data sets is a challenging, tedious task requiring specialized knowledge of the distribution and properties of the data. Traditionally, this task has been carried out manually via specialized user interfaces. We propose a volume visualization pipeline which facilitates the automatic generation of high-quality images from volume data sets. Our method involves a direct volume renderer which generates images from volume data based on visual mappings provided by a transfer function. Central to our approach is a quality-focused descriptor which exploits the properties of the distribution of gradient orientations of an alpha-bounded surface within the volume. This feature is useful for determining transfer functions that result in the rendering of corresponding images depicting various details from the volume. We show that by using this feature as an optimization objective, the generation of high quality images can be automated. Using simple genetic algorithms, we can automatically generate sets of images illustrating coherent, easily-distinguishable and high-quality surfaces of relevant structures from volume data.
Open Access
Conference paper
International Conference on System Theory
Anastasios Gkagkas; Davide Alessandro Coccomini; Gylfi Þór Guðmundsson; Jakub Lokoč; Jiaxin Wu; Nick Pantelidis; Nicola Messina; Rahel Arnold; Silvan Heller; Vera Benz; Werner Bailer;
CERTH - Center for Research and Technology Hellas; Charles University; City University of Hong Kong; ISTI-CNR; Joanneum Research; Reykjavik University; University of Basel;
Different task interpretations are a highly undesired element in interactive video retrieval evaluations. When a participating team focuses partially on a wrong goal, the evaluation results might become partially misleading. In this paper, we propose a process for refining known-item and open-set type queries, and preparing the assessors that judge the correctness of submissions to open-set queries. Our findings from recent years reveal that a proper methodology can lead to objective query quality improvements and subjective participant satisfaction with query clarity.
Open Access
Conference paper
Conference on Multimedia Retrieval
David Renaudie; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas; Matthew Barthet; Mike Thomsen;
Massive Entertainment - Ubisoft; University of Malta
This paper introduces a large scale multimodal corpus collected for the purpose of analysing and predicting player engagement in commercial-standard games. The corpus is solicited from 25 players of the action role-playing game Tom Clancy’s The Division 2, who annotated their level of engagement using a time-continuous annotation tool. The cleaned and processed corpus presented in this paper consists of nearly 20 hours of annotated gameplay videos accompanied by logged gamepad actions. We report preliminary results on predicting long-term player engagement based on in-game footage and game controller actions using Convolutional Neural Network architectures. Results obtained suggest we can predict the player engagement with up to accuracy on average ( at best) when we fuse information from the game footage and the player’s controller input. Our findings validate the hypothesis that long-term (i.e. 1 hour of play) engagement can be predicted efficiently solely from pixels and gamepad actions.
Open Access
Paper
Conference on Multimodal Interaction
Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas; Queen Mary University of London;
This paper presents a new reinforcement-based method for video thumbnail selection (called RL-DiVTS), that relies on estimates of the aesthetic quality, representativeness and visual diversity of a small set of selected frames, made with the help of tailored reward functions. The proposed method integrates a novel diversity-aware Frame Picking mechanism that performs a sequential frame selection and applies a reweighting process to demote frames that are visually-similar to the already selected ones. Experiments on two benchmark datasets (OVP and YouTube), using the top-3 matching evaluation protocol, show the competitiveness of RL-DiVTS against other SoA video thumbnail selection and summarization approaches from the literature.
Open Access
Paper
IEEE International Conference on Image Processing
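The diversity-aware picking step described in the preceding abstract can be illustrated by a greedy loop that repeatedly selects the highest-scoring frame and down-weights the remaining scores in proportion to their similarity to the frames already chosen. This is an illustrative sketch, not the RL-DiVTS implementation.

```python
import torch

def diverse_pick(scores: torch.Tensor, feats: torch.Tensor, k: int = 3) -> list:
    """scores: (N,) per-frame quality scores; feats: (N, D) L2-normalized frame features."""
    remaining = scores.clone()
    picked = []
    for _ in range(k):
        idx = int(torch.argmax(remaining))
        picked.append(idx)
        sim = (feats @ feats[idx]).clamp(min=0)   # cosine similarity to the newly picked frame
        remaining = remaining * (1.0 - sim)       # demote visually similar frames
        remaining[picked] = float("-inf")         # never re-pick an already selected frame
    return picked

frames = torch.nn.functional.normalize(torch.randn(50, 64), dim=1)
thumbnails = diverse_pick(torch.rand(50), frames, k=3)
```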
Alberto Del Bimbo; Lorenzo Seidenari; Luca Cultrera;
University of Florence;
Out-of-Distribution (OOD) detection is a crucial challenge in computer vision, especially when deploying machine learning models in the real world. In this paper, we propose a novel OOD detection method leveraging Visual Attention Heatmaps from a Vision Transformer (ViT) classifier. Our approach involves training a Convolutional Autoencoder to reconstruct attention heatmaps produced by a ViT classifier, enabling accurate image reconstruction and effective OOD detection. Moreover, our method does not require additional labels during training, ensuring efficiency and ease of implementation. We validate our approach on a standard OOD benchmark using CIFAR10 and CIFAR100. To test OOD detection in a real-world setting, we also collected a novel dataset: WildCapture. Our new dataset comprises more than 60k wild animal shots, from 15 different wildlife species, taken via phototraps in varying lighting conditions. The dataset is fully annotated with animal bounding boxes and species.
Open Access
Conference paper
N/A
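A minimal sketch of the detection rule implied by the preceding abstract, assuming the ViT attention heatmaps are precomputed: an autoencoder trained on in-distribution heatmaps reconstructs them well, so a large reconstruction error flags a sample as out-of-distribution. The thresholding below is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ood_score(autoencoder, heatmap: torch.Tensor) -> float:
    """heatmap: (1, 1, H, W) attention map of one test image; higher score = more likely OOD."""
    return F.mse_loss(autoencoder(heatmap), heatmap).item()

def is_ood(autoencoder, heatmap: torch.Tensor, threshold: float) -> bool:
    # The threshold would be calibrated on held-out in-distribution heatmaps.
    return ood_score(autoencoder, heatmap) > threshold
```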
Cristian-Nicolae Butincu; Florin Leon; Lavinia-Eugenia Ferariu; Marius Gavrilescu;
Technical University of Iasi
This report describes our research and documentation efforts in searching and analyzing the related literature for existing applications of evolutionary algorithms for quality-oriented optimization. We present our findings in terms of multiple relevant results from the related state of the art. We mainly divide the results into two broad categories: classic single- and multi-objective optimization, and quality-diversity (QD) methods. While we mostly focus on evolutionary optimization applied in visualization and image-processing, we also present some results from other fields which we considered relevant. This report was originally submitted as documentation for the deliverables of the VolEvol project.
Open Access
Report
N/A
Fabio Carrara; Fabrizio Falchi; Maurizio Tesconi;
ISTI-CNR; University of Pisa
Trends and opinion mining in social media increasingly focus on novel interactions involving visual media, like images and short videos, in addition to text.
In this work, we tackle the problem of visual sentiment analysis of social media images — specifically, the prediction of image sentiment polarity. While previous work relied on manually labeled training sets, we propose an automated approach for building sentiment polarity classifiers based on a cross-modal distillation paradigm; starting from scraped multimodal (text + images) data, we train a student model on the visual modality based on the outputs of a textual teacher model that analyses the sentiment of the corresponding textual modality.
We applied our method to randomly collected images crawled from Twitter over three months and produced, after automatic cleaning, a weakly-labeled dataset of ∼1.5 million images. Despite exploiting noisy labeled samples, our training pipeline produces classifiers showing strong generalization capabilities and outperforming the current state of the art on five manually labeled benchmarks for image sentiment polarity prediction.
Open Access
Publication
ECAI - European Conference on Artificial Intelligence
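The cross-modal distillation loop in the preceding abstract can be pictured as follows: a frozen text teacher produces soft polarity labels for each tweet, and the image student is trained to reproduce them from the paired image alone. The interfaces below (teacher probabilities precomputed, a generic student classifier) are assumptions for illustration, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, images: torch.Tensor, teacher_probs: torch.Tensor, optimizer) -> float:
    """images: (B, 3, H, W) tweet images; teacher_probs: (B, 2) soft polarity labels
    produced by the textual teacher on the paired tweet texts."""
    optimizer.zero_grad()
    log_probs = F.log_softmax(student(images), dim=-1)
    loss = F.kl_div(log_probs, teacher_probs, reduction="batchmean")  # student mimics the teacher
    loss.backward()
    optimizer.step()
    return loss.item()
```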
Ali Najm; Antonios Liapis; Despina Michael-Grigoriou; Emmanouil Xylakis; Georgios N. Yannakakis;
Cyprus University of Technology; University of Malta
Open Access
Conference paper
N/A
Ioanna Valsamara; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Andreas Sochopoulos; Evangelos Charalampakis; Ioannis Mademlis; Ioannis Pitas; Sotirios Papadopoulos
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Dimitrios Papaioannou; Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Anestis Kaimakamadis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Ioanna Valsamara; Ioannis Mademlis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Emmanouil Krasanakis; Ioanna Valsamara; Ioannis Mademlis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Hervé Le Borgne; Michel Crucianu; Nicolas Audebert; Perla Doubinsky
CNAM; Université Paris-Saclay;
The latent space of GANs contains rich semantics reflecting the training data. Different methods propose to learn edits in latent space corresponding to semantic attributes, thus allowing to modify generated images. Most supervised methods rely on the guidance of classifiers to produce such edits. However, classifiers can lead to out-of-distribution regions and be fooled by adversarial samples. We propose an alternative formulation based on the Wasserstein loss that avoids such problems, while maintaining performance on-par with classifier-based approaches. We demonstrate the effectiveness of our method on two datasets (digits and faces) using StyleGAN2.
Open Access
Conference paper
N/A
Alejandro Moreo; Fabrizio Sebastiani; Mirko Bunse; Pablo González
Consiglio Nazionale delle Ricerche; University of Oviedo
Open Access
Book
N/A
Florin Leon; Marius Gavrilescu;
Technical University of Iasi
The study of hurricanes through information visualization and visual analysis is useful for tracking and understanding the behavior and impact of such hazardous natural phenomena. Images obtained from data commonly acquired through meteorological radar provide scientists with a visual representation of the storm’s characteristics, such as its location, size, and intensity. Such information is useful for forecasting, decision making in disaster management and environmental and human health risk assessment. Visual representations of such phenomena can help emergency responders and policymakers make informed decisions about evacuations, disaster response, and resource allocation. In this context, we propose an automated means of generating representations from complex 3D datasets obtained from meteorological radar scans of regions affected by hurricanes, illustrating the geometry and spatial features of such phenomena.
Open Access
Conference paper
International Conference on Environmental Engineering and Management
Ioannis Mademlis; Ioannis Pitas; Michail Kaseris
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Axel Roebel; Lenny Renault; Rémi Mignot;
Sorbonne Université
Open Access
Journal article
Journal of the Audio Engineering Society
Hannes Fassold;
Joanneum Research;
Manifold learning is an emerging research domain of machine learning. In this work, we give an introduction into manifold learning and how it is employed for important application fields in multimedia.
Open Access
Conference paper
Conference on Video and Signal Processing
Claudio Gennaro; Fabrizio Falchi; Gaetano Emanuele Valenti; Giuseppe Amato; Luca Ciampi; Nicola Messina;
ISTI-CNR; University of Pisa
Open Access
Conference paper
Conference on Image Analysis and Processing
Antonino Furnari; Claudio Gennaro; Fabrizio Falchi; Giovanni Maria Farinella; Nicola Messina;
ISTI-CNR; University of Catania;
Open Access
Journal article
Conference on Image Analysis and Processing
Axel Roebel; Lenny Renault; Rémi Mignot;
Sorbonne Université
Open Access
Conference paper
International Conference on Digital Audio Effects
Bruno Lepri; Linchao Bao; Marco de Nadai; Nicu Sebe; Yahui Liu; Yajing Chen;
FBK; Tencent AI Lab; University of Trento;
Closed Access
Journal article
IEEE Transactions on Multimedia
Marius Gavrilescu;
Technical University of Iasi
Objective quality assessment in volume visualization is a crucial process aimed at quantifying the quality of rendered volumetric images or animations using measurable metrics and algorithms. This approach is essential to ensure that the visualizations accurately represent the underlying data and meet specific quality standards. The assessment of quality in computer graphics, visualization and image processing is a complex task, particularly due to the number of scenarios, use cases and problems encountered in the aforementioned fields, and also due to the subjective nature of quality. To this end, we search for methods, algorithms and metrics that can be used by an optimizer to search for rendering parameters such that the resulting images adhere to our formulations of what constitutes quality. At the same time, similar metrics can be exploited so that the space of possible parameters can be more thoroughly explored, resulting in populations of images exhibiting diverse content. This document presents our findings in terms of approaches that constitute good candidates for quality and diversity criteria, to be used as objectives and/or for defining feature spaces when automatically generating images from volume data. This report was originally submitted as documentation for the deliverables of the VolEvol project.
Open Access
Report
N/A
Antonios Liapis; Chintan Trivedi; Emmanouil Xylakis; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas; Matthew Barthet
University of Malta
Open Access
Conference paper
Conference on Affective Computing and Intelligent Interaction Workshops and Demos
Christina Katsini; George E. Raptis; Vasilis Theodorou;
Human Opsis
In a fast-changing media ecosystem, professionals and enterprises in the News and Media industry face new challenges that they must address to maximize their productivity and improve their services. The rise of alternative news sources, such as social media, now a leading news source especially for young people, has led to emerging requirements in the News and Media industry. A core requirement is publishing articles as fast as possible on various platforms, combining visual and textual content. Accompanying news with images raises readers' interest and improves engagement and recall. Therefore, News and Media industry professionals must adapt their publication strategies to meet this requirement and the media consumers' expectations. However, the selection of appropriate images is a time-consuming, manual task. In this direction, we propose VIREO, which addresses this challenge by providing professionals (e.g., journalists) with an integrated digital solution that automatically recommends a collection of images that could accompany an article. To achieve this, VIREO implements text and image analysis and matching processes leveraging AI techniques in real time. VIREO aims to benefit both professionals (e.g., journalists), by suggesting appealing images that accompany the textual content of their articles and help create breath-taking stories, and media consumers (e.g., readers), by delivering an enhanced reading experience with greater engagement and recall.
Open Access
Conference paper
Human-Computer Interaction
Ambrish Rawat; Gabriele Picco; Giulio Zizzo; Myles Foley; Taesung Lee; Yufang Hou;
IBM Research; Imperial College London;
The wide applicability and adaptability of large language models (LLMs) have enabled their rapid adoption. While pre-trained models can perform many tasks, such models are often fine-tuned to improve their performance. However, this leads to issues over violation of model licenses, model theft, and copyright infringement. Moreover, recent advances show that generative technology is capable of producing harmful content, which exacerbates the problems of accountability within model supply chains. Thus, we need a method to investigate how a model was trained or a piece of text was generated and what its source pre-trained model was. In this paper we take a first step towards addressing this open problem by tracing back the origin of a given fine-tuned LLM to its corresponding pre-trained base model. We consider different knowledge levels and attribution strategies, and find that we are able to trace back to the original base model with an AUC of 0.804.
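A minimal sketch of one plausible attribution strategy, under a white-box assumption and not necessarily the method used in the paper: score each candidate base model by the perplexity it assigns to text sampled from the fine-tuned model, on the premise that the true base model stays closest to its fine-tuned descendant. The model names and example text below are illustrative placeholders.

```python
# Hypothetical sketch: rank candidate base models by perplexity on text
# produced by the fine-tuned model; the lowest-perplexity candidate is
# taken as the likely origin. Model names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, texts: list[str]) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    losses = []
    with torch.no_grad():
        for t in texts:
            enc = tok(t, return_tensors="pt", truncation=True, max_length=512)
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return float(torch.tensor(losses).mean().exp())

# texts would be generations sampled from the fine-tuned model under analysis
candidates = ["gpt2", "EleutherAI/pythia-160m"]   # illustrative candidate bases
texts = ["Example generation sampled from the fine-tuned model."]
scores = {name: perplexity(name, texts) for name in candidates}
print(min(scores, key=scores.get))                # predicted base model
```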
Open Access
Conference paper
N/A
Ioannis Patras; Zengqun Zhao;
Queen Mary University of London;
Open Access
Conference paper
N/A
Aaron Duane; Cathal Gurrin; Florian Spiess; Jakub Lokoč; Klaus Schoeffmann; Konstantin Schall; Ladislav Peška; Loris Sauter; Luca Rossetto; Lucia Vadicamo; Nicola Messina; Omar Shahbaz Khan; Stefanos Vrochidis; Stelios Andreadis; Thao-Nhu Nguyen; Werner Bailer; Zhixin Ma;
CERTH - Center for Research and Technology Hellas; Charles University; Dublin City University; HTW Berlin; ISTI-CNR; IT University of Copenhagen; Joanneum Research; Klagenfurt University; Singapore Management University; University of Basel; University of Copenhagen; University of Zurich;
This paper presents the findings of the eleventh Video Browser Showdown competition, where sixteen teams competed in known-item and ad-hoc search tasks. Many of the teams utilized state-of-the-art video retrieval approaches that demonstrated high effectiveness in challenging search scenarios. In the paper, a broad survey of all utilized approaches is presented in connection with an analysis of the performance of participating teams. Specifically, high-level performance indicators are presented with overall statistics, together with an in-depth analysis of the performance of selected tools implementing result set logging. The analysis reveals evidence that the CLIP model represents a versatile tool for cross-modal video retrieval when combined with interactive search capabilities. Furthermore, the analysis investigates the effect of different users and text query properties on the performance in search tasks. Last but not least, lessons learned from search task preparation are presented, and a new direction for ad-hoc search tasks at the Video Browser Showdown is introduced.
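As an illustration of how CLIP can back such cross-modal search, the sketch below ranks pre-extracted video keyframes against a free-text query. It is a simplified rendering of the general pattern, not a reconstruction of any particular team's system; file names and the checkpoint are assumptions.

```python
# Sketch of CLIP-based known-item search: embed a text query and a set of
# keyframe images, then rank frames by similarity. Paths are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.open(p) for p in ["frame_001.jpg", "frame_002.jpg"]]  # keyframes
inputs = processor(text=["a red car driving in the rain"],
                   images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
# logits_per_text holds the similarity of the query against every frame
ranking = out.logits_per_text.squeeze(0).argsort(descending=True)
print(ranking.tolist())  # frame indices, best match first
```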
Open Access
Journal article
N/A
Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki;
This work examines the problem of increasing the robustness of deep neural network-based image classification systems to adversarial attacks, without changing the neural architecture or employing adversarial examples in the learning process. We attribute their well-known lack of robustness to the geometric properties of the deep neural network embedding space, derived from standard optimization options, which allow minor changes in the intermediate activation values to trigger dramatic changes to the decision values in the final layer. To counteract this effect, we explore optimization criteria that supervise the distribution of the intermediate embedding spaces on a class-specific basis, by introducing and leveraging one-class classification objectives. The proposed learning procedure compares favorably to recently proposed training schemes for adversarial robustness in black-box adversarial attack settings.
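A hedged sketch of the kind of class-specific embedding supervision the abstract alludes to: alongside the usual cross-entropy, penalize the distance of each intermediate embedding to its class centroid so that same-class activations cluster tightly. This is an illustrative one-class-style term, not the exact criterion proposed in the paper.

```python
# Illustrative class-specific compactness loss on an intermediate embedding:
# pull each sample's embedding towards the centroid of its class.
import torch

def compactness_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    loss = embeddings.new_zeros(())
    for c in range(num_classes):
        mask = labels == c
        if mask.sum() < 2:
            continue
        center = embeddings[mask].mean(dim=0).detach()   # class centroid
        loss = loss + ((embeddings[mask] - center) ** 2).sum(dim=1).mean()
    return loss / num_classes

# total_loss = cross_entropy(logits, labels) + lam * compactness_loss(z, labels, K)
```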
Open Access
Conference paper
N/A
Alexandros Zamichos; Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki;
Adversarial attacks in image classification are optimization problems that estimate the minimum perturbation required for a single input image so that the neural network misclassifies it. Universal adversarial perturbations are adversarial attacks that target a whole dataset, estimated by, e.g., accumulating the perturbations for each image using standard adversarial attacks. This work treats the universal adversarial perturbation as a problem of transformation estimation. As such, we propose to learn an iterative transformation that maps “clean” images to a “perturbed” domain by exploiting adversarial attacks. Our experiments show that the proposed formulation leads to easy generation of the adversarial perturbation, while introducing less noise in the perturbed images compared to the state-of-the-art. Finally, this formulation allows us to explore additional properties, notably the reversibility of the transformation and its attainability using dataset samples.
Open Access
Conference paper
N/A
Ioannis Pitas; Stefania Altini; Vasileios Mygdalis
Aristotle University of Thessaloniki;
Different adversarial attack methods have been proposed in the literature, mainly focusing on attack efficiency and visual quality, e.g., similarity to the non-adversarial examples. These properties enable the use of adversarial attacks for privacy protection against automated classification systems, while maintaining utility for human users. In this paradigm, when privacy restrictions are lifted, access to the original data should be restored for all stakeholders. This paper addresses exactly this problem. Existing adversarial attack methods cannot reconstruct the original data from the adversarial ones, leading to significant storage overhead for all privacy applications. To solve this issue, we propose AdvRevGAN, a novel neural network architecture that generates reversible adversarial examples. We evaluate our approach on classification problems, examining the case where adversarial attacks are constructed by a neural network and the original images are reconstructed by applying the reverse transformation to the adversarial examples. We show that adversarial attacks using this approach maintain and even increase their efficiency, while the classification accuracy of the model on the reconstructed data can be almost fully restored.
Open Access
Conference paper
N/A
Daniel Aláez; Ioannis Pitas; Jesús Villadangos; Vasileios Mygdalis
Aristotle University of Thessaloniki; University of Navarre;
In recent years, the field of automated aerial cinematography has seen a significant increase in demand for real-time 3D target geopositioning for motion and shot planning. To this end, many existing cinematography plans require complex sensors mounted on the subject or rely on external motion systems. This work addresses the problem by combining monocular visual target detection and tracking with a simple ground intersection model. Under the assumption that the targets to be filmed typically stand on the ground, 3D target localization is achieved by estimating the direction and the norm of the look-at vector. The proposed algorithm employs an error estimation model that accounts for bounding box detection errors, height estimation errors, and the uncertainties of the pitch and yaw angles. The algorithm has been fully implemented on a heavy-lifting aerial cinematography hexacopter, and its performance has been evaluated through experimental flights. Results show that typical errors are within 5 meters of absolute distance and 3 degrees of angular error for distances to the target of around 100 meters.
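A minimal worked example of the ground-intersection idea under simplifying assumptions (flat ground at z = 0, a pinhole camera, known drone position and camera pitch/yaw, roll ignored): back-project the pixel at the bottom of the target's bounding box into a ray and intersect it with the ground plane. The intrinsics and angles below are illustrative, not those of the paper's platform.

```python
# Sketch: intersect the camera ray through a pixel with the ground plane z = 0.
# Assumes a pinhole camera with focal lengths in pixels, known camera position
# and pitch/yaw; roll is ignored for brevity. All values are illustrative.
import numpy as np

def geoposition(u, v, fx, fy, cx, cy, cam_pos, pitch, yaw):
    # Ray in camera coordinates (x right, y down, z forward)
    d_cam = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    # Map camera axes to world axes (x east, y north, z up), then rotate
    cp, sp = np.cos(pitch), np.sin(pitch)
    cyw, syw = np.cos(yaw), np.sin(yaw)
    R_pitch = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    R_yaw = np.array([[cyw, -syw, 0], [syw, cyw, 0], [0, 0, 1]])
    d_world = R_yaw @ R_pitch @ np.array([d_cam[0], d_cam[2], -d_cam[1]])
    t = -cam_pos[2] / d_world[2]          # scale so the ray reaches z = 0
    return cam_pos + t * d_world          # 3D target position on the ground

cam = np.array([0.0, 0.0, 50.0])          # drone hovering 50 m above the ground
target = geoposition(640, 500, fx=1000, fy=1000, cx=640, cy=360,
                     cam_pos=cam, pitch=np.deg2rad(-30), yaw=0.0)
print(target)                             # e.g. roughly 64 m north of the drone
```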
Open Access
Conference paper
N/A
Christos Tzelepis; Georgios Tzimiropoulos; Ioannis Patras; Stella Bounareli; Vasileios Argyriou;
Kingston University London; Queen Mary University of London;
Open Access
Conference paper
N/A
Adrian Popescu; Armelle Brun; Evan Dufraisse; Jérôme Deshayes-Chossart; Julien Tourille;
Université de Lorraine; Université Paris-Saclay;
Target-dependent sentiment classification (TSC) enables a fine-grained automatic analysis of sentiments expressed in texts. Sentiment expression varies depending on the domain, and it is necessary to create domain-specific datasets. While socially important, TSC in the news domain remains relatively understudied. We introduce MAD-TSC, the first multilingual aligned dataset designed for TSC in news. MAD-TSC differs substantially from existing resources. First, it includes aligned examples in eight languages to facilitate a comparison of performance for individual languages, and a direct comparison of human and machine translation. Second, the dataset is sampled from a diversified parallel news corpus, and is diversified in terms of news sources and geographic spread of entities. Finally, MAD-TSC is more challenging than existing datasets because its samples are more complex. We exemplify the use of MAD-TSC with comprehensive monolingual and multilingual experiments. The latter show that machine translations can successfully replace manual ones, and that performance for all included languages can match that of English by automatically translating test examples.
Open Access
Conference paper
Conference on Computational Linguistics
Daniele Ugo Leonzio; Luca Cuccovillo; Marco Marcon; Paolo Bolettieri; Patrick Aichroth; Stefano Tubaro;
Fraunhofer IDMT; Politecnico di Milano;
In recent years, the multimedia forensics community has put great effort into developing solutions to assess the integrity and authenticity of multimedia objects, focusing especially on manipulations applied by means of advanced deep learning techniques. However, in addition to complex forgeries such as deepfakes, very simple yet effective manipulation techniques that do not involve any state-of-the-art editing tools still exist and prove dangerous. This is the case of audio splicing for speech signals, i.e., concatenating and combining multiple speech segments obtained from different recordings of a person in order to cast a new fake speech. Indeed, by simply adding a few words to an existing speech we can completely alter its meaning. In this work, we address the overlooked problem of detection and localization of audio splicing from different models of acquisition devices. Our goal is to determine whether an audio track under analysis is pristine or has been manipulated by splicing one or multiple segments obtained from different device models. Moreover, if a recording is detected as spliced, we identify where the modification has been introduced in the temporal dimension. The proposed method is based on a Convolutional Neural Network (CNN) that extracts model-specific features from the audio recording. After extracting the features, we determine whether there has been a manipulation through a clustering algorithm. Finally, we identify the point where the modification has been introduced through a distance-measuring technique. The proposed method can detect and localize multiple splicing points within a recording.
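The detection-then-localization pipeline described above can be illustrated, in very simplified form, with off-the-shelf components: compute per-frame features (plain MFCCs stand in for the model-specific CNN embeddings), cluster them, and flag a splicing point wherever the cluster assignment changes. This is a hedged approximation of the idea, not the authors' implementation; the file name is a placeholder.

```python
# Simplified stand-in for the splicing pipeline: per-frame features -> clustering
# -> change points. MFCCs replace the device-model CNN embeddings of the paper,
# and the number of clusters is fixed to 2 (at most one foreign device assumed).
import librosa
import numpy as np
from sklearn.cluster import KMeans

audio, sr = librosa.load("recording.wav", sr=16000)           # illustrative file
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20).T      # (frames, 20)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(mfcc)
changes = np.nonzero(np.diff(labels))[0]                       # frame indices
hop = 512                                                      # librosa default hop
# A real system would first decide whether two clusters are warranted at all.
print("candidate splicing points (s):", (changes * hop / sr).round(2))
```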
Open Access
Journal article
Multimedia FORensics in the WILD
Alberto Del Bimbo; Federico Becattini; Lorenzo Seidenari; Luca Cultrera;
University of Florence;
Autonomous driving is advancing at a fast pace, with driving algorithms becoming more and more accurate and reliable. Despite this, it is of utmost importance to develop models that can offer a certain degree of explainability in order to be trusted, understood and accepted by researchers and, especially, society. In this work we present a conditional imitation learning agent based on a visual attention mechanism in order to provide visually explainable decisions by design. We propose different variations of the method, relying on end-to-end trainable region proposal functions that generate regions of interest to be weighed by an attention module. We show that visual attention can improve driving capabilities while at the same time providing explainable decisions.
Open Access
Journal article
N/A
Nicu Sebe; Wei Wang; Yue Song
Beijing Jiaotong University; University of Trento;
Closed Access
Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence
Fang Li; Jing Wang; Jun Zhang; Wengjing Li; Zhongcheng Wu; Zhun Zhong
Chinese Academy of Sciences; University of Trento;
Closed Access
Journal article
Transactions on Intelligent Transportation Systems
Andy Keller; Max Welling; Nicu Sebe; Yue Song
University of Amsterdam; University of Trento;
Open Access
Conference paper
International Conference on Machine Learning
Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
Open Access
Journal article
N/A
Bin Ren; Hao Tang; Nicu Sebe; Wei Wang; Xia Li; Yiming Wang;
Beijing Jiaotong University; ETH Zurich; FBK; University of Trento;
For semantic-guided cross-view image translation, it is crucial to learn where to sample pixels from the source view image and where to reallocate them guided by the target view semantic map, especially when there is little overlap or drastic view difference between the source and target images. Hence, one not only needs to encode the long-range dependencies among pixels in both the source view image and target view semantic map but also needs to translate these learned dependencies. To this end, we propose a novel generative adversarial network, PI-Trans, which mainly consists of a novel Parallel-ConvMLP module and an Implicit Transformation module at multiple semantic levels. Extensive experimental results show that PI-Trans achieves the best qualitative and quantitative performance by a large margin compared to the state-of-the-art methods on two challenging datasets. The source code is available at https://github.com/Amazingren/PI-Trans.
Open Access
Conference paper
N/A
Chen Feng; Ioannis Patras;
Queen Mary University of London;
Deep learning has achieved great success in recent years with the aid of advanced neural network structures and large-scale human-annotated datasets. However, it is often costly and difficult to accurately and efficiently annotate large-scale datasets, especially for some specialized domains where fine-grained labels are required. In this setting, coarse labels are much easier to acquire as they do not require expert knowledge. In this work, we propose a contrastive learning method, called masked contrastive learning (MaskCon), to address the under-explored problem setting where we learn with a coarse-labelled dataset in order to address a finer labelling problem. More specifically, within the contrastive learning framework, for each sample our method generates soft-labels with the aid of coarse labels against other samples and another augmented view of the sample in question. In contrast to self-supervised contrastive learning, where only the sample’s augmentations are considered hard positives, and to supervised contrastive learning, where only samples with the same coarse labels are considered hard positives, we propose soft labels based on sample distances that are masked by the coarse labels. This allows us to utilize both inter-sample relations and coarse labels. We demonstrate that our method can obtain as special cases many existing state-of-the-art works and that it provides tighter bounds on the generalization error. Experimentally, our method achieves significant improvement over the current state-of-the-art in various datasets, including CIFAR10, CIFAR100, ImageNet-1K, Stanford Online Products and Stanford Cars196 datasets. Code and annotations are available at https://github.com/MrChenFeng/MaskCon_CVPR2023.
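The core idea, soft labels derived from sample similarities and masked by coarse labels, can be sketched in a few lines: similarities to the other in-batch samples form a soft target distribution, but only entries sharing the anchor's coarse label are allowed to remain positive. This is an illustrative reading of the abstract, not the exact MaskCon loss.

```python
# Illustrative masked soft-label construction: within a batch, the soft target
# of each anchor is its temperature-scaled similarity to the other samples,
# zeroed out wherever the coarse labels differ.
import torch
import torch.nn.functional as F

def masked_soft_labels(z: torch.Tensor, coarse: torch.Tensor, t: float = 0.1):
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    z = F.normalize(z, dim=1)                               # (N, D) embeddings
    sim = (z @ z.t() / t).masked_fill(eye, float("-inf"))   # exclude self
    same = (coarse.unsqueeze(0) == coarse.unsqueeze(1)) & ~eye
    weights = torch.softmax(sim, dim=1) * same.float()      # keep same-coarse only
    return weights / weights.sum(dim=1, keepdim=True).clamp_min(1e-8)

# soft = masked_soft_labels(embeddings, coarse_labels)
# These soft targets can then weight a cross-entropy over the similarity logits.
```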
Open Access
Conference paper
N/A
Christos Tzelepis; Giorgios Kordopatis-Zilos; Giorgios Tolias; Ioannis Kompatsiaris; Ioannis Patras; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas; Czech Technical University in Prague; Queen Mary University of London;
We introduce S2VS, a video similarity learning approach with self-supervision. Self-Supervised Learning (SSL) is typically used to train deep models on a proxy task so as to have strong transferability on target tasks after fine-tuning. Here, in contrast to prior work, SSL is used to perform video similarity learning and address multiple retrieval and detection tasks at once with no use of labeled data. This is achieved by learning via instance-discrimination with task-tailored augmentations and the widely used InfoNCE loss together with an additional loss operating jointly on self-similarity and hard-negative similarity. We benchmark our method on tasks where video relevance is defined with varying granularity, ranging from video copies to videos depicting the same incident or event. We learn a single universal model that achieves state-of-the-art performance on all tasks, surpassing previously proposed methods that use labeled data. The code and pretrained models are publicly available at: https://github.com/gkordo/s2vs.
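A minimal sketch of the instance-discrimination InfoNCE objective the abstract refers to, written for video-level embeddings of two augmented views; the additional loss on self-similarity and hard-negative similarity used by S2VS is omitted, so this only illustrates the basic term.

```python
# Basic InfoNCE over two augmented views of the same videos: each view's
# positive is its counterpart, all other in-batch embeddings act as negatives.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, t: float = 0.07) -> torch.Tensor:
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / t                     # (N, N) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# loss = info_nce(encoder(augment_a(videos)), encoder(augment_b(videos)))
```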
Open Access
Conference paper
N/A
Chen Feng; Ioannis Patras;
Queen Mary University of London;
Open Access
Conference paper
N/A
Alberto Del Bimbo; Andrea Leonardo; Chiara Albisani; Federico Becattini; Lisa Cresti; Lorenzo Berlincioni; Luca Cultrera; Sara Picchioni
University of Florence;
Recently, event cameras have shown broad applicability in several computer vision fields, especially for tasks that require high temporal resolution. In this work, we investigate the usage of this kind of data for emotion recognition by presenting NEFER, a dataset for Neuromorphic Event-based Facial Expression Recognition. NEFER is composed of paired RGB and event videos representing human faces labeled with the respective emotions and also annotated with face bounding boxes and facial landmarks. We detail the data acquisition process as well as provide a baseline method for RGB and event data. The collected data captures subtle micro-expressions, which are hard to spot with RGB data, yet emerge in the event domain. We report twice the recognition accuracy for the event-based approach, proving the effectiveness of a neuromorphic approach for analyzing fast and hardly detectable expressions and the emotions they conceal.
Open Access
Conference paper
Computer Vision Foundation
Alessandro Betti; Frédéric Precioso; Gabriele Ciravegna; Kevin Mottin; Marco Gori
Politecnico di Torino; Université Côte d'Azur; University of Siena
The deployment of Deep Learning (DL) models is still precluded in those contexts where the amount of supervised data is limited. To address this issue, active learning strategies aim at minimizing the amount of labelled data required to train a DL model. Most active learning strategies are based on uncertain sample selection, often restricted to samples lying close to the decision boundary. These techniques are theoretically sound, but understanding the selected samples based on their content is not straightforward, further driving non-experts to consider DL as a black box. For the first time, we propose to take common domain knowledge into consideration and enable non-expert users to train a model with fewer samples. In our Knowledge-driven Active Learning (KAL) framework, rule-based knowledge is converted into logic constraints and their violation is checked as a natural guide for sample selection. We show that even simple relationships among data and output classes offer a way to spot predictions for which the model needs supervision. We empirically show that KAL (i) outperforms many active learning strategies, particularly in those contexts where domain knowledge is rich, (ii) discovers data distributions lying far from the initial training data, (iii) assures domain experts that the provided knowledge is acquired by the model, (iv) is suitable for regression and object recognition tasks, unlike uncertainty-based strategies, and (v) has a low computational demand.
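A toy sketch of the knowledge-driven selection step: a domain rule is turned into a soft constraint over the model's per-class scores, and the unlabelled samples that violate it most are queried first. The rule ("class A implies class B"), the class indices, and the example scores are illustrative assumptions, not the constraints used in the paper.

```python
# Toy knowledge-driven active learning step: the rule "class A implies class B"
# becomes a fuzzy violation score p_A * (1 - p_B); the most violating unlabelled
# samples are selected for annotation. Scores are per-class sigmoid outputs.
import numpy as np

def kal_select(probs: np.ndarray, n_query: int, a: int = 0, b: int = 1):
    violation = probs[:, a] * (1.0 - probs[:, b])   # high when A predicted w/o B
    return np.argsort(-violation)[:n_query]         # indices to annotate first

probs = np.array([[0.9, 0.1, 0.0],                  # strong violation of A -> B
                  [0.8, 0.7, 0.1],                  # mostly consistent with rule
                  [0.1, 0.2, 0.7]])                 # rule barely applies
print(kal_select(probs, n_query=1))                 # -> [0]
```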
Open Access
Conference paper
N/A
Fabrizio Falchi; Jan Sedmidubsky; Nicola Messina; Tomás Rebok;
ISTI-CNR; Masaryk University;
Open Access
Conference paper
N/A
Giorgios Kordopatis-Zilos; Ioannis Kompatsiaris; Pantelis Dogoulis; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
New advancements for the detection of synthetic images are critical for fighting disinformation, as the capabilities of generative AI models continuously evolve and can lead to hyper-realistic synthetic imagery at unprecedented scale and speed. In this paper, we focus on the challenge of generalizing across different concept classes, e.g., when training a detector on human faces and testing on synthetic animal images — highlighting the ineffectiveness of existing approaches that randomly sample generated images to train their models. By contrast, we propose an approach based on the premise that the robustness of the detector can be enhanced by training it on realistic synthetic images that are selected based on their quality scores according to a probabilistic quality estimation model. We demonstrate the effectiveness of the proposed approach by conducting experiments with generated images from two seminal architectures, StyleGAN2 and Latent Diffusion, and using three different concepts for each, so as to measure the cross-concept generalization ability. Our results show that our quality-based sampling method leads to higher detection performance for nearly all concepts, improving the overall effectiveness of the synthetic image detectors.
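In its simplest form, the quality-based sampling the paper advocates amounts to ranking candidate generated images with a quality estimator and keeping only the top-scoring ones for detector training. The sketch below assumes precomputed quality scores from some estimator; the actual probabilistic quality model of the paper is not reproduced, and the file names are placeholders.

```python
# Sketch of quality-based training-set construction: keep the generated images
# whose estimated quality is highest. Scores are assumed to come from a
# separate quality estimation model.
def select_by_quality(scores: dict[str, float], keep_ratio: float = 0.2) -> list[str]:
    ranked = sorted(scores, key=scores.get, reverse=True)   # best quality first
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

# scores: image path -> estimated quality of the generated image
scores = {"img_0001.png": 0.91, "img_0002.png": 0.42, "img_0003.png": 0.77}
print(select_by_quality(scores, keep_ratio=0.34))           # -> ['img_0001.png']
```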
Open Access
Conference paper
N/A
Adrian Popescu; Bogdan Ionescu; Giorgios Kordopatis-Zilos; Luca Cuccovillo; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas; Czech Technical University in Prague; Fraunhofer IDMT; Université Paris-Saclay; University Politehnica of Bucharest
With recent advancements in synthetic media manipulation and generation, verifying multimedia content posted online has become increasingly difficult. Additionally, the malicious exploitation of AI technologies by actors to disseminate disinformation on social media, and more generally the Web, at an alarming pace poses significant threats to society and democracy. Therefore, the development of AI-powered tools that facilitate media verification is urgently needed. The MAD ’23 workshop aims to bring together individuals working on the wider topic of detecting disinformation in multimedia to exchange their experiences and discuss innovative ideas, attracting people with varying backgrounds and expertise. The research areas of interest include identifying manipulated and synthetic content in multimedia, as well as examining the dissemination of disinformation and its impact on society. The multimedia aspect is very important since content most often contains a mix of modalities and their