Andrea Esuli; Claudio Gennaro; Davide Alessandro Coccomini; Fabrizio Falchi; Giuseppe Amato;
ISTI-CNR;
Open Access
Publication
N/A
Christos Koutlis; Symeon Papadopoulos
CERTH;
The recently developed and publicly available synthetic image generation methods and services make it possible to create extremely realistic imagery on demand, raising great risks for the integrity and safety of online information. State-of-the-art Synthetic Image Detection (SID) research has led to strong evidence on the advantages of feature extraction from foundation models. However, such extracted features mostly encapsulate high-level visual semantics instead of fine-grained details, which are more important for the SID task. On the contrary, shallow layers encode low-level visual information. In this work, we leverage the image representations extracted by intermediate Transformer blocks of CLIP’s image-encoder via a lightweight network that maps them to a learnable forgery-aware vector space capable of generalizing exceptionally well. We also employ a trainable module to incorporate the importance of each Transformer block to the final prediction. Our method is compared against the state-of-the-art by evaluating it on 20 test datasets and exhibits an average +10.6% absolute performance improvement. Notably, the best performing models require just a single epoch for training (~8 minutes). Code available at https://github.com/mever-team/rine.
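As a rough, hedged illustration of the core idea (not the released RINE code), the following PyTorch sketch assumes that CLS tokens have already been collected from every intermediate Transformer block of a frozen CLIP-like image encoder; a lightweight head projects them to a shared space, pools them with learnable per-block importance weights, and outputs a real/fake logit. The head design, dimensions, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IntermediateFeatureHead(nn.Module):
    """Maps per-block CLS features to a forgery logit with learnable block importance."""
    def __init__(self, num_blocks: int, dim: int, proj_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, proj_dim), nn.ReLU(), nn.Linear(proj_dim, proj_dim))
        self.block_logits = nn.Parameter(torch.zeros(num_blocks))  # learnable importance per block
        self.classifier = nn.Linear(proj_dim, 1)

    def forward(self, block_feats: torch.Tensor) -> torch.Tensor:
        # block_feats: (batch, num_blocks, dim) -- CLS token from each Transformer block
        z = self.proj(block_feats)                       # (batch, num_blocks, proj_dim)
        w = torch.softmax(self.block_logits, dim=0)      # importance weights sum to 1
        pooled = (w.unsqueeze(0).unsqueeze(-1) * z).sum(dim=1)
        return self.classifier(pooled).squeeze(-1)       # real/fake logit per image

# toy usage with stand-in features from a frozen image encoder
feats = torch.randn(4, 24, 1024)   # 4 images, 24 blocks, 1024-dim CLS tokens
head = IntermediateFeatureHead(num_blocks=24, dim=1024)
print(head(feats).shape)           # torch.Size([4])
```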
Open Access
Conference paper
European Conference on Computer Vision
Hannes Fassold;
Joanneum Research;
The detection of shot boundaries (hardcuts and short dissolves), sampling structure (progressive / interlaced / pulldown) and dynamic keyframes in a video are fundamental video analysis tasks which have to be done before any further high-level analysis tasks. We present a novel algorithm which does all these analysis tasks in a unified way, by utilizing a combination of inter-frame and intra-frame measures derived from the motion field and normalized cross correlation. The algorithm runs four times faster than real-time due to sparse and selective calculation of these measures.
Open Access
Publication
Conference on Imaging, Signal Processing and Communication
Tobias Blanke;
University of Amsterdam;
Archives have long been a key concern of academic debates about truth, memory, recording and power, and are important sites for social sciences and humanities research. This has been the case for traditional archives, but these debates have accelerated with the digital transformation of archives. The proliferation of digital tools and the fast-growing increase in digital materials have created very large digitised and born-digital archives. This article investigates how new digital archives continue existing archival practices while at the same time discontinuing them. We present novel methodologies and tools for changing memory and power relations in digital archives through new ways of reassembling marginalised, non-canonical entities. Reassembling digital archives can take advantage of the materiality and the algorithmic processuality of digital collections and reshape them to inscribe lost voices and previously ignored differences. Digital archives are not fixed; they change with new research and political questions and are only identified through such questions. The article presents six distinct techniques and strategies to reassemble digital archives and renders these according to three different types of new digital archives. We consider both the extension of archives towards evidence that is otherwise thrown away and the provision of new intensive, non-discriminatory viewpoints on existing collections.
Open Access
Journal article
N/A
Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos
CERTH;
Being able to express broad families of equivariant or invariant attributed graph functions is a popular measuring stick of whether graph neural networks should be employed in practical applications. However, it is equally important to find deep local minima of losses (i.e., produce outputs with much smaller loss values compared to other minima), even when architectures cannot express global minima. In this work we introduce the architectural property of attracting optimization trajectories to local minima as a means of achieving smaller loss values. We take first steps in satisfying this property for losses defined over attributed undirected unweighted graphs with an architecture called universal local attractor (ULA). This refines each dimension of end-to-end-trained node feature embeddings based on graph structure to track the optimization trajectories of losses satisfying some mild conditions. The refined dimensions are then linearly pooled to create predictions. We experiment on 11 tasks, from node classification to clique detection, on which ULA is comparable with or outperforms popular alternatives of similar or greater theoretical expressive power.
Open Access
Publication
N/A
Antonios Liapis; Georgios N. Yannakakis; Marvin Zammit;
University of Malta
The recent advances in language-based generative models have paved the way for the orchestration of multiple generators of different artefact types (text, image, audio, etc.) into one system. Presently, many open-source pre-trained models combine text with other modalities, thus enabling shared vector embeddings to be compared across different generators. Within this context we propose a novel approach to handle multimodal creative tasks using Quality Diversity evolution. Our contribution is a variation of the MAP-Elites algorithm, MAP-Elites with Transverse Assessment (MEliTA), which is tailored for multimodal creative tasks and leverages deep learned models that assess coherence across modalities. MEliTA decouples the artefacts’ modalities and promotes cross-pollination between elites. As a test bed for this algorithm, we generate text descriptions and cover images for a hypothetical video game and assign each artefact a unique modality-specific behavioural characteristic. Results indicate that MEliTA can improve text-to-image mappings within the solution space, compared to a baseline MAP-Elites algorithm that strictly treats each image-text pair as one solution. Our approach represents a significant step forward in multimodal bottom-up orchestration and lays the groundwork for more complex systems coordinating multimodal creative agents in the future.
Open Access
Conference paper
N/A
Hannes Fassold;
Joanneum Research;
Deploying Large Language Models (LLMs) on mobile devices makes all the capabilities of natural language processing available on the device. An important use case of LLMs is question answering, which can provide accurate and contextually relevant answers to a wide array of user queries. We describe how we managed to port state-of-the-art LLMs to mobile devices, enabling them to operate natively on the device. We employ the llama.cpp framework, a flexible and self-contained C++ framework for LLM inference. We selected a 6-bit quantized version of the Orca-Mini-3B model with 3 billion parameters and present the correct prompt format for this model. Experimental results show that LLM inference runs at interactive speed on a Galaxy S21 smartphone and that the model delivers high-quality answers to user queries related to questions from different subjects such as politics, geography or history.
Open Access
Conference paper
N/A
Konstantinos Gkrispanis; Nikolaos Gkalelis; Vasileios Mezaris
CERTH;
Face detectors are becoming a crucial component of many applications, including surveillance, that often have to run on edge devices with limited processing power and memory. Therefore, there’s a pressing demand for compact face detection models that can function efficiently across resource-constrained devices. Over recent years, network pruning techniques have attracted a lot of attention from researchers. These methods haven’t been well examined in the context of face detectors, despite their expanding popularity. In this paper, we implement filter pruning on two already small and compact face detectors, named EXTD (Extremely Tiny Face Detector) and EResFD (Efficient ResNet Face Detector). The main pruning algorithm that we utilize is Filter Pruning via Geometric Median (FPGM), combined with the Soft Filter Pruning (SFP) iterative procedure. We also apply L1 Norm pruning, as a baseline to compare with the proposed approach. The experimental evaluation on the WIDER FACE dataset indicates that the proposed approach has the potential to further reduce the model size of already lightweight face detectors, with limited accuracy loss, or even with small accuracy gain for low pruning rates.
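For illustration only, the sketch below shows the general idea behind FPGM-style filter scoring combined with soft pruning: filters closest to the layer's geometric median (approximated here by the smallest total distance to all other filters) are considered redundant and are zeroed rather than removed. This is a simplified stand-in for the procedure used in the paper; the pruning rate and layer shape are hypothetical.

```python
import torch

def fpgm_filter_scores(weight: torch.Tensor) -> torch.Tensor:
    """Score each filter by its total distance to all other filters of the layer.
    Filters with the smallest scores lie closest to the geometric median and are
    the most redundant, i.e. the first candidates for pruning."""
    # weight: (out_channels, in_channels, kH, kW)
    flat = weight.flatten(1)                      # one row per filter
    dists = torch.cdist(flat, flat, p=2)          # pairwise Euclidean distances
    return dists.sum(dim=1)

def soft_prune(weight: torch.Tensor, rate: float) -> torch.Tensor:
    """Zero out the `rate` fraction of filters with the lowest FPGM scores (soft pruning)."""
    scores = fpgm_filter_scores(weight)
    n_prune = int(rate * weight.shape[0])
    idx = torch.argsort(scores)[:n_prune]
    pruned = weight.clone()
    pruned[idx] = 0.0
    return pruned

w = torch.randn(64, 32, 3, 3)
zeroed = (soft_prune(w, 0.3).flatten(1).abs().sum(dim=1) == 0).sum()
print(zeroed)   # tensor(19): 19 of 64 filters zeroed at a 30% rate
```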
Open Access
Conference paper
Winter Conference on Applications of Computer Vision
Evlampios Apostolidis; Konstantinos Apostolidis; Vasileios Mezaris
CERTH;
This paper presents a web-based tool that facilitates the production of tailored summaries for online sharing on social media. Through an interactive user interface, it supports a “one-click” video summarization process. Based on the integrated AI models for video summarization and aspect ratio transformation, it facilitates the generation of multiple summaries of a full-length video according to the needs of target platforms with regard to the video’s length and aspect ratio.
Open Access
Conference paper
Conference on Multimedia Modeling
Evlampios Apostolidis; Ioannis Kontostathis; Vasileios Mezaris
CERTH;
In this work, we present an integrated system for spatiotemporal summarization of 360-degrees videos. The video summary production mainly involves the detection of salient events and their synopsis into a concise summary. The analysis relies on state-of-the-art methods for saliency detection in 360-degrees video (ATSal and SST-Sal) and video summarization (CA-SUM). It also contains a mechanism that classifies a 360-degrees video based on the use of a static or moving camera during recording and decides which saliency detection method will be used, as well as a 2D video production component that is responsible for creating a conventional 2D video containing the salient events in the 360-degrees video. Quantitative evaluations using two datasets for 360-degrees video saliency detection (VR-EyeTracking, Sports-360) show the accuracy and positive impact of the developed decision mechanism, and justify our choice to use two different methods for detecting the salient events. A qualitative analysis using content from these datasets gives further insights about the functionality of the decision mechanism, shows the pros and cons of each used saliency detection method, and demonstrates the advanced performance of the trained summarization method against a more conventional approach.
Open Access
Conference paper
Conference on Multimedia Modeling
Bogdan Ionescu; Hannes Fassold; Mihai Dogariu; Werner Bailer;
Joanneum Research; University Politehnica of Bucharest
Open Access
Conference paper
Conference on Multimedia Modeling
Alberto Messina; Angelo Bruccoleri; Fulvio Negro; Maurizio Montagnuolo; Roberto Iacoviello;
RAI;
Knowledge about the presence of people in a video is a valuable source of information in many applications, such as video annotation, retrieval and summarisation. The contribution of this paper goes in the direction of demonstrating how AI-based face processing technologies can be profitably used to perform video annotation of television content. To validate our vision, we developed the Face Management Framework (FMF), which implements an end-to-end pipeline for face analysis and content annotation based on few-shot or zero-shot face embedding extraction models. The results of the test campaign of the system show that the key performance indicators that we defined were exceeded by a wide margin, demonstrating how media workflows could greatly benefit from the tool and the efficiency improvements it brings.
Open Access
Conference paper
International Conference on Big Data
Ambrish Rawat; Anisa Halimi; Nathalie Baracaldo; Swanand Kadhe;
IBM Research;
Training large language models (LLMs) is a costly endeavour in terms of time and computational resources. The large amount of training data used during the unsupervised pre-training phase makes it difficult to verify all data and, unfortunately, undesirable data may be ingested during training. Re-training from scratch is impractical and has led to the creation of the unlearning discipline, where models are modified to “unlearn” undesirable information without retraining. However, any modification can alter the behaviour of LLMs, especially on key dimensions such as fairness. This is the first work that examines this interplay between unlearning and fairness for LLMs. In particular, we focus on a popular unlearning framework known as SISA [Bourtoule et al., 2021], which creates an ensemble of models trained on disjoint shards. We evaluate the performance-fairness trade-off for SISA, and empirically demonstrate that SISA can indeed reduce fairness in LLMs. To remedy this, we propose post-processing bias mitigation techniques for ensemble models produced by SISA. We adapt the post-processing fairness improvement technique from [Hardt et al., 2016] to design three methods that can handle model ensembles, and prove that one of the methods is an optimal fair predictor for an ensemble of models. Through experimental results, we demonstrate the efficacy of our post-processing framework, called FairSISA.
Open Access
Conference paper
Socially Responsible Language Modelling Research
Luca Cuccovillo; Milica Gerhardt; Patrick Aichroth;
Fraunhofer IDMT;
In this study we propose a novel approach to audio phylogeny, i.e. the detection of relationships and transformations within a set of near-duplicate audio items, by leveraging a deep neural network for efficiency and extensibility. Unlike existing methods, our approach detects transformations between nodes in one step, and the transformation set can be expanded by retraining the neural network without excessive computational costs. We evaluated our method against the state of the art using a self-created and publicly released dataset, observing a superior performance in reconstructing phylogenetic trees and heightened transformation detection accuracy. Moreover, the ability to detect a wide range of transformations and to extend the transformation set make the approach suitable for various applications.
Open Access
Conference paper
IEEE International Workshop on Information Forensics and Security
Christina Katsini; George E. Raptis; Vasilis Theodorou;
Human Opsis
The News and Media landscape has undergone significant transformations in recent years, driven by the rise of new technologies and the widespread use of social media. This evolution introduces unique challenges for professionals working within this environment (e.g., journalists, content creators, and news authors), a major one being the efficient sourcing of images that complement article content. In response to this challenge, we developed VIREO, a tool that recommends images based on textual content. In this paper, we take a step towards assessing the practical effectiveness of VIREO’s core models in recommending images for real-world articles, with a specific focus on image recommendation efficiency. Our results indicate that VIREO offers a promising solution for professionals seeking to meet the evolving demands of the News and Media landscape while maintaining content quality and engagement.
Open Access
Conference paper
International Conference on Computer and Applications
Angelo Canale; Fabrizio Falchi; Giovanni Benelli; Giuseppe Amato; Luca Ciampi; Luca Incrocci; Stefano Chessa; Valeria Zeni;
ISTI-CNR; University of Pisa
Integrated Pest Management (IPM) is an essential approach used in smart agriculture to manage pest populations and sustainably optimize crop production. One of the cornerstones underlying IPM solutions is pest monitoring, a practice often performed by farm owners by using chromotropic sticky traps placed on insect hot spots to gauge pest population densities. In this paper, we propose a modular model-agnostic deep learning-based counting pipeline for estimating the number of insects present in pictures of chromotropic sticky traps, thus reducing the need for manual trap inspections and minimizing human effort. Additionally, our solution generates a set of raw positions of the counted insects and confidence scores expressing their reliability, allowing practitioners to filter out unreliable predictions. We train and assess our technique by exploiting PST – Pest Sticky Traps, a new collection of dot-annotated images we created on purpose and publicly release, suitable for counting whiteflies. Experimental evaluation shows that our proposed counting strategy can be a valuable Artificial Intelligence-based tool to help farm owners control pest outbreaks and prevent crop damage effectively. Specifically, our solution achieves an average counting error of approximately 9% with respect to human performance while requiring only a matter of seconds, a large improvement over the time-intensive process of manual inspection, which often takes hours or even days.
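As a minimal, hypothetical example of the confidence-filtering step mentioned above (the function name, data layout and threshold are assumptions, not the authors' API), low-confidence insect positions could be discarded before counting as follows:

```python
def filter_detections(positions, scores, min_confidence=0.5):
    """Keep only predicted insect positions whose confidence passes a threshold,
    letting practitioners trade completeness for reliability before counting."""
    kept = [(p, s) for p, s in zip(positions, scores) if s >= min_confidence]
    return [p for p, _ in kept], len(kept)

# toy usage: three candidate detections, one below the threshold
positions, count = filter_detections([(10, 22), (48, 90), (51, 12)], [0.9, 0.4, 0.7])
print(positions, count)   # [(10, 22), (51, 12)] 2
```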
Open Access
Journal article
Ecological Informatics
Hervé Le Borgne; Michel Crucianu; Nicolas Audebert; Perla Doubinsky
CEA; Conservatoire National des Arts et Métiers;
With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performance. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires generating images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images, thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent, well-performing few-shot counting models on FSC147 and CARPK.
Open Access
Conference paper
Winter Conference on Applications of Computer Vision
Alberto Messina; Stefano Scotta;
RAI;
In this work, we present an example of how a relatively small Large Language Model (LLM), fine-tuned to perform a simple and well-defined task (assigning titles to news articles), can perform similarly to or even better than huge LLMs that are created to respond to any question. This approach of specializing smaller LLMs on simpler tasks is also interesting because it goes in the direction of making this technology more sustainable and available to a larger number of entities that usually could not use these expensive models, both for economic and data-policy reasons. We also present a couple of examples of how the performance of LLMs can be evaluated when the task is specified as in the example presented in this work.
Open Access
Conference paper
International Conference of the Italian Association for Artificial Intelligence
Marius Gavrilescu;
Technical University of Iasi
The identification of important structures from volume data is a challenging problem in information visualization due to the complexity and amount of detail found in volume data sets. In particular, medical imaging devices generate scans which contain a significant amount of important anatomical structures, some of which are hidden, occluded or otherwise difficult to highlight. Conventional density and gradient-based classification methods fail to uncover such structures, thereby creating the necessity for more elaborate visualization methods and the involvement of multiple visual criteria in order to generate quality representations of the volume data. We propose a volume visualization approach which extends the conventional rendering pipeline by incorporating visibility-based quality criteria into the color and opacity mapping process. Our method uses two stacked transfer functions to handle the visual mappings: one based on the density domain of the data set, and the other on a custom metric which quantifies the visibility of volumetric structures. We show that this arrangement allows the generation of improved representations of meaningful hidden structures from medical CT data, while constituting a reliable means of identifying volumetric details not representable using traditional approaches.
Open Access
Conference paper
E-Health and Bioengineering Conference 2023
Evlampios Apostolidis; Ioannis Patras; Vasileios Mezaris
CERTH; Queen Mary University of London;
In this paper we present our study on the use of attention for explaining video summarization. We build on a recent work that formulates the task, called XAI-SUM, and extend it by: a) taking into account two additional network architectures and b) introducing two novel explanation signals that relate to the entropy and diversity of attention weights. In total, we examine the effectiveness of seven types of explanation, using three state-of-the-art attention-based network architectures (CA-SUM, VASNet, SUM-GDA) and two datasets (SumMe, TVSum) for video summarization. The conducted evaluations show that the inherent attention weights are more suitable for explaining network architectures which integrate mechanisms for estimating attentive diversity (SUM-GDA) and uniqueness (CA-SUM). The explanation of simpler architectures (VASNet) can benefit from taking into account estimates about the strength of the input vectors, while another option is to consider the entropy of attention weights.
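As a hedged sketch of what such explanation signals could look like in code (the exact formulations in the paper may differ), the snippet below computes a per-frame entropy and a simple diversity measure from a row-stochastic frame-to-frame attention matrix:

```python
import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Entropy of each frame's attention distribution over the other frames.
    attn: (num_frames, num_frames) row-stochastic attention matrix."""
    p = attn.clamp_min(eps)
    return -(p * p.log()).sum(dim=-1)

def attention_diversity(attn: torch.Tensor) -> torch.Tensor:
    """Diversity signal: one minus the mean cosine similarity between a frame's
    attention row and the attention rows of all other frames."""
    rows = torch.nn.functional.normalize(attn, dim=-1)
    sim = rows @ rows.T
    return 1.0 - (sim.sum(dim=-1) - 1.0) / (attn.shape[0] - 1)

attn = torch.softmax(torch.randn(8, 8), dim=-1)   # toy attention over 8 frames
print(attention_entropy(attn))
print(attention_diversity(attn))
```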
Open Access
Conference paper
ACM Multimedia
Artem Yaroshchuk; Christoforos Papastergiopoulos; Dimitrios Tzovaras; Konstantinos Votis; Luca Cuccovillo; Patrick Aichroth;
CERTH; Fraunhofer IDMT;
This paper introduces a multilingual, multispeaker dataset composed of synthetic and natural speech, designed to foster research and benchmarking in synthetic speech detection. The dataset encompasses 18,993 audio utterances synthesized from text, alongside their corresponding natural equivalents, representing approximately 17 hours of synthetic audio data. The dataset features synthetic speech generated by 156 voices spanning three languages, namely English, German, and Spanish, with a balanced gender representation. It targets state-of-the-art synthesis methods, and has been released with a license allowing seamless extension and redistribution by the research community.
Open Access
Conference paper
IEEE International Workshop on Information Forensics and Security
Claudio Gennaro; Fabio Carrara; Giuseppe Amato; Lucia Vadicamo;
ISTI-CNR;
The rapid development of deep learning and artificial intelligence has transformed our approach to solving scientific problems across various domains, including computer vision, natural language processing, and automatic content generation. Information retrieval (IR) has also experienced significant advancements, with natural language understanding and multimodal content analysis enabling accurate information retrieval. However, the widespread adoption of neural networks has also influenced the focus of IR problem-solving, which nowadays predominantly relies on evaluating the similarity of dense vectors derived from the latent spaces of deep neural networks. Nevertheless, the challenges of conducting similarity searches on large-scale databases with billions of vectors persist. Traditional IR approaches use inverted indices and vector space models, which work well with sparse vectors. In this paper, we propose Vec2Doc, a novel method that converts dense vectors into sparse integer vectors, allowing for the use of inverted indices. Preliminary experimental evaluation shows a promising solution for large-scale vector-based IR problems.
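The following is a toy illustration of the general dense-to-sparse idea, not the Vec2Doc algorithm itself: each vector dimension is mapped to signed pseudo-terms with integer weights so that a conventional inverted index could store and score it. The scaling factor, top-k sparsification and term naming are arbitrary assumptions.

```python
import numpy as np

def dense_to_sparse_terms(vec: np.ndarray, scale: float = 10.0, top_k: int = 32):
    """Turn a dense float vector into a sparse {term: integer weight} mapping that a
    standard inverted index can store. Each dimension becomes two pseudo-terms
    ('dN+' / 'dN-') so negative components survive the non-negative quantization."""
    terms = {}
    for i, v in enumerate(vec):
        count = int(round(abs(v) * scale))
        if count > 0:
            terms[f"d{i}{'+' if v >= 0 else '-'}"] = count
    # keep only the strongest components to enforce sparsity
    return dict(sorted(terms.items(), key=lambda kv: -kv[1])[:top_k])

print(dense_to_sparse_terms(np.random.randn(512).astype(np.float32)))
```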
Open Access
Conference paper
International Conference on Similarity Search and Applications
Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris
CERTH; Queen Mary University of London;
This chapter focuses on explainable video summarization, a technology that could significantly advance the content production workflow of Media organizations. It starts by presenting the current state of the art in the fields of deep-learning-based video summarization and explainable video analysis and understanding. Following, it focuses on video summarization methods that rely on the use of attention mechanisms and reports on previous works that investigated the use of attention for explaining the outcomes of deep neural networks. Subsequently, it briefly describes a state-of-the-art attention-based architecture for unsupervised video summarization and discusses a recent work that examines the use of various attention-based signals for explaining the outcomes of video summarization. Finally, it provides recommendations about future research directions.
Open Access
Book section
Encyclopedia of Information Science and Technology
Fabrizio Falchi; Jan Sedmidubsky; Nicola Messina; Tomás Rebok;
ISTI-CNR; Masaryk University;
Open Access
Conference paper
N/A
Florin Leon; Marius Gavrilescu; Sabina-Adriana Floria;
Technical University of Iasi
Representing relevant information from volume data sets is a problem often faced in visualization. Generating meaningful images from highly-complex volume data sets is a challenging, tedious task requiring specialized knowledge of the distribution and properties of the data. Traditionally, this task has been carried out manually via specialized user interfaces. We propose a volume visualization pipeline which facilitates the automatic generation of high-quality images from volume data sets. Our method involves a direct volume renderer which generates images from volume data based on visual mappings provided by a transfer function. Central to our approach is a quality-focused descriptor which exploits the properties of the distribution of gradient orientations of an alpha-bounded surface within the volume. This feature is useful for determining transfer functions that result in the rendering of corresponding images depicting various details from the volume. We show that by using this feature as an optimization objective, the generation of high quality images can be automated. Using simple genetic algorithms, we can automatically generate sets of images illustrating coherent, easily-distinguishable and high-quality surfaces of relevant structures from volume data.
Open Access
Conference paper
International Conference on System Theory
Anastasios Gkagkas; Davide Alessandro Coccomini; Gylfi Þór Guðmundsson; Jakub Lokoč; Jiaxin Wu; Nick Pantelidis; Nicola Messina; Rahel Arnold; Silvan Heller; Vera Benz; Werner Bailer;
CERTH; Charles University; City University Hong Kong; ISTI-CNR; Joanneum Research; Reykjavik University; University of Basel;
Different task interpretations are a highly undesired element in interactive video retrieval evaluations. When a participating team focuses partially on a wrong goal, the evaluation results might become partially misleading. In this paper, we propose a process for refining known-item and open-set type queries, and preparing the assessors that judge the correctness of submissions to open-set queries. Our findings from recent years reveal that a proper methodology can lead to objective query quality improvements and subjective participant satisfaction with query clarity.
Open Access
Conference paper
Conference on Multimedia Retrieval
David Renaudie; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas; Matthew Barthet; Mike Thomsen;
Massive Entertainment (Ubisoft); University of Malta
This paper introduces a large scale multimodal corpus collected for the purpose of analysing and predicting player engagement in commercial-standard games. The corpus is solicited from 25 players of the action role-playing game Tom Clancy’s The Division 2, who annotated their level of engagement using a time-continuous annotation tool. The cleaned and processed corpus presented in this paper consists of nearly 20 hours of annotated gameplay videos accompanied by logged gamepad actions. We report preliminary results on predicting long-term player engagement based on in-game footage and game controller actions using Convolutional Neural Network architectures. Results obtained suggest we can predict the player engagement with up to accuracy on average ( at best) when we fuse information from the game footage and the player’s controller input. Our findings validate the hypothesis that long-term (i.e. 1 hour of play) engagement can be predicted efficiently solely from pixels and gamepad actions.
Open Access
Paper
Conference on Multimodal Interaction
Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris
CERTH; Queen Mary University of London;
This paper presents a new reinforcement-based method for video thumbnail selection (called RL-DiVTS) that relies on estimates of the aesthetic quality, representativeness and visual diversity of a small set of selected frames, made with the help of tailored reward functions. The proposed method integrates a novel diversity-aware Frame Picking mechanism that performs a sequential frame selection and applies a reweighting process to demote frames that are visually similar to the already selected ones. Experiments on two benchmark datasets (OVP and YouTube), using the top-3 matching evaluation protocol, show the competitiveness of RL-DiVTS against other state-of-the-art video thumbnail selection and summarization approaches from the literature.
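A minimal sketch of a diversity-aware sequential frame picker in the spirit described above (not the authors' implementation; the reward functions and exact reweighting rule are omitted): after each pick, the remaining frames are demoted in proportion to their visual similarity to the frames already selected.

```python
import torch

def pick_diverse_frames(scores: torch.Tensor, feats: torch.Tensor, k: int = 3) -> list:
    """Sequentially pick k frames: at each step take the highest-scoring frame, then
    down-weight remaining frames in proportion to their similarity to the picks."""
    feats = torch.nn.functional.normalize(feats, dim=-1)
    scores = scores.clone()
    picked = []
    for _ in range(k):
        idx = int(torch.argmax(scores))
        picked.append(idx)
        sim = (feats @ feats[idx]).clamp(min=0.0)   # cosine similarity to the new pick
        scores = scores * (1.0 - sim)               # demote visually similar frames
        scores[picked] = float("-inf")              # never re-pick a selected frame
    return picked

scores = torch.rand(20)          # toy per-frame importance scores
feats = torch.randn(20, 256)     # toy per-frame visual features
print(pick_diverse_frames(scores, feats, k=3))
```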
Open Access
Paper
IEEE International Conference on Image Processing
Cristian-Nicolae Butincu; Florin Leon; Lavinia-Eugenia Ferariu; Marius Gavrilescu;
Technical University of Iasi
This report describes our research and documentation efforts in searching and analyzing the related literature for existing applications of evolutionary algorithms for quality-oriented optimization. We present our findings in terms of multiple relevant results from the related state of the art. We mainly divide the results into two broad categories: classic single- and multi-objective optimization, and quality-diversity (QD) methods. While we mostly focus on evolutionary optimization applied in visualization and image-processing, we also present some results from other fields which we considered relevant. This report was originally submitted as documentation for the deliverables of the VolEvol project.
Open Access
Report
N/A
Ali Najm; Antonios Liapis; Despina Michael-Grigoriou; Emmanouil Xylakis; Georgios N. Yannakakis;
Cyprus University of Technology; University of Malta
Open Access
Conference paper
N/A
Ioanna Valsamara; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Andreas Sochopoulos; Evangelos Charalampakis; Ioannis Mademlis; Ioannis Pitas; Sotirios Papadopoulos
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Dimitrios Papaioannou; Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Anestis Kaimakamadis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Ioanna Valsamara; Ioannis Mademlis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Emmanouil Krasanakis; Ioanna Valsamara; Ioannis Mademlis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Hervé Le Borgne; Michel Crucianu; Nicolas Audebert; Perla Doubinsky
CNAM; Université Paris-Saclay;
The latent space of GANs contains rich semantics reflecting the training data. Different methods propose to learn edits in latent space corresponding to semantic attributes, thus allowing to modify generated images. Most supervised methods rely on the guidance of classifiers to produce such edits. However, classifiers can lead to out-of-distribution regions and be fooled by adversarial samples. We propose an alternative formulation based on the Wasserstein loss that avoids such problems, while maintaining performance on-par with classifier-based approaches. We demonstrate the effectiveness of our method on two datasets (digits and faces) using StyleGAN2.
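As a small, generic illustration of the kind of distribution-matching objective mentioned above (not the paper's full formulation, which operates on GAN latent edits), the 1-D Wasserstein-1 distance between two equal-size sample sets can be computed by sorting and comparing:

```python
import torch

def wasserstein_1d(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Wasserstein-1 distance between two 1-D empirical distributions with the same
    number of samples: sort both and average the absolute differences."""
    return (torch.sort(x).values - torch.sort(y).values).abs().mean()

# toy usage: attribute scores of edited samples vs. a target distribution
edited = torch.randn(256) * 0.8 + 0.5
target = torch.randn(256)
print(wasserstein_1d(edited, target))
```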
Open Access
Conference paper
N/A
Alejandro Moreo; Fabrizio Sebastiani; Mirko Bunse; Pablo González
Consiglio Nazionale delle Ricerche; University of Oviedo
Open Access
Book
N/A
Florin Leon; Marius Gavrilescu;
Technical University of Iasi
The study of hurricanes through information visualization and visual analysis is useful for tracking and understanding the behavior and impact of such hazardous natural phenomena. Images obtained from data commonly acquired through meteorological radar provide scientists with a visual representation of the storm’s characteristics, such as its location, size, and intensity. Such information is useful for forecasting, decision making in disaster management and environmental and human health risk assessment. Visual representations of such phenomena can help emergency responders and policymakers make informed decisions about evacuations, disaster response, and resource allocation. In this context, we propose an automated means of generating representations from complex 3D datasets obtained from meteorological radar scans of regions affected by hurricanes, illustrating the geometry and spatial features of such phenomena.
Open Access
Conference paper
International Conference on Environmental Engineering and Management
Ioannis Mademlis; Ioannis Pitas; Michail Kaseris
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Axel Roebel; Lenny Renault; Rémi Mignot;
Sorbonne Université
Open Access
Journal article
Journal of the Audio Engineering Society
Hannes Fassold;
Joanneum Research;
Manifold learning is an emerging research domain of machine learning. In this work, we give an introduction to manifold learning and how it is employed for important application fields in multimedia.
Open Access
Conference paper
Conference on Video and Signal Processing
Claudio Gennaro; Fabrizio Falchi; Gaetano Emanuele Valenti; Giuseppe Amato; Luca Ciampi; Nicola Messina;
ISTI-CNR; University of Pisa
Open Access
Conference paper
Conference on Image Analysis and Processing
Antonino Furnari; Claudio Gennaro; Fabrizio Falchi; Giovanni Maria Farinella; Nicola Messina;
ISTI-CNR; University of Catania;
Open Access
Journal article
Conference on Image Analysis and Processing
Axel Roebel; Lenny Renault; Rémi Mignot;
Sorbonne Université
Open Access
Conference paper
International Conference on Digital Audio Effects
Bruno Lepri; Linchao Bao; Marco de Nadai; Nicu Sebe; Yahui Liu; Yajing Chen;
FBK; Tencent AI Lab; University of Trento;
Closed Access
Journal article
IEEE Transactions on Multimedia
Marius Gavrilescu;
Technical University of Iasi
Objective quality assessment in volume visualization is a crucial process aimed at quantifying the quality of rendered volumetric images or animations using measurable metrics and algorithms. This approach is essential to ensure that the visualizations accurately represent the underlying data and meet specific quality standards. The assessment of quality in computer graphics, visualization and image processing is a complex task, particularly due to the number of scenarios, use cases and problems encountered in the aforementioned fields, and also due to the subjective nature of quality. To this end, we search for methods, algorithms and metrics that can be used by an optimizer to search for rendering parameters such that the resulting images adhere to our formulations on what constitutes quality. At the same time, similar metrics can be exploited such that the space of possible parameters can be more thoroughly explored, resulting in populations of images exhibiting diverse content. This document presents our findings in terms of approaches that constitute good candidates for quality and diversity criteria, to be used as objectives and/or for defining feature spaces when automatically generating images from volume data. This report was originally submitted as documentation for the deliverables of the VolEvol project.
Open Access
Report
N/A
Antonios Liapis; Chintan Trivedi; Emmanouil Xylakis; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas; Matthew Barthet
University of Malta
Open Access
Conference paper
Conference on Affective Computing and Intelligent Interaction Workshops and Demos
Christina Katsini; George E. Raptis; Vasilis Theodorou;
Human Opsis
In a fast-changing media ecosystem, professionals and enterprises in the News and Media industry face new challenges that they should address to maximize their productivity and improve their services. The rise of alternative news sources such as social media, now the leading news source especially for young people, has led to emerging requirements in the News and Media industry. A core requirement is publishing articles as fast as possible on various platforms, combining visual and textual content. Accompanying news with images raises readers’ interest and improves engagement and recall. Therefore, News and Media industry professionals must adapt their publication strategies to meet this requirement and the media consumers’ expectations. However, the selection of appropriate images is a time-consuming, manual task. In this direction, we propose VIREO, which addresses this challenge by providing professionals (e.g., journalists) with an integrated digital solution that automatically recommends a collection of images that could accompany an article. To achieve this, VIREO implements text and image analysis and matching processes leveraging AI techniques in real time. VIREO aims to benefit both professionals (e.g., journalists), by suggesting appealing images that accompany the textual content of their articles and help create breathtaking stories, and media consumers (e.g., readers), by delivering an enhanced reading experience, engagement, and recall.
Open Access
Conference paper
Human-Computer Interaction
Ambrish Rawat; Gabriele Picco; Giulio Zizzo; Myles Foley; Taesung Lee; Yufang Hou;
IBM Research; Imperial College London;
The wide applicability and adaptability of large language models (LLMs) have enabled their rapid adoption. While pre-trained models can perform many tasks, such models are often fine-tuned to improve their performance. However, this leads to issues over violation of model licenses, model theft, and copyright infringement. Moreover, recent advances show that generative technology is capable of producing harmful content, which exacerbates the problems of accountability within model supply chains. Thus, we need a method to investigate how a model was trained or a piece of text was generated, and what their source pre-trained model was. In this paper we take a first step towards addressing this open problem by tracing back the origin of a given fine-tuned LLM to its corresponding pre-trained base model. We consider different knowledge levels and attribution strategies, and find that we are able to trace back to the original base model with an AUC of 0.804.
Open Access
Conference paper
N/A
Ioannis Patras; Zengqun Zhao;
Queen Mary University of London;
Open Access
Conference paper
N/A
Aaron Duane; Cathal Gurrin; Florian Spiess; Jakub Lokoč; Klaus Schoeffmann; Konstantin Schall; Ladislav Peška; Loris Sauter; Luca Rossetto; Lucia Vadicamo; Nicola Messina; Omar Shahbaz Khan; Stefanos Vrochidis; Stelios Andreadis; Thao-Nhu Nguyen; Werner Bailer; Zhixin Ma;
CERTH; Charles University; Dublin City University; HTW Berlin; ISTI-CNR; IT University of Copenhagen; Joanneum Research; Klagenfurt University; Singapore Management University; University of Basel; University of Copenhagen; University of Zurich;
This paper presents the findings of the eleventh Video Browser Showdown competition, where sixteen teams competed in known-item and ad-hoc search tasks. Many of the teams utilized state-of-the-art video retrieval approaches that demonstrated high effectiveness in challenging search scenarios. In the paper, a broad survey of all utilized approaches is presented in connection with an analysis of the performance of participating teams. Specifically, both high-level performance indicators with overall statistics and an in-depth analysis of the performance of selected tools implementing result set logging are presented. The analysis reveals evidence that the CLIP model represents a versatile tool for cross-modal video retrieval when combined with interactive search capabilities. Furthermore, the analysis investigates the effect of different users and text query properties on the performance in search tasks. Last but not least, lessons learned from search task preparation are presented, and a new direction for ad-hoc search tasks at the Video Browser Showdown is introduced.
Open Access
Journal article
N/A
Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki;
This work examines the problem of increasing the robustness of deep neural network-based image classification systems to adversarial attacks, without changing the neural architecture or employing adversarial examples in the learning process. We attribute the well-known lack of robustness of such systems to the geometric properties of the deep neural network embedding space, derived from standard optimization options, which allow minor changes in the intermediate activation values to trigger dramatic changes to the decision values in the final layer. To counteract this effect, we explore optimization criteria that supervise the distribution of the intermediate embedding spaces, on a class-specific basis, by introducing and leveraging one-class classification objectives. The proposed learning procedure compares favorably to recently proposed training schemes for adversarial robustness in black-box adversarial attack settings.
Open Access
Conference paper
N/A
Alexandros Zamichos; Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki;
Adversarial attacks in image classification are optimization problems that estimate the minimum perturbation required for a single input image, so the neural network misclassifies it. Universal adversarial perturbations are adversarial attacks that target a whole dataset, estimated by e.g., accumulating the perturbations for each image using standard adversarial attacks. This work treats the universal adversarial perturbation as a problem of transformation estimation. As such, we propose to learn an iterative transformation that maps “clean” images to a “perturbed” domain, by exploiting adversarial attacks. Our experiments show that the proposed formulation leads to easy generation of the adversarial perturbation, while it introduces less noise in the perturbed images, when compared to the state-of-the-art. Finally, this formulation allows us to explore additional properties, notably reversibility of the transformation and attainability of the transformation by using dataset samples.
Open Access
Conference paper
N/A
Ioannis Pitas; Stefania Altini; Vasileios Mygdalis
Aristotle University of Thessaloniki;
Different adversarial attack methods have been proposed in the literature, mainly focusing on attack efficiency and visual quality, e.g., similarity with the non-adversarial examples. These properties enable the use of adversarial attacks for privacy protection against automated classification systems, while maintaining utility for human users. In this paradigm, when privacy restrictions are lifted, access to the original data should be restored for all stakeholders. This paper addresses exactly this problem. Existing adversarial attack methods cannot reconstruct the original data from the adversarial ones, leading to significant storage overhead for all privacy applications. To solve this issue, we propose AdvRevGAN, a novel Neural Network architecture that generates reversible adversarial examples. We evaluate our approach in classification problems, where we examine the case where adversarial attacks are constructed by a neural network, while the original images are reconstructed using the reverse transformation from the adversarial examples. We show that adversarial attacks using this approach maintain and even increase their efficiency, while the classification accuracy of the model on the reconstructed data can be almost fully restored.
Open Access
Conference paper
N/A
Daniel Aláez; Ioannis Pitas; Jesús Villadangos; Vasileios Mygdalis
Aristotle University of Thessaloniki; University of Navarre;
In recent years, the field of automated aerial cinematography has seen a significant increase in demand for real-time 3D target geopositioning for motion and shot planning. To this end, many of the existing cinematography plans require the use of complex sensors that need to be equipped on the subject or rely on external motion systems. This work addresses this problem by combining monocular visual target detection and tracking with a simple ground intersection model. Under the assumption that the targets to be filmed typically stand on the ground, 3D target localization is achieved by estimating the direction and the norm of the look-at vector. The proposed algorithm employs an error estimation model that accounts for the error in detecting the bounding box, the height estimation errors, and the uncertainties of the pitch and yaw angles. This algorithm has been fully implemented in a heavy-lifting aerial cinematography hexacopter, and its performance has been evaluated through experimental flights. Results show that typical errors are within 5 meters of absolute distance and 3 degrees of angular error for distances to the target of around 100 meters.
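As a simplified, assumption-laden sketch of the ground-intersection idea (flat ground plane, known camera position and unit look-at vector in a world frame, and no error modelling), target geopositioning reduces to a ray-plane intersection:

```python
import numpy as np

def ground_intersection(cam_pos, look_dir, ground_z=0.0):
    """Intersect the camera's look-at ray with a flat ground plane z = ground_z.
    cam_pos: camera position (x, y, z); look_dir: unit look-at vector in the world frame.
    Returns the 3D ground point, or None if the ray does not point towards the ground."""
    cam_pos, look_dir = np.asarray(cam_pos, float), np.asarray(look_dir, float)
    if look_dir[2] >= 0:                    # ray points upwards or parallel to the ground
        return None
    t = (ground_z - cam_pos[2]) / look_dir[2]
    return cam_pos + t * look_dir

# toy usage: drone at 100 m altitude looking slightly below the horizon
print(ground_intersection([0.0, 0.0, 100.0], [0.7, 0.0, -0.714]))  # target roughly 98 m ahead
```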
Open Access
Conference paper
N/A
Christos Tzelepis; Georgios Tzimiropoulos; Ioannis Patras; Stella Bounareli; Vasileios Argyriou;
Kingston University London; Queen Mary University of London;
Open Access
Conference paper
N/A
Adrian Popescu; Armelle Brun; Evan Dufraisse; Jérôme Deshayes-Chossart; Julien Tourille;
Université de Lorraine; Université Paris-Saclay;
Target-dependent sentiment classification (TSC) enables a fine-grained automatic analysis of sentiments expressed in texts. Sentiment expression varies depending on the domain, and it is necessary to create domain-specific datasets. While socially important, TSC in the news domain remains relatively understudied. We introduce MAD-TSC, the first multilingual aligned dataset designed for TSC in news. MAD-TSC differs substantially from existing resources. First, it includes aligned examples in eight languages to facilitate a comparison of performance for individual languages, and a direct comparison of human and machine translation. Second, the dataset is sampled from a diversified parallel news corpus, and is diversified in terms of news sources and geographic spread of entities. Finally, MAD-TSC is more challenging than existing datasets because its samples are more complex. We exemplify the use of MAD-TSC with comprehensive monolingual and multilingual experiments. The latter show that machine translations can successfully replace manual ones, and that performance for all included languages can match that of English by automatically translating test examples.
Open Access
Conference paper
Conference on Computational Linguistics
Daniele Ugo Leonzio; Luca Cuccovillo; Marco Marcon; Paolo Bolettieri; Patrick Aichroth; Stefano Tubaro;
Fraunhofer IDMT; Politecnico di Milano;
In recent years, the multimedia forensic community has put great effort into developing solutions to assess the integrity and authenticity of multimedia objects, focusing especially on manipulations applied by means of advanced deep learning techniques. However, in addition to complex forgeries such as deepfakes, very simple yet effective manipulation techniques not involving any use of state-of-the-art editing tools still exist and prove dangerous. This is the case of audio splicing for speech signals, i.e., concatenating and combining multiple speech segments obtained from different recordings of a person in order to compose a new fake speech. Indeed, by simply adding a few words to an existing speech we can completely alter its meaning. In this work, we address the overlooked problem of detection and localization of audio splicing from different models of acquisition devices. Our goal is to determine whether an audio track under analysis is pristine, or whether it has been manipulated by splicing one or multiple segments obtained from different device models. Moreover, if a recording is detected as spliced, we identify where the modification has been introduced in the temporal dimension. The proposed method is based on a Convolutional Neural Network (CNN) that extracts model-specific features from the audio recording. After extracting the features, we determine whether there has been a manipulation through a clustering algorithm. Finally, we identify the point where the modification has been introduced through a distance-measuring technique. The proposed method allows the detection and localization of multiple splicing points within a recording.
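The sketch below is only a crude stand-in for the clustering and distance-measuring steps described above: it flags candidate splicing points wherever consecutive window embeddings (e.g., device-model features extracted by a CNN) jump unusually far apart. The threshold, embedding dimensionality and window layout are hypothetical.

```python
import numpy as np

def splice_points_from_embeddings(embs: np.ndarray, threshold: float = 1.5):
    """Flag candidate splicing points where consecutive window embeddings are
    unusually far apart relative to the typical step distance (z-score test)."""
    steps = np.linalg.norm(np.diff(embs, axis=0), axis=1)   # distance between consecutive windows
    z = (steps - steps.mean()) / (steps.std() + 1e-8)
    return np.where(z > threshold)[0] + 1   # index of the window right after the jump

# toy usage: two halves of a recording with clearly different device-model embeddings
embs = np.vstack([np.random.randn(50, 64), 5 + np.random.randn(50, 64)])
print(splice_points_from_embeddings(embs))   # reports a jump near window index 50
```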
Open Access
Journal article
Multimedia FORensics in the WILD
Alberto Del Bimbo; Federico Becattini; Lorenzo Seidenari; Luca Cultrera;
University of Florence;
Autonomous driving is advancing at a fast pace, with driving algorithms becoming more and more accurate and reliable. Despite this, it is of the utmost importance to develop models that can offer a certain degree of explainability in order to be trusted, understood and accepted by researchers and, especially, society. In this work we present a conditional imitation learning agent based on a visual attention mechanism in order to provide visually explainable decisions by design. We propose different variations of the method, relying on end-to-end trainable region proposal functions that generate regions of interest to be weighted by an attention module. We show that visual attention can improve driving capabilities and at the same time provide explainable decisions.
Open Access
Journal article
N/A
Nicu Sebe; Wei Wang; Yue Song
Beijing Jiaotong University; University of Trento;
Closed Access
Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence
Fang Li; Jing Wang; Jun Zhang; Wengjing Li; Zhongcheng Wu; Zhun Zhong
Chinese Academy of Sciences; University of Trento;
Closed Access
Journal article
IEEE Transactions on Intelligent Transportation Systems
Andy Keller; Max Welling; Nicu Sebe; Yue Song
University of Amsterdam; University of Trento;
Open Access
Conference paper
International Conference on Machine Learning
Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos
CERTH;
Open Access
Journal article
N/A
Bin Ren; Hao Tang; Nicu Sebe; Wei Wang; Xia Li; Yiming Wang;
Beijing Jiaotong University; ETH Zurich; FBK; University of Trento;
For semantic-guided cross-view image translation, it is crucial to learn where to sample pixels from the source view image and where to reallocate them guided by the target view semantic map, especially when there is little overlap or drastic view difference between the source and target images. Hence, one not only needs to encode the long-range dependencies among pixels in both the source view image and target view semantic map but also needs to translate these learned dependencies. To this end, we propose a novel generative adversarial network, PI-Trans, which mainly consists of a novel Parallel-ConvMLP module and an Implicit Transformation module at multiple semantic levels. Extensive experimental results show that PI-Trans achieves the best qualitative and quantitative performance by a large margin compared to the state-of-the-art methods on two challenging datasets. The source code is available at https://github.com/Amazingren/PI-Trans.
Open Access
Conference paper
N/A
Chen Feng; Ioannis Patras;
Queen Mary University of London;
Deep learning has achieved great success in recent years with the aid of advanced neural network structures and large-scale human-annotated datasets. However, it is often costly and difficult to accurately and efficiently annotate large-scale datasets, especially for some specialized domains where fine-grained labels are required. In this setting, coarse labels are much easier to acquire as they do not require expert knowledge. In this work, we propose a contrastive learning method, called masked contrastive learning (MaskCon), to address the under-explored problem setting where we learn with a coarse-labelled dataset in order to address a finer labelling problem. More specifically, within the contrastive learning framework, for each sample our method generates soft-labels with the aid of coarse labels against other samples and another augmented view of the sample in question. In contrast to self-supervised contrastive learning, where only the sample’s augmentations are considered hard positives, and to supervised contrastive learning, where only samples with the same coarse labels are considered hard positives, we propose soft labels based on sample distances that are masked by the coarse labels. This allows us to utilize both inter-sample relations and coarse labels. We demonstrate that our method can obtain as special cases many existing state-of-the-art works and that it provides tighter bounds on the generalization error. Experimentally, our method achieves significant improvement over the current state-of-the-art in various datasets, including CIFAR10, CIFAR100, ImageNet-1K, Stanford Online Products and Stanford Cars196 datasets. Code and annotations are available at https://github.com/MrChenFeng/MaskCon_CVPR2023.
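As a rough sketch of the coarse-label masking idea (not the exact MaskCon formulation), the snippet below builds soft contrastive targets from query-key similarities and zeroes out keys whose coarse label differs from the query's:

```python
import torch
import torch.nn.functional as F

def coarse_masked_soft_labels(query: torch.Tensor, keys: torch.Tensor,
                              q_coarse: int, k_coarse: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """Soft contrastive targets for a query: similarity to each key, normalized with a
    softmax, restricted (masked) to keys sharing the query's coarse label.
    Assumes at least one key carries that coarse label."""
    q = F.normalize(query, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = (k @ q) / temperature                     # (num_keys,)
    mask = (k_coarse == q_coarse)
    logits = logits.masked_fill(~mask, float("-inf"))  # keys from other coarse classes get zero weight
    return torch.softmax(logits, dim=0)

# toy usage: 512 keys from 10 coarse classes, query belongs to coarse class 3
targets = coarse_masked_soft_labels(torch.randn(128), torch.randn(512, 128),
                                    q_coarse=3, k_coarse=torch.randint(0, 10, (512,)))
print(targets.sum())   # ~1.0, with all mass on keys of coarse class 3
```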
Open Access
Conference paper
N/A
Christos Tzelepis; Giorgios Kordopatis-Zilos; Giorgios Tolias; Ioannis Kompatsiaris; Ioannis Patras; Symeon Papadopoulos
CERTH; Czech Technical University in Prague; Queen Mary University of London;
We introduce S2VS, a video similarity learning approach with self-supervision. Self-Supervised Learning (SSL) is typically used to train deep models on a proxy task so as to have strong transferability on target tasks after fine-tuning. Here, in contrast to prior work, SSL is used to perform video similarity learning and address multiple retrieval and detection tasks at once with no use of labeled data. This is achieved by learning via instance-discrimination with task-tailored augmentations and the widely used InfoNCE loss together with an additional loss operating jointly on self-similarity and hard-negative similarity. We benchmark our method on tasks where video relevance is defined with varying granularity, ranging from video copies to videos depicting the same incident or event. We learn a single universal model that achieves state-of-the-art performance on all tasks, surpassing previously proposed methods that use labeled data. The code and pretrained models are publicly available at: https://github.com/gkordo/s2vs.
Open Access
Conference paper
N/A
Chen Feng; Ioannis Patras;
Queen Mary University of London;
Open Access
Conference paper
N/A
Fabrizio Falchi; Jan Sedmidubsky; Nicola Messina; Tomás Rebok;
ISTI-CNR; Masaryk University;
Open Access
Conference paper
N/A
Giorgios Kordopatis-Zilos; Ioannis Kompatsiaris; Pantelis Dogoulis; Symeon Papadopoulos
CERTH;
New advancements for the detection of synthetic images are critical for fighting disinformation, as the capabilities of generative AI models continuously evolve and can lead to hyper-realistic synthetic imagery at unprecedented scale and speed. In this paper, we focus on the challenge of generalizing across different concept classes, e.g., when training a detector on human faces and testing on synthetic animal images — highlighting the ineffectiveness of existing approaches that randomly sample generated images to train their models. By contrast, we propose an approach based on the premise that the robustness of the detector can be enhanced by training it on realistic synthetic images that are selected based on their quality scores according to a probabilistic quality estimation model. We demonstrate the effectiveness of the proposed approach by conducting experiments with generated images from two seminal architectures, StyleGAN2 and Latent Diffusion, and using three different concepts for each, so as to measure the cross-concept generalization ability. Our results show that our quality-based sampling method leads to higher detection performance for nearly all concepts, improving the overall effectiveness of the synthetic image detectors.
Open Access
Conference paper
N/A
Adrian Popescu; Bogdan Ionescu; Giorgos Kordopatis-Zilos; Luca Cuccovillo; Symeon Papadopoulos
CERTH; Czech Technical University in Prague; Fraunhofer IDMT; Université Paris-Saclay; University Politehnica of Bucharest
With recent advancements in synthetic media manipulation and generation, verifying multimedia content posted online has become increasingly difficult. Additionally, the malicious exploitation of AI technologies by actors to disseminate disinformation on social media, and more generally the Web, at an alarming pace poses significant threats to society and democracy. Therefore, the development of AI-powered tools that facilitate media verification is urgently needed. The MAD ’23 workshop aims to bring together individuals working on the wider topic of detecting disinformation in multimedia to exchange their experiences and discuss innovative ideas, attracting people with varying backgrounds and expertise. The research areas of interest include identifying manipulated and synthetic content in multimedia, as well as examining the dissemination of disinformation and its impact on society. The multimedia aspect is very important since content most often contains a mix of modalities and their joint analysis can boost the performance of verification methods.
Open Access
Conference paper
Conference on Multimedia Retrieval
Adrian Popescu; Bogdan Ionescu; Giorgos Kordopatis-Zilos; Luca Cuccovillo; Symeon Papadopoulos
CERTH; Czech Technical University in Prague; Fraunhofer IDMT; Université Paris-Saclay; University Politehnica of Bucharest
Front matter of the proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation, held in Thessaloniki (Greece) on June 12th, 2023. The full proceedings are available online at https://doi.org/10.1145/3591106.
Open Access
Book section
International Workshop on Multimedia AI against Disinformation
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
VISIONE is a large-scale video retrieval system that integrates multiple search functionalities, including free text search, spatial color and object search, visual and semantic similarity search, and temporal search. The system leverages cutting-edge AI technology for visual analysis and advanced indexing techniques to ensure scalability. As demonstrated by its runner-up position in the 2023 Video Browser Showdown competition, VISIONE effectively integrates these capabilities to provide a comprehensive video retrieval solution. A system demo is available online, showcasing its capabilities on over 2300 hours of diverse video content (V3C1+V3C2 dataset) and 12 hours of highly redundant content (Marine dataset). The demo can be accessed at https://visione.isti.cnr.it.
Open Access
Conference paper
Conference on Multimedia Retrieval
Lucile Sassatelli; Quentin Guimard
Université Côte d'Azur;
Adaptive bitrate (ABR) algorithms are used in streaming media to adjust video or audio quality based on the viewer's network conditions to provide a smooth playback experience. With the rise of virtual reality (VR) headsets, 360° video streaming is growing rapidly and requires efficient ABR strategies that also adapt the video quality to the user's head position. However, research in this field is often difficult to compare due to a lack of reproducible simulations. To address this problem, we provide SMART360, a 360° streaming simulation environment to compare motion prediction and adaptive bitrate strategies. We provide sample inputs and baseline algorithms along with the simulator, as well as examples of results and visualizations that can be obtained with SMART360. The code and data are made publicly available.
Open Access
Conference paper
ACM Multimedia Systems Conference
Juanjuan Weng; Nicu Sebe; Shaozi Li; Zhiming Luo; Zhun Zhong
University of Trento; Xiamen University
Open Access
Journal article
IEEE Transactions on Information Forensics and Security
Hao Tang;
ETH Zurich; Tencent AI Lab; University of Oregon;
Open Access
Conference paper
Computer Vision and Pattern Recognition
Nan Pu; Nicu Sebe; Zhun Zhong
University of Trento;
Open Access
Conference paper
Computer Vision and Pattern Recognition
Bin Ren; Nicu Sebe; Rita Cucchiara; Wei Bi; Wei Wang; Yahui Liu; Yue Song
Beijing Jiaotong University; Tencent AI Lab; University of Modena and Reggio Emilia; University of Trento;
Open Access
Conference paper
Computer Vision and Pattern Recognition
Boyu Wang; Charles Ling; Nicu Sebe; Wei Wang; Weijie Wang; Xi Chen; Zhun Zhong
Huawei Noah's Ark Lab; University of Trento; Western University;
Open Access
Conference paper
Computer Vision and Pattern Recognition
Antonios Liapis; Edoardo Tibuzzi; Georgios N. Yannakakis; Jeg Dudley; Joel Hilmersson; Konstantinos Sfikas;
AKT II; University of Malta
Computer-aided optimization algorithms in structural engineering have historically focused on the structural performance of generated forms, often resulting in the selection of a single ‘optimal’ solution. However, diversity of generated solutions is desirable when those solutions are shown to a human user to choose from. Quality-Diversity (QD) search is an emerging field of Evolutionary Computation which can automate the exploration of the solution space in engineering problems. QD algorithms, such as MAP-Elites, operate by maintaining and expanding an archive of diverse solutions, optimising for quality in local niches of a multidimensional design space. The generated archive of solutions can help engineers gain a better overview of the solution space, illuminating which designs are possible and their trade-offs. In this paper we apply Quality Diversity search to the problem of designing shell structures. Since the design of shell structures comes with physical constraints, we leverage a constrained optimization variant of the MAP-Elites algorithm, FI-MAP-Elites. We implement our proposed methodology within the Rhino/Grasshopper environment and use the Karamba Finite Element Analysis solver for all structural engineering calculations. We test our method on case studies of parametric models of shell structures that feature varying complexity. Our experiments investigate the algorithm’s ability to illuminate the solution space and generate feasible and high-quality solutions.
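For readers unfamiliar with MAP-Elites, the following generic sketch shows the archive-based loop the paper builds on; random_solution, mutate, evaluate and to_cell are placeholders (in the paper they would wrap the parametric shell model and Karamba analyses), and the FI-MAP-Elites handling of infeasible solutions is omitted here.

```python
import random

def map_elites(random_solution, mutate, evaluate, to_cell, iterations=10_000):
    """Generic MAP-Elites sketch: keep the best (elite) solution per behaviour cell."""
    archive = {}                                   # behaviour cell -> (solution, fitness)
    for _ in range(iterations):
        if archive and random.random() < 0.9:      # mostly mutate existing elites
            parent, _ = random.choice(list(archive.values()))
            candidate = mutate(parent)
        else:                                      # occasionally sample a fresh solution
            candidate = random_solution()
        fitness, behaviour = evaluate(candidate)   # quality score + behaviour descriptors
        cell = to_cell(behaviour)                  # discretise the behaviour space
        if cell not in archive or fitness > archive[cell][1]:
            archive[cell] = (candidate, fitness)   # replace the elite of this niche
    return archive
```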
Open Access
Conference paper
N/A
Johan Oomen; Philo van Kemenade; Rasa Bocyte
Netherlands Institute for Sound & Vision
Segments of audiovisual content are constantly being reinterpreted as they are reused and repeated in new contexts. Framing analysis can reveal patterns and biases in the way content is being recontextualised in the media to shape public discourse. In the AI4Media project, the Netherlands Institute for Sound & Vision has been investigating how AI-based tools could support humanities scholars in performing framing analysis across large-scale audiovisual collections. This short paper describes a demo of the Partial Audio Matching (PAM) functionality designed for this purpose. It describes how PAM has been integrated into the CLARIAH Media Suite – a virtual research space for humanities scholars that enables the exploration and analysis of audiovisual collections.
Open Access
Report
N/A
Adrian Popescu; Hugo Schindler; Jérôme Deshayes-Chossart; Van-Khoa Nguyen
Université Paris-Saclay; University of Geneva;
Online social networks use AI techniques to automatically infer profiles from users’ shared data. However, these inferences and their effects remain, to a large extent, opaque to the users themselves. We propose a method which raises user awareness about the potential use of their profiles in impactful situations, such as searching for a job or an accommodation. These situations illustrate usage contexts that users might not have anticipated when deciding to share their data. User photographic profiles are described by automatic object detections in profile photos, and associated object ratings in situations. Human ratings of the profiles per situation are also available for training. These data are represented as graph structures which are fed into graph neural networks in order to learn how to automatically rate them. An adaptation of the learning procedure per situation is proposed since the same profile is likely to be interpreted differently, depending on the context. Automatic profile ratings are compared to one another in order to inform individual users of their standing with respect to others. Our method is evaluated on a public dataset, and consistently outperforms competitive baselines. An ablation study gives insights about the role of its main components.
Open Access
Conference paper
Conference on Multimedia Retrieval
Claudio Gennaro; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Nicola Messina;
ISTI-CNR;
Open Access
Conference paper
Conference on Image Analysis and Processing
Hristiana Krasteva; Irina Temnikova; Ivo Dzhumerov; Ruslana Margova; Tsvetelina Stefanova; Veneta Kireva;
GATE Institute;
Automatically detecting disinformation is an important Natural Language Processing (NLP) task whose results can assist journalists and the general public. The European Commission defines “disinformation” as “false or misleading content that is spread with an intention to deceive”. Deception and thus disinformation can be identified by the presence of (psycho)linguistic markers, but some lower-resourced languages (e.g. Bulgarian) lack sufficient linguistic and psycholinguistic research on this topic, lists of such markers and suitable datasets. This article introduces the first ever resources for studying and detecting deception and disinformation in Bulgarian (some of which can be adapted to other languages). The resources can benefit linguists, psycholinguists and NLP researchers, are accessible on Zenodo (subject to legal conditions) and include: 1) an extended hierarchical classification of linguistic markers signalling deception; 2) lists of Bulgarian expressions for recognizing some of the linguistic markers; 3) four large Bulgarian social media datasets on topics related to deception, not fact-checked, but automatically annotated with the markers; 4) Python scripts to automatically collect, clean, anonymize, and annotate new Bulgarian texts. The datasets can be used to build machine learning methods or study potential deception. The article describes the methods of collecting and processing the datasets and linguistic markers, and presents some statistics.
Open Access
Conference paper
Language & Technology Conference
Hao Tang; Nicu Sebe; Philip Torr;
ETH Zurich; University of Oxford; University of Trento;
Closed Access
Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence
Dan Xu; Guolei Sun; Hao Tang; Luc van Gool; Nicu Sebe; Radu Timofte; Xiaojuan Qi;
ETH Zurich; HKUST; University of Hong Kong; University of Trento; University of Wurzburg;
We propose a novel edge guided generative adversarial network with contrastive learning (ECGAN) for the challenging semantic image synthesis task. Although considerable improvement has been achieved, the quality of synthesized images is far from satisfactory due to three largely unresolved challenges. 1) The semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures. 2) The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss and thus cannot fully preserve the original semantic information, leading to semantically inconsistent results (e.g., missing small objects). 3) Existing semantic image synthesis methods focus on modeling “local” semantic information from a single input semantic layout. However, they ignore “global” semantic information of multiple input semantic layouts, i.e., semantic cross-relations between pixels across different input layouts. To tackle 1), we propose to use edge as an intermediate representation which is further adopted to guide image generation via a proposed attention guided edge transfer module. Edge information is produced by a convolutional generator and introduces detailed structure information. To tackle 2), we design an effective module to selectively highlight class-dependent feature maps according to the original semantic layout to preserve the semantic information. To tackle 3), inspired by current methods in contrastive learning, we propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content than those from different classes. Doing so can capture more semantic relations by explicitly exploring the structures of labeled pixels from multiple input semantic layouts. Experiments on three challenging datasets show that our ECGAN achieves significantly better results than state-of-the-art methods.
Open Access
Conference paper
International Conference on Learning Representations
Bin Ren; Hao Tang; Nicu Sebe; Wei Wang; Xia Li; Yiming Wang;
Beijing Jiaotong University; ETH Zurich; FBK; University of Trento;
For semantic-guided cross-view image translation, it is crucial to learn where to sample pixels from the source view image and where to reallocate them guided by the target view semantic map, especially when there is little overlap or drastic view difference between the source and target images. Hence, one not only needs to encode the long-range dependencies among pixels in both the source view image and target view semantic map but also needs to translate these learned dependencies. To this end, we propose a novel generative adversarial network, PI-Trans, which mainly consists of a novel Parallel-ConvMLP module and an Implicit Transformation module at multiple semantic levels. Extensive experimental results show that PI-Trans achieves the best qualitative and quantitative performance by a large margin compared to the state-of-the-art methods on two challenging datasets. The source code is available at https://github.com/Amazingren/PI-Trans.
Closed Access
Conference paper
International Conference on Acoustics, Speech and Signal Processing
Carlos Santiago; Claudio Gennaro; Giuseppe Amato; João Paulo Costeira; Luca Ciampi;
Instituto Superior Técnico; ISTI-CNR;
Video violence detection is a subset of human action recognition aiming to detect violent behaviors in trimmed video clips. Current Computer Vision solutions based on Deep Learning approaches provide astonishing results. However, their success relies on large collections of labeled datasets for supervised learning to guarantee that they generalize well to diverse testing scenarios. Although plentiful annotated data may be available for some pre-specified domains, manual annotation is unfeasible for every ad-hoc target domain or task. As a result, in many real-world applications, there is a domain shift between the distributions of the train (source) and test (target) domains, causing a significant drop in performance at inference time. To tackle this problem, we propose an Unsupervised Domain Adaptation scheme for video violence detection based on single image classification that mitigates the domain gap between the two domains. We conduct experiments considering as the source labeled domain some datasets containing violent/non-violent clips in general contexts and, as the target domain, a collection of videos specific for detecting violent actions in public transport, showing that our proposed solution can improve the performance of the considered models.
Open Access
Conference paper
Conference on Image Processing and Vision Engineering
Antonios Liapis; David Melhart; Georgios N. Yannakakis; Paris Mavromoustakos-Blom; Pieter Spronck; Sander Bakkes;
Tilburg University; University of Malta; Utrecht University;
Games are designed to elicit strong emotions during game play, especially when players are competing against each other. Artificial Intelligence applied to predict a player’s emotions has mainly been tested on single-player experiences in low-stakes settings and short-term interactions. How do players experience and manifest affect in high-stakes competitions, and which modalities can capture this? This paper reports a first experiment in this line of research, using a competition of the video game Hearthstone where both competing players’ game play and facial expressions were recorded over the course of the entire match, which could span up to 41 minutes. Using two experts’ annotations of tension, made with a continuous video affect annotation tool, we attempt to predict tension from the webcam footage of the players alone. Treating both the input and the tension output in a relative fashion, our best models reach 66.3% average accuracy (up to 79.2% at the best fold) in the challenging leave-one-participant-out cross-validation task. This initial experiment shows a way forward for affect annotation in games “in the wild” in high-stakes, real-world competitive settings.
Open Access
Conference paper
Conference on the Foundations of Digital Games
Antonios Liapis; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas;
University of Malta
How can we reliably transfer affect models trained in controlled laboratory conditions (in-vitro) to uncontrolled real-world settings (in-vivo)? The information gap between in-vitro and in-vivo applications defines a core challenge of affective computing. This gap is caused by limitations related to affect sensing including intrusiveness, hardware malfunctions and availability of sensors. As a response to these limitations, we introduce the concept of privileged information for operating affect models in real-world scenarios (in the wild). Privileged information enables affect models to be trained across multiple modalities available in a lab, and ignore, without significant performance drops, those modalities that are not available when they operate in the wild. Our approach is tested in two multimodal affect databases, one of which is designed for testing models of affect in the wild. By training our affect models using all modalities and then using solely raw footage frames for testing the models, we reach the performance of models that fuse all available modalities for both training and testing. The results are robust across both classification and regression affect modeling tasks, which are dominant paradigms in affective computing. Our findings make a decisive step towards realizing affect interaction in the wild.
Open Access
Journal article
IEEE Transactions on Affective Computing
Georgios Tzimiropoulos; Ioannis Maniadis Metaxas; Ioannis Patras;
Queen Mary University of London;
Clustering has been a major research topic in the field of machine learning, one to which Deep Learning has recently been applied with significant success. However, an aspect of clustering that is not addressed by existing deep clustering methods, is that of efficiently producing multiple, diverse partitionings for a given dataset. This is particularly important, as a diverse set of base clusterings are necessary for consensus clustering, which has been found to produce better and more robust results than relying on a single clustering. To address this gap, we propose DivClust, a diversity controlling loss that can be incorporated into existing deep clustering frameworks to produce multiple clusterings with the desired degree of diversity. We conduct experiments with multiple datasets and deep clustering frameworks and show that: a) our method effectively controls diversity across frameworks and datasets with very small additional computational cost, b) the sets of clusterings learned by DivClust include solutions that significantly outperform single-clustering baselines, and c) using an off-the-shelf consensus clustering algorithm, DivClust produces consensus clustering solutions that consistently outperform single-clustering baselines, effectively improving the performance of the base deep clustering framework.
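As a toy illustration of a diversity-controlling term (not the paper's exact loss), the snippet below measures how strongly the soft assignments of two clustering heads align and penalises only the excess above a desired similarity level; the names and the similarity measure are assumptions.

```python
import torch

def diversity_penalty(p_a, p_b, target_similarity=0.5):
    """Penalise two clustering heads for being more similar than desired (toy sketch).
    p_a, p_b: (N, K) soft cluster assignments whose rows sum to 1."""
    cross = p_a.t() @ p_b                          # (K, K) co-assignment mass between the two clusterings
    cross = cross / cross.sum()
    similarity = cross.max(dim=1).values.sum()     # close to 1 when clusters align one-to-one
    return torch.relu(similarity - target_similarity)   # only the excess similarity is penalised
```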
Open Access
Conference paper
N/A
Dan Xu; Hao Tang; Hong Liu; Nicu Sebe; Philip Torr;
ETH Zurich; Hong Kong University of Science and Technology; Peking University; University of Oxford; University of Trento;
State-of-the-art methods in the image-to-image translation are capable of learning a mapping from a source domain to a target domain with unpaired image data. Though the existing methods have achieved promising results, they still produce visual artifacts, being able to translate low-level information but not high-level semantics of input images. One possible reason is that generators do not have the ability to perceive the most discriminative parts between the source and target domains, thus making the generated images low quality. In this article, we propose a new Attention-Guided Generative Adversarial Networks (AttentionGAN) for the unpaired image-to-image translation task. AttentionGAN can identify the most discriminative foreground objects and minimize the change of the background. The attention-guided generators in AttentionGAN are able to produce attention masks, and then fuse the generation output with the attention masks to obtain high-quality target images. Accordingly, we also design a novel attention-guided discriminator which only considers attended regions. Extensive experiments are conducted on several generative tasks with eight public datasets, demonstrating that the proposed method is effective to generate sharper and more realistic images compared with existing competitive models. The code is available at https://github.com/Ha0Tang/AttentionGAN.
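The fusion step described above can be sketched as below, assuming for simplicity a single foreground attention mask with values in [0, 1] (AttentionGAN itself produces several masks); the sketch only illustrates how attention keeps the background close to the input.

```python
import torch

def attention_fuse(input_image, generated_content, attention_mask):
    """Sketch of attention-guided fusion: translated content where the mask is active,
    original background elsewhere. All tensors are (B, C, H, W); mask values lie in [0, 1]."""
    return attention_mask * generated_content + (1.0 - attention_mask) * input_image
```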
Closed Access
Journal article
IEEE Transactions on Neural Networks and Learning Systems
Hong Liu; Nicu Sebe; Shin'ichi Satoh; Zhun Zhong
National Institute of Informatics of Tokyo; University of Trento;
Overfitting in adversarial training has attracted the interest of researchers in the community of artificial intelligence and machine learning in recent years. To address this issue, in this paper we begin by evaluating the defense performances of several calibration methods on various robust models. Our analysis and experiments reveal two intriguing properties: 1) calibration decreases the confidence of a robust model; 2) there is a trade-off between the confidences of natural and adversarial images. These new properties offer a straightforward insight into designing a simple but effective regularization, called Self-Residual-Calibration (SRC). The proposed SRC calculates the absolute residual between adversarial and natural logit features corresponding to the ground-truth labels. Furthermore, we utilize the pinball loss to minimize the quantile residual between them, resulting in more robust regularization. Extensive experiments indicate that our SRC can effectively mitigate the overfitting problem while improving the robustness of state-of-the-art models. Importantly, SRC is complementary to various regularization methods. When combined with them, we are capable of achieving the top-rank performance on the AutoAttack benchmark leaderboard.
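A sketch of one plausible reading of the regularizer described above, in which a pinball (quantile) loss with target zero is applied to the logit residual at the ground-truth class; this is an illustration under assumed conventions, not the paper's exact formulation.

```python
import torch

def self_residual_calibration(nat_logits, adv_logits, targets, tau=0.5):
    """Toy SRC-style term. nat_logits, adv_logits: (B, C); targets: (B,) class indices."""
    idx = torch.arange(targets.size(0), device=targets.device)
    residual = adv_logits[idx, targets] - nat_logits[idx, targets]    # signed residual at the true class
    pinball = torch.maximum(tau * residual, (tau - 1.0) * residual)   # quantile loss towards zero
    return pinball.mean()
```

In training, such a term would typically be added to the usual adversarial cross-entropy loss with a small weight.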
Closed Access
Journal article
Artificial Intelligence
Antonios Liapis; Georgios N. Yannakakis; Konstantinos Sfikas;
University of Malta
This paper introduces a user-driven evolutionary algorithm based on Quality Diversity (QD) search. During a design session, the user iteratively selects among presented alternatives and their selections affect the upcoming results. We implement a variation of the MAP-Elites algorithm where the presented alternatives are sampled from a small region (window) of the behavioral space. After a user selection, the window is centered on the selected individual’s behavior characterization, evolution selects parents from within this window to produce offspring, and new alternatives are sampled. Essentially we define an adaptive system of local QD search, where the user’s selections guide the search towards specific regions of the behavioral space. The system is tested on the generation of architectural layouts, a constrained optimization task, leveraging QD search through a two-archive approach.
Open Access
Conference paper
Genetic and Evolutionary Computation Conference
Alejandro Moreo; Fabrizio Sebastiani; Mirko Bunse; Pablo González
ISTI-CNR; University of Applied Sciences and Art Dortmund; University of Oviedo
Open Access
Journal article
SIGKDD Explorations
Hristiana Nikolaeva; Irina Temnikova; Ivo Dzhumerov; Silvia Gargova;
GATE Institute; Plovdiv University;
Automatic Language Identification (LI) is a widely addressed task, but not all users (for example linguists) have the means or interest to develop their own tool or to train the existing ones with their own data. There are several off-the-shelf LI tools, but for some languages, it is unclear which tool is the best for specific types of text. This article presents a comparison of the performance of several off-the-shelf language identification tools on Bulgarian social media data. The LI tools are tested on a multilingual Twitter dataset (composed of 2966 tweets) and an existing Bulgarian Twitter dataset on the topic of fake content detection of 3350 tweets. The article presents the manual annotation procedure of the first dataset, a discussion of the decisions of the two annotators, and the results from testing the 7 off-the-shelf LI tools on both datasets. Our findings show that the tool, which is the easiest for users with no programming skills, achieves the highest F1-Score on Bulgarian social media data, while other tools have very useful functionalities for Bulgarian social media texts.
Open Access
Conference paper
Conference on Computational Linguistics
Fabio Carrara; Giuseppe Amato; Jan Sedmidubsky;
ISTI-CNR; Masaryk University;
Recent progress in pose-estimation methods enables the extraction of sufficiently-precise 3D human skeleton data from ordinary videos, which offers great opportunities for a wide range of applications. However, such spatio-temporal data are typically extracted in the form of a continuous skeleton sequence without any information about semantic segmentation or annotation. To make the extracted data reusable for further processing, there is a need to access them based on their content. In this paper, we introduce a universal retrieval approach that compares any two skeleton sequences based on temporal order and similarities of their underlying segments. The similarity of segments is determined by their content-preserving low-dimensional code representation that is learned using the Variational AutoEncoder principle in an unsupervised way. The quality of the proposed representation is validated in retrieval and classification scenarios; our proposal outperforms the state-of-the-art approaches in effectiveness and reaches speed-ups up to 64x on common skeleton sequence datasets.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Federico Pernici; Matteo Bruni; Niccolò Biondi
University of Florence;
Compatible features enable the direct comparison of old and new learned features, allowing them to be used interchangeably over time. In visual search systems, this eliminates the need to extract new features from the gallery-set when the representation model is upgraded with novel data. This has a big value in real applications, as re-indexing the gallery-set can be computationally expensive when the gallery-set is large, or even infeasible due to privacy or other concerns of the application. In this paper, we propose CoReS, a new training procedure to learn representations that are compatible with those previously learned, grounding on the stationarity of the features as provided by fixed classifiers based on polytopes. With this solution, classes are maximally separated in the representation space and maintain their spatial configuration stationary as new classes are added, so that there is no need to learn any mappings between representations nor to impose pairwise training with the previously learned model. We demonstrate that our training procedure largely outperforms the current state of the art and is particularly effective in the case of multiple upgrades of the training-set, which is the typical case in real applications.
Open Access
Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence
Hao Tang; Ling Shao; Nicu Sebe; Philip Torr;
ETH Zurich; Terminus AI Lab; University of Oxford; University of Trento;
We present a novel bipartite graph reasoning Generative Adversarial Network (BiGraphGAN) for two challenging tasks: person pose and facial image synthesis. The proposed graph generator consists of two novel blocks that aim to model the pose-to-pose and pose-to-image relations, respectively. Specifically, the proposed bipartite graph reasoning (BGR) block aims to reason about the long-range cross relations between the source and target pose in a bipartite graph, which mitigates some of the challenges caused by pose deformation. Moreover, we propose a new interaction-and-aggregation (IA) block to effectively update and enhance the feature representation capability of both a person’s shape and appearance in an interactive way. To further capture the change in pose of each part more precisely, we propose a novel part-aware bipartite graph reasoning (PBGR) block to decompose the task of reasoning the global structure transformation with a bipartite graph into learning different local transformations for different semantic body/face parts. Experiments on two challenging generation tasks with three public datasets demonstrate the effectiveness of the proposed methods in terms of objective quantitative scores and subjective visual realness. The source code and trained models are available at https://github.com/Ha0Tang/BiGraphGAN.
Closed Access
Journal article
International Journal of Computer Vision
Bin Ren; Hao Tang; Lei Ding; Nicu Sebe; Paolo Rota; Songsong Wu;
ETH Zurich; Guangdong University of Petrochemical Technology; University of Trento;
Video processing and analysis have become an urgent task since a huge amount of videos (e.g., YouTube, Hulu) are uploaded online every day. The extraction of representative key frames from videos is very important in video processing and analysis since it greatly reduces computing resources and time. Although great progress has been made recently, large-scale video classification remains an open problem, as the existing methods have not well balanced the performance and efficiency simultaneously. To tackle this problem, this work presents an unsupervised method to retrieve the key frames, which combines Convolutional Neural Network (CNN) and Temporal Segment Density Peaks Clustering (TSDPC). The proposed TSDPC is a generic and powerful framework and it has two advantages compared with previous works: one is that it can calculate the number of key frames automatically; the other is that it can preserve the temporal information of the video. Thus it improves the efficiency of video classification. Furthermore, a Long Short-Term Memory network (LSTM) is added on the top of the CNN to further elevate the performance of classification. Moreover, a weight fusion strategy of different input networks is presented to boost the performance. By optimizing both video classification and key frame extraction simultaneously, we achieve better classification performance and higher efficiency. We evaluate our method on two popular datasets (i.e., HMDB51 and UCF101) and the experimental results consistently demonstrate that our strategy achieves competitive performance and efficiency compared with the state-of-the-art approaches.
Closed Access
Journal article
ACM Transactions on Multimedia Computing, Communications, and Applications
Christos Tzelepis; Ioannis Patras; Nicu Sebe; Simone Barattin;
Queen Mary University of London; University of Trento;
This work addresses the problem of anonymizing the identity of faces in a dataset of images, such that the privacy of those depicted is not violated, while at the same time the dataset is useful for downstream tasks such as training machine learning models. To the best of our knowledge, we are the first to explicitly address this issue and deal with two major drawbacks of the existing state-of-the-art approaches, namely that they (i) require the costly training of additional, purpose-trained neural networks, and/or (ii) fail to retain the facial attributes of the original images in the anonymized counterparts, the preservation of which is of paramount importance for their use in downstream tasks. We accordingly present a task-agnostic anonymization procedure that directly optimises the images’ latent representation in the latent space of a pre-trained GAN. By optimizing the latent codes directly, we ensure both that the identity is of a desired distance away from the original (with an identity obfuscation loss), whilst preserving the facial attributes (using a novel feature-matching loss in FaRL’s deep feature space). We demonstrate through a series of both qualitative and quantitative experiments that our method is capable of anonymizing the identity of the images whilst, crucially, better preserving the facial attributes.
Open Access
Conference paper
N/A
Adam Cygan; Agnieszka Szczesna; Bartosz Bizón; Dominik Golba; Elzbieta Macioszek; Luca Ciampi; Michal Cogiel; Michal Staniszewski; Nicola Messina; Pawel Foszner;
Blees; ISTI-CNR; QSystem.pro; Silesian University of Technology
Data scarcity has become one of the main obstacles to developing supervised models based on Artificial Intelligence in Computer Vision. Indeed, Deep Learning-based models systematically struggle when applied in new scenarios never seen during training and may not be adequately tested in non-ordinary yet crucial real-world situations. This paper presents and publicly releases CrowdSim2, a new synthetic collection of images suitable for people and vehicle detection gathered from a simulator based on the Unity graphical engine. It consists of thousands of images gathered from various synthetic scenarios resembling the real world, where we varied some factors of interest, such as the weather conditions and the number of objects in the scenes. The labels are automatically collected and consist of bounding boxes that precisely localize objects belonging to the two object classes, leaving out humans from the annotation pipeline. We exploited this new benchmark as a testing ground for some state-of-the-art detectors, showing that our simulated scenarios can be a valuable tool for measuring their performances in a controlled environment.
Open Access
Conference paper
N/A
Adam Cygan; Agnieszka Szczesna; Bartosz Bizón; Dominik Golba; Elzbieta Macioszek; Luca Ciampi; Michal Cogiel; Michal Staniszewski; Nicola Messina; Pawel Foszner;
Blees; ISTI-CNR; QSystem.pro; Silesian University of Technology
Generally, crowd datasets can be collected or generated from real or synthetic sources. Real data is generated by using infrastructure-based sensors (such as static cameras or other sensors). The use of simulation tools can significantly reduce the time required to generate scenario-specific crowd datasets, facilitate data-driven research, and subsequently build functional machine learning models. The main goal of this work was to develop an extension of crowd simulation (named CrowdSim2) and prove its usability in the application of people-tracking algorithms. The simulator is developed using the very popular Unity 3D engine with particular emphasis on the aspects of realism in the environment, weather conditions, traffic, and the movement and models of individual agents. Finally, three methods of tracking were used to validate the generated dataset: IOU-Tracker, Deep-Sort, and Deep-TAMA.
Open Access
Conference paper
N/A
Adrien Depeursinge; Davide Calvaresi; Henning Müller; John O. Prior; José Pereira Amorim; Katerina Yordanova; Lidia Dutkiewicz; Lode Lauwaert; Mara Graziani; Mor Vered; Pedro Henriques Abreu; Rahul Nair; Tobias Blanke; Valeria Pulignano; Vincent Andrearczyk; Wessel Reijers;
European University Institute; Faculty of Social Science of Leuven; IPO - Porto Research Centre; Lausanne University Hospital; University of Amsterdam; University of Applied Sciences of Western Switzerland; University of Coimbra; University of Geneva;
Since its emergence in the 1960s, Artificial Intelligence (AI) has grown to conquer many technology products and their fields of application. Machine learning, as a major part of the current AI solutions, can learn from the data and through experience to reach high performance on various tasks. This growing success of AI algorithms has led to a need for interpretability to understand opaque models such as deep neural networks. Various requirements have been raised from different domains, together with numerous tools to debug, justify outcomes, and establish the safety, fairness and reliability of the models. This variety of tasks has led to inconsistencies in the terminology with, for instance, terms such as interpretable, explainable and transparent being often used interchangeably in methodology papers. These words, however, convey different meanings and are “weighted” differently across domains, for example in the technical and social sciences. In this paper, we propose an overarching terminology of interpretability of AI systems that can be referred to by the technical developers as much as by the social sciences community to pursue clarity and efficiency in the definition of regulations for ethical and reliable AI development. We show how our taxonomy and definition of interpretable AI differ from the ones in previous research and how they apply with high versatility to several domains and use cases, proposing a highly needed standard for the communication among interdisciplinary areas of AI.
Open Access
Journal article
N/A
Adrian Popescu; David Picard; Grégoire Petit; Hugo Schindler;
Université Gustave Eiffel; Université Paris-Saclay;
Exemplar-free class-incremental learning is very challenging due to the negative effect of catastrophic forgetting. A balance between stability and plasticity of the incremental process is needed in order to obtain good accuracy for past as well as new classes. Existing exemplar-free class-incremental methods focus either on successive fine tuning of the model, thus favoring plasticity, or on using a feature extractor fixed after the initial incremental state, thus favoring stability. We introduce a method which combines a fixed feature extractor and a pseudo-features generator to improve the stability-plasticity balance. The generator uses a simple yet effective geometric translation of new class features to create representations of past classes, made of pseudo-features. The translation of features only requires the storage of the centroid representations of past classes to produce their pseudo-features. Actual features of new classes and pseudo-features of past classes are fed into a linear classifier which is trained incrementally to discriminate between all classes. The incremental process is much faster with the proposed method compared to mainstream ones which update the entire deep model. Experiments are performed with three challenging datasets, and different incremental settings. A comparison with ten existing methods shows that our method outperforms the others in most cases. FeTrIL code is available at https://github.com/GregoirePetit/FeTrIL.
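The pseudo-feature generator described above reduces to a geometric translation of features; a minimal sketch with assumed variable names is:

```python
import numpy as np

def pseudo_features(new_class_feats, new_class_centroid, past_class_centroid):
    """Translate features of a new class so they are centred on a past-class centroid (sketch).
    new_class_feats: (N, D) features from the frozen extractor; both centroids: (D,)."""
    return new_class_feats - new_class_centroid + past_class_centroid

# The linear classifier is then trained on real features of new classes plus pseudo-features
# generated this way for every past class; only the past-class centroids need to be stored.
```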
Open Access
Conference paper
N/A
Christos Tzelepis; Ioannis Pitas; James Oldfield; Mihalis Nicolaou; Yannis Panagakis
Cyprus Institute; Queen Mary University of London; University of Athens
Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either discovering global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control. Our code is available at https://github.com/james-oldfield/PandA.
Open Access
Conference paper
International Conference on Learning Representations
Gim Hee Lee; Nicu Sebe; Yuyang Zhao; Zhiming Luo; Zhun Zhong
National University of Singapore; University of Trento; Xiamen University
In this work, we introduce a new concept, named source-free open compound domain adaptation (SF-OCDA), and study it in semantic segmentation. SF-OCDA is more challenging than the traditional domain adaptation but it is more practical. It jointly considers (1) the issues of data privacy and data storage and (2) the scenario of multiple target domains and unseen open domains. In SF-OCDA, only the source pre-trained model and the target data are available to learn the target model. The model is evaluated on the samples from the target and unseen open domains. To solve this problem, we present an effective framework by separating the training process into two stages: (1) pre-training a generalized source model and (2) adapting a target model with self-supervised learning. In our framework, we propose the Cross-Patch Style Swap (CPSS) to diversify samples with various patch styles in the feature-level, which can benefit the training of both stages. First, CPSS can significantly improve the generalization ability of the source model, providing more accurate pseudo-labels for the latter stage. Second, CPSS can reduce the influence of noisy pseudo-labels and also avoid the model overfitting to the target domain during self-supervised learning, consistently boosting the performance on the target and open domains. Experiments demonstrate that our method produces state-of-the-art results on the C-Driving dataset. Furthermore, our model also achieves the leading performance on CityScapes for domain generalization.
Open Access
Journal article
IEEE Transactions on Circuits and Systems for Video Technology
Hao Tang; Mengyi Zhao; Nicu Sebe; Wei Wang; Yue Song
ETH Zurich; University of Trento;
Modern saliency detection models are based on the encoder-decoder framework and they use different strategies to fuse the multi-level features between the encoder and decoder to boost representation power. Motivated by recent work in implicit modelling, we propose to introduce an implicit function to simulate the equilibrium state of the feature pyramid at infinite depths. We question the existence of the ideal equilibrium and thus propose a quasi-equilibrium model by taking the first-order derivative into the black-box root solver using Taylor expansion. It models more realistic convergence states and significantly improves the network performance. We also propose a differentiable edge extractor that directly extracts edges from the saliency masks. By optimizing the extracted edges, the generated saliency masks are naturally optimized on contour constraints and the non-deterministic predictions are removed. We evaluate the proposed methodology on five public datasets and extensive experiments show that our method achieves new state-of-the-art performances on six metrics across datasets.
Closed Access
Journal article
IEEE Transactions on Image Processing
Hao Tang; Nicu Sebe; Wei Wang; Yue Song
ETH Zurich; University of Trento;
Salient object detection has been long studied to identify the most visually attractive objects in images/videos. Recently, a growing number of approaches have been proposed all of which rely on the contour/edge information to improve detection performance. The edge labels are either put into the loss directly or used as extra supervision. The edge and body can also be learned separately and then fused afterward. Both methods either lead to high prediction errors near the edge or cannot be trained in an end-to-end manner. Another problem is that existing methods may fail to detect objects of various sizes due to the lack of efficient and effective feature fusion mechanisms. In this work, we propose to decompose the saliency detection task into two cascaded sub-tasks, i.e., detail modelling and body filling. Specifically, the detail modelling focuses on capturing the object edges by supervision of explicitly decomposed detail label that consists of the pixels that are nested on the edge and near the edge. Then the body filling learns the body part which will be filled into the detail map to generate more accurate saliency map. To effectively fuse the features and handle objects at different scales, we have also proposed two novel multi-scale detail attention and body attention blocks for precise detail and body modelling. Experimental results show that our method achieves state-of-the-art performances on six public datasets.
Open Access
Journal article
ACM Multimedia Systems Conference
Cristiano Saltori; Elisa Ricci; Fabio Poiesi; Guofeng Mei; Jian Zhang; Nicu Sebe; Qiang Wu
Fondazione Bruno Kessler; University of Technology Sydney; University of Trento;
Unsupervised learning on 3D point clouds has undergone a rapid evolution, especially thanks to data augmentation-based contrastive methods. However, data augmentation is not ideal as it requires a careful selection of the type of augmentations to perform, which in turn can affect the geometric and semantic information learned by the network during selftraining. To overcome this issue, we propose an augmentation-free unsupervised approach for point clouds to learn transferable point-level features via soft clustering, named SoftClu. SoftClu assumes that the points belonging to a cluster should be close to each other in both geometric and feature spaces. This differs from typical contrastive learning, which builds similar representations for a whole point cloud and its augmented versions. We exploit the affiliation of points to their clusters as a proxy to enable self-training through a pseudo-label prediction task. Under the constraint that these pseudo-labels induce the equipartition of the point cloud, we cast SoftClu as an optimal transport problem. We formulate an unsupervised loss to minimize the standard cross-entropy between pseudolabels and predicted labels. Experiments on downstream applications, such as 3D object classification, part segmentation, and semantic segmentation, show the effectiveness of our framework in outperforming state-of-the-art techniques.
Open Access
Conference paper
British Machine Vision Conference
Nicu Sebe; Wei Wang; Yue Song
Beijing Jiaotong University; University of Trento;
The task of out-of-distribution (OOD) detection is crucial for deploying machine learning models in real-world settings. In this paper, we observe that the singular value distributions of the in-distribution (ID) and OOD features are quite different: the OOD feature matrix tends to have a larger dominant singular value than the ID feature, and the class predictions of OOD samples are largely determined by it. This observation motivates us to propose RankFeat, a simple yet effective post hoc approach for OOD detection by removing the rank-1 matrix composed of the largest singular value and the associated singular vectors from the high-level feature. RankFeat achieves state-of-the-art performance and reduces the average false positive rate (FPR95) by 17.90% compared with the previous best method. Extensive ablation studies and comprehensive theoretical analyses are presented to support the empirical results.
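Because the operation described above is a single post hoc step, it can be sketched in a few lines; the tensor shapes are assumptions, and the final OOD score computed from the resulting logits (e.g., an energy score) is not shown.

```python
import torch

def rankfeat(feature_map):
    """Remove the rank-1 component with the largest singular value from a feature map (sketch).
    feature_map: (B, C, H, W) high-level features of a batch of images."""
    B, C, H, W = feature_map.shape
    X = feature_map.reshape(B, C, H * W)
    U, S, Vh = torch.linalg.svd(X, full_matrices=False)     # batched SVD
    rank1 = S[:, :1, None] * (U[:, :, :1] @ Vh[:, :1, :])   # s1 * u1 * v1^T per sample
    return (X - rank1).reshape(B, C, H, W)                  # pass onward to compute the logits
```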
Open Access
Conference paper
Conference on Neural Information Processing Systems
Gim Hee Lee; Nicu Sebe; Yuyang Zhao; Zhun Zhong
National University of Singapore; University of Trento;
In this paper, we consider the problem of domain generalization in semantic segmentation, which aims to learn a robust model using only labeled synthetic (source) data. The model is expected to perform well on unseen real (target) domains. Our study finds that the image style variation can largely influence the model’s performance and the style features can be well represented by the channel-wise mean and standard deviation of images. Inspired by this, we propose a novel adversarial style augmentation (AdvStyle) approach, which can dynamically generate hard stylized images during training and thus can effectively prevent the model from overfitting on the source domain. Specifically, AdvStyle regards the style feature as a learnable parameter and updates it by adversarial training. The learned adversarial style feature is used to construct an adversarial image for robust model training. AdvStyle is easy to implement and can be readily applied to different models. Experiments on two synthetic-to-real semantic segmentation benchmarks demonstrate that AdvStyle can significantly improve the model performance on unseen real domains and show that we can achieve the state of the art. Moreover, AdvStyle can be employed to domain generalized image classification and produces a clear improvement on the considered datasets.
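A rough sketch of the adversarial style augmentation idea, under the assumptions that an image's style is its per-channel mean and standard deviation and that a single gradient-ascent step on the task loss is taken on those statistics; the step size and other details are placeholders, not the paper's settings.

```python
import torch

def adv_style_augment(images, model, criterion, labels, step_size=1.0):
    """Make the style (channel-wise mean/std) of each image adversarially harder (sketch)."""
    mu = images.mean(dim=(2, 3), keepdim=True)
    sigma = images.std(dim=(2, 3), keepdim=True) + 1e-6
    content = (images - mu) / sigma                          # style-normalised content
    adv_mu = mu.clone().requires_grad_(True)                 # style treated as learnable parameters
    adv_sigma = sigma.clone().requires_grad_(True)
    loss = criterion(model(content * adv_sigma + adv_mu), labels)
    g_mu, g_sigma = torch.autograd.grad(loss, [adv_mu, adv_sigma])
    with torch.no_grad():                                    # gradient *ascent*: increase the task loss
        adv_mu += step_size * g_mu
        adv_sigma += step_size * g_sigma
    return (content * adv_sigma + adv_mu).detach()           # adversarially stylised training images
```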
Open Access
Conference paper
Conference on Neural Information Processing Systems
Gim Hee Lee; Na Zhao; Nicu Sebe; Yuyang Zhao; Zhun Zhong
National University of Singapore; University of Trento;
In this paper, we study the task of synthetic-to-real domain generalized semantic segmentation, which aims to learn a model that is robust to unseen real-world scenes using only synthetic data. The large domain shift between synthetic and real-world data, including the limited source environmental variations and the large distribution gap between synthetic and real-world data, significantly hinders the model performance on unseen real-world scenes. In this work, we propose the Style-HAllucinated Dual consistEncy learning (SHADE) framework to handle such domain shift. Specifically, SHADE is constructed based on two consistency constraints, Style Consistency (SC) and Retrospection Consistency (RC). SC enriches the source situations and encourages the model to learn consistent representation across style-diversified samples. RC leverages real-world knowledge to prevent the model from overfitting to synthetic data and thus largely keeps the representation consistent between the synthetic and real-world models. Furthermore, we present a novel style hallucination module (SHM) to generate style-diversified samples that are essential to consistency learning. SHM selects basis styles from the source distribution, enabling the model to dynamically generate diverse and realistic samples during training. Experiments show that our SHADE yields significant improvement and outperforms state-of-the-art methods by 5.05% and 8.35% on the average mIoU of three real-world datasets on single- and multi-source settings, respectively.
Open Access
Conference paper
European Conference on Computer Vision
Andrea Pilzer; Arno Solin; Elisa Ricci; Juho Kannala; Martin Trapp; Nicu Sebe; Subhankar Roy;
Aalto University; Fondazione Bruno Kessler; NVIDIA; University of Trento;
Source-free domain adaptation (SFDA) aims to adapt a classifier to an unlabelled target data set by only using a pre-trained source model. However, the absence of the source data and the domain shift makes the predictions on the target data unreliable. We propose quantifying the uncertainty in the source model predictions and utilizing it to guide the target adaptation. For this, we construct a probabilistic source model by incorporating priors on the network parameters inducing a distribution over the model predictions. Uncertainties are estimated by employing a Laplace approximation and incorporated to identify target data points that do not lie in the source manifold and to down-weight them when maximizing the mutual information on the target data. Unlike recent works, our probabilistic treatment is computationally lightweight, decouples source training and target adaptation, and requires no specialized source training or changes of the model architecture. We show the advantages of uncertainty-guided SFDA over traditional SFDA in the closed-set and open-set settings and provide empirical evidence that our approach is more robust to strong domain shifts even without tuning.
Open Access
Conference paper
N/A
Nicu Sebe; Wei Wang; Yue Song
University of Trento;
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications. One crucial bottleneck limiting its usage is the expensive computation cost, particularly for a mini-batch of matrices in the deep neural networks. In this paper, we propose a QR-based ED method dedicated to the application scenarios of computer vision. Our proposed method performs the ED entirely by batched matrix/vector multiplication, which processes all the matrices simultaneously and thus fully utilizes the power of GPUs. Our technique is based on the explicit QR iterations by Givens rotation with double Wilkinson shifts. With several acceleration techniques, the time complexity of QR iterations is reduced from O(n^5) to O(n^3). The numerical test shows that for small and medium batched matrices (e.g., dim < 32) our method can be much faster than the PyTorch SVD function. Experimental results on visual recognition and image generation demonstrate that our methods also achieve competitive performances.
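To make the batched idea concrete, the sketch below runs a plain, unshifted QR iteration on a batch of symmetric matrices; it only illustrates the principle, whereas the paper's method uses explicit Givens rotations with double Wilkinson shifts and further acceleration techniques.

```python
import torch

def batched_qr_eig(A, iterations=200):
    """Plain batched QR iteration for symmetric matrices (illustrative sketch only).
    A: (B, n, n) symmetric. Returns approximate eigenvalues and eigenvectors."""
    n = A.shape[-1]
    V = torch.eye(n, dtype=A.dtype, device=A.device).expand_as(A).clone()
    for _ in range(iterations):
        Q, R = torch.linalg.qr(A)   # batched QR factorisation, runs on the GPU
        A = R @ Q                   # similarity transform; iterates towards a diagonal matrix
        V = V @ Q                   # accumulate the eigenvector estimate
    return A.diagonal(dim1=-2, dim2=-1), V
```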
Open Access
Conference paper
European Conference on Computer Vision
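The following simplified PyTorch sketch conveys the idea of batched shifted QR iteration for symmetric matrices; unlike the paper's method, it relies on torch.linalg.qr and a single Wilkinson shift instead of explicit Givens rotations with double Wilkinson shifts, so it is an illustration rather than a reimplementation.

```python
import torch

def batched_symmetric_eig(A, iters=50):
    """Batched eigen-decomposition of symmetric matrices via shifted QR iteration.

    A: (B, n, n) batch of symmetric matrices; returns (eigenvalues, eigenvectors).
    """
    B, n, _ = A.shape
    T = A.clone()
    V = torch.eye(n, device=A.device).expand(B, n, n).contiguous()
    I = torch.eye(n, device=A.device)
    for _ in range(iters):
        # Wilkinson shift: eigenvalue of the trailing 2x2 block closest to T[-1, -1].
        a, b, c = T[:, -2, -2], T[:, -2, -1], T[:, -1, -1]
        delta = (a - c) / 2
        mu = c - torch.sign(delta) * b * b / (delta.abs() + torch.sqrt(delta**2 + b**2) + 1e-12)
        Q, R = torch.linalg.qr(T - mu[:, None, None] * I)
        T = R @ Q + mu[:, None, None] * I        # similarity transform, eigenvalues preserved
        V = V @ Q                                # accumulate eigenvectors
    return torch.diagonal(T, dim1=-2, dim2=-1), V
```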
Nicu Sebe; Wei Wang; Yue Song
University of Trento;
Inserting an SVD meta-layer into neural networks is prone to make the covariance ill-conditioned, which could harm the model in terms of training stability and generalization ability. In this paper, we systematically study how to improve the covariance conditioning by enforcing orthogonality on the Pre-SVD layer. We first investigate existing orthogonal treatments of the weights; these techniques improve the conditioning but hurt performance. To avoid such a side effect, we propose the Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR). The effectiveness of our methods is validated in two applications: decorrelated Batch Normalization (BN) and Global Covariance Pooling (GCP). Extensive experiments on visual recognition demonstrate that our methods can simultaneously improve the covariance conditioning and generalization. Moreover, combinations with orthogonal weights can further boost performance.
Open Access
Conference paper
European Conference on Computer Vision
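One possible reading of the Nearest Orthogonal Gradient idea is to replace the Pre-SVD layer's weight gradient with its closest orthogonal matrix (via the polar decomposition); the sketch below follows that reading and should be taken as an assumption, not as the paper's exact procedure.

```python
import torch

def nearest_orthogonal(grad):
    """Closest orthogonal matrix to `grad` in the Frobenius norm (polar decomposition).

    For G = U S V^T (SVD), the nearest orthogonal matrix is U V^T.
    """
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return U @ Vh

# Hypothetical usage inside a training step, before the optimiser update:
# pre_svd_layer.weight.grad = nearest_orthogonal(pre_svd_layer.weight.grad)
```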
Elisa Ricci; Mingxuan Liu; Nicu Sebe; Subhankar Roy; Zhun Zhong
Fondazione Bruno Kessler; University of Trento;
We study the new task of class-incremental Novel Class Discovery (class-iNCD), which refers to the problem of discovering novel categories in an unlabelled data set by leveraging a pre-trained model that has been trained on a labelled data set containing disjoint yet related categories. Apart from discovering novel classes, we also aim at preserving the ability of the model to recognize previously seen base categories. Inspired by rehearsal-based incremental learning methods, in this paper we propose a novel approach for class-iNCD which prevents forgetting of past information about the base classes by jointly exploiting base class feature prototypes and feature-level knowledge distillation. We also propose a self-training clustering strategy that simultaneously clusters novel categories and trains a joint classifier for both the base and novel classes. This makes our method able to operate in a class-incremental setting. Our experiments, conducted on three common benchmarks, demonstrate that our method significantly outperforms state-of-the-art approaches. Code is available at https://github.com/OatmealLiu/class-iNCD.
Open Access
Conference paper
European Conference on Computer Vision
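To make the ingredients listed in the abstract above concrete, the sketch below combines feature-level distillation against the frozen base model, prototype replay, and self-training on clustering pseudo-labels into a single loss; all names, loss weights and the noise added to the prototypes are illustrative assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def incd_loss(encoder, frozen_encoder, classifier, x, base_prototypes, pseudo_labels):
    """Joint loss combining feature distillation, prototype replay and self-training.

    encoder / frozen_encoder: current and frozen pre-trained feature extractors
    classifier:               joint head over base + novel classes
    base_prototypes:          (C_base, D) stored class-mean features of the base classes
    pseudo_labels:            (B,) clustering assignments for the novel-class batch x
    """
    feat = encoder(x)                                   # (B, D)
    with torch.no_grad():
        feat_old = frozen_encoder(x)                    # (B, D) features of the frozen model
    # 1) Feature-level knowledge distillation to prevent forgetting.
    loss_kd = F.mse_loss(feat, feat_old)
    # 2) Prototype replay: perturb stored base-class means and classify them.
    protos = base_prototypes + 0.1 * torch.randn_like(base_prototypes)
    proto_labels = torch.arange(base_prototypes.size(0), device=x.device)
    loss_proto = F.cross_entropy(classifier(protos), proto_labels)
    # 3) Self-training on clustering pseudo-labels for the novel samples.
    loss_self = F.cross_entropy(classifier(feat), pseudo_labels)
    return loss_kd + loss_proto + loss_self
```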
Andrea Esuli; Fabrizio Falchi; Giuseppe Amato; Nicola Messina;
ISTI-CNR;
Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by jointly attending over image and text features from the two different processing pipelines, usually through mutual attention mechanisms. However, this invalidates any chance to extract separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to separately reason on the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the implemented network is able to produce compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed exploiting caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we are able to achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN.
Open Access
Conference paper
N/A
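The weight-sharing design described above can be illustrated with a small PyTorch module in which two modality-specific Transformer-encoder stacks share their deeper layers; the dimensions and layer counts below are placeholders, not the TERN configuration.

```python
import torch.nn as nn

class SharedTopEncoders(nn.Module):
    """Two modality-specific Transformer-encoder stacks whose deeper layers share weights."""

    def __init__(self, d_model=512, nhead=8, private_layers=2, shared_layers=4):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.visual_private = nn.ModuleList([make() for _ in range(private_layers)])
        self.text_private = nn.ModuleList([make() for _ in range(private_layers)])
        self.shared = nn.ModuleList([make() for _ in range(shared_layers)])

    def forward(self, visual_tokens, text_tokens):
        v, t = visual_tokens, text_tokens               # (B, Lv, d_model), (B, Lt, d_model)
        for blk in self.visual_private:
            v = blk(v)
        for blk in self.text_private:
            t = blk(t)
        for blk in self.shared:                         # the same weights process both modalities
            v, t = blk(v), blk(t)
        # The first token of each sequence acts as the global, separately indexable descriptor.
        return v[:, 0], t[:, 0]
```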
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
In this paper, we present the fourth release of VISIONE, a tool for fast and effective video search on a large-scale dataset. It includes several search functionalities like text search, object and color-based search, semantic and visual similarity search, and temporal search. VISIONE uses ad-hoc textual encoding for indexing and searching video content, and it exploits a full-text search engine as search backend. In this new version of the system, we introduced some changes both to the current search techniques and to the user interface.
Open Access
Conference paper
N/A
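One common way such ad-hoc textual encodings are realised is by turning a feature vector into a "surrogate text" whose term frequencies approximate the feature values, so that an off-the-shelf full-text engine can index and rank it; the snippet below is a generic sketch of this idea, not VISIONE's encoder.

```python
import numpy as np

def surrogate_text(feature, n_terms=128, max_repeat=10):
    """Encode a feature vector as space-separated 'surrogate text' for a full-text engine.

    Each selected dimension becomes a synthetic term; its quantised magnitude controls
    how often the term is repeated, so term frequency approximates the feature value.
    """
    feature = np.asarray(feature, dtype=float)
    top = np.argsort(-feature)[:n_terms]                 # keep the strongest components
    scale = feature[top].max() + 1e-12
    tokens = []
    for i in top:
        repeats = max(1, int(round(max_repeat * feature[i] / scale)))
        tokens.extend([f"f{i}"] * repeats)
    return " ".join(tokens)

# Example: surrogate_text([0.9, 0.0, 0.3]) -> "f0 f0 ... f2", ready to be indexed as text.
```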
Hannes Fassold; Werner Bailer;
Joanneum Research;
In order to support common annotation tasks in visual media production and archiving, we propose two datasets which cover the annotation of the bustle of a scene (i.e., populated to unpopulated), the cinematographic type of a shot as well as the time of day and season of a shot. The dataset for bustle and shot type, called People@Places, adds annotations to the Places365 dataset, and the ToDY (time of day/year) dataset adds annotations to the SkyFinder dataset. For both datasets, we provide a toolchain to create automatic annotations, which have been manually verified and corrected for parts of the two datasets. We provide baseline results for these tasks using the EfficientNet-B3 model, pretrained on the Places365 dataset.
Open Access
Conference paper
MultiMedia Modeling
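A rough reconstruction of the kind of baseline reported above: a shared EfficientNet-B3 backbone with separate classification heads for bustle and shot type. The timm weights used here are ImageNet-pretrained (Places365 weights are not bundled with timm), and the number of shot-type classes is a placeholder.

```python
import timm
import torch.nn as nn

class PeoplePlacesBaseline(nn.Module):
    """Shared EfficientNet-B3 backbone with separate heads for bustle and shot type."""

    def __init__(self, n_bustle=2, n_shot_types=6):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.backbone = timm.create_model("efficientnet_b3", pretrained=True, num_classes=0)
        dim = self.backbone.num_features
        self.bustle_head = nn.Linear(dim, n_bustle)
        self.shot_head = nn.Linear(dim, n_shot_types)

    def forward(self, x):
        feat = self.backbone(x)
        return self.bustle_head(feat), self.shot_head(feat)
```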
Adrian Popescu; Céline Hudelot; Eva Feillet; Grégoire Petit; Marina Reyboz
Université Grenoble Alpes; Université Gustave Eiffel; Université Paris-Saclay;
Open Access
Conference paper
Winter Conference on Applications of Computer Vision
Hao Tang; Ling Shao; Nicu Sebe; Philip Torr;
ETH Zurich; University of Oxford; University of Trento;
In this paper, we address the task of semantic-guided image generation. One challenge common to most existing image-level generation methods is the difficulty in generating small objects and detailed local textures. To address this, in this work we consider generating images using local context. As such, we design a local class-specific generative network using semantic maps as guidance, which separately constructs and learns subgenerators for different classes, enabling it to capture finer details. To learn more discriminative class-specific feature representations for the local generation, we also propose a novel classification module. To combine the advantages of both global image-level and local class-specific generation, a joint generation network is designed with an attention fusion module and a dual-discriminator structure embedded. Lastly, we propose a novel semantic-aware upsampling method, which has a larger receptive field and can take far-away pixels that are semantically related for feature upsampling, enabling it to better preserve semantic consistency for instances with the same semantic labels. Extensive experiments on two image generation tasks show the superior performance of the proposed method. State-of-the-art results are established by large margins on both tasks and on nine challenging public benchmarks. The source code and trained models are available at https://github.com/Ha0Tang/LGGAN.
Closed Access
Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence
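A toy version of the local class-specific generation described above: one small sub-generator per semantic class, applied only where the semantic map selects that class, with outputs summed into a local result (global branch, attention fusion and discriminators omitted). Everything here is illustrative.

```python
import torch.nn as nn

class LocalClassGenerators(nn.Module):
    """One small sub-generator per semantic class, composited via the semantic map."""

    def __init__(self, n_classes, in_ch=64, out_ch=3):
        super().__init__()
        self.subgens = nn.ModuleList(
            nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(in_ch, out_ch, 3, padding=1))
            for _ in range(n_classes)
        )

    def forward(self, feat, semantic_map):
        # feat: (B, C, H, W) shared features; semantic_map: (B, n_classes, H, W) one-hot masks
        out = 0
        for c, gen in enumerate(self.subgens):
            out = out + gen(feat) * semantic_map[:, c:c + 1]   # each class is generated locally
        return out
```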
Artem Yaroshchuk; Luca Cuccovillo; Malte Baum; Patrick Aichroth;
Fraunhofer IDMT;
In this paper we present a novel approach for environment classification for speech recordings, which does not require the selection of decaying reverberation tails. It is based on a multi-band RT60 analysis of blind channel estimates and achieves an accuracy of up to 93.8% on test recordings derived from the ACE corpus.
Open Access
Conference paper
Transactions on Information Forensics and Security
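As an illustration of a multi-band RT60 analysis, the snippet below applies the classical Schroeder energy-decay fit per frequency band to a (blindly estimated) channel impulse response; the band edges, filter order and fit range are assumptions, and the paper's estimator may differ.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def multiband_rt60(channel_estimate, sr, bands=((125, 250), (250, 500), (500, 1000),
                                                (1000, 2000), (2000, 4000))):
    """Per-band RT60 features from a (blindly estimated) channel impulse response.

    For each band, the energy decay curve (Schroeder integral) is computed and RT60
    is extrapolated from a linear fit on the -5 dB to -25 dB segment.
    """
    feats = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        h = sosfiltfilt(sos, channel_estimate)
        edc = np.cumsum(h[::-1] ** 2)[::-1]                     # Schroeder backward integral
        edc_db = 10 * np.log10(edc / (edc[0] + 1e-12) + 1e-12)
        idx = np.where((edc_db <= -5) & (edc_db >= -25))[0]
        if len(idx) < 2:
            feats.append(0.0)
            continue
        t = idx / sr
        slope, _ = np.polyfit(t, edc_db[idx], 1)                # decay rate in dB per second
        feats.append(-60.0 / slope if slope < 0 else 0.0)
    return np.array(feats)
```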
Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris
CERTH; Queen Mary University of London;
In this paper we propose a method for explaining video summarization. We start by formulating the problem as the creation of an explanation mask which indicates the parts of the video that influenced the most the estimates of a video summarization network, about the frames’ importance. Then, we explain how the typical analysis pipeline of attention-based networks for video summarization can be used to define explanation signals, and we examine various attention-based signals that have been studied as explanations in the NLP domain. We evaluate the performance of these signals by investigating the video summarization network’s input-output relationship according to different replacement functions, and utilizing measures that quantify the capability of explanations to spot the most and least influential parts of a video. We run experiments using an attention-based network (CA-SUM) and two datasets (SumMe and TVSum) for video summarization. Our evaluations indicate the advanced performance of explanations formed using the inherent attention weights, and demonstrate the ability of our method to explain the video summarization results using clues about the focus of the attention mechanism.
Open Access
Conference paper
IEEE International Symposium on Multimedia
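A minimal sketch of the two steps discussed above: forming an explanation mask from attention weights and scoring it with a replacement function (here, zeroing the selected fragments). The fraction of fragments kept and the replacement choice are assumptions, not the paper's exact setup.

```python
import torch

def attention_explanation_mask(attn, top_fraction=0.2):
    """Turn a self-attention matrix into a per-fragment explanation mask.

    attn: (T, T) attention weights of the summariser over T video fragments.
    A fragment's salience is the average attention it receives from all fragments;
    the top fraction of fragments forms the explanation mask.
    """
    salience = attn.mean(dim=0)                        # (T,) incoming attention per fragment
    k = max(1, int(top_fraction * salience.numel()))
    mask = torch.zeros_like(salience, dtype=torch.bool)
    mask[salience.topk(k).indices] = True
    return mask

def replacement_effect(scores_fn, features, mask):
    """How much replacing the masked fragments (here: with zeros) changes the
    network's importance scores; a larger change indicates a better explanation."""
    original = scores_fn(features)
    perturbed = features.clone()
    perturbed[mask] = 0.0
    return (original - scores_fn(perturbed)).abs().mean()
```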
Fabio Valerio Massoli; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo;
ISTI-CNR;
In recent years, Quantum Computing witnessed massive improvements in terms of available resources and algorithms development. The ability to harness quantum phenomena to solve computational problems is a long-standing dream that has drawn the scientific community’s interest since the late 80s. In such a context, we propose our contribution. First, we introduce basic concepts related to quantum computations, and then we explain the core functionalities of technologies that implement the Gate Model and Adiabatic Quantum Computing paradigms. Finally, we gather, compare and analyze the current state-of-the-art concerning Quantum Perceptrons and Quantum Neural Networks implementations.
Open Access
Journal article
ACM Computing Surveys
Chen Feng; Ioannis Patras;
Queen Mary University of London;
Self-supervised learning has recently achieved great success in representation learning without human annotations. The dominant method, contrastive learning, is generally based on instance discrimination tasks, i.e., individual samples are treated as independent categories. However, presuming all the samples are different contradicts the natural grouping of similar samples in common visual datasets, e.g., multiple views of the same dog. To bridge the gap, this paper proposes an adaptive method that introduces soft inter-sample relations, namely Adaptive Soft Contrastive Learning (ASCL). More specifically, ASCL transforms the original instance discrimination task into a multi-instance soft discrimination task and adaptively introduces inter-sample relations. As an effective and concise plug-in module for existing self-supervised learning frameworks, ASCL achieves the best results on several benchmarks in terms of both performance and efficiency. Code is available at https://github.com/MrChenFeng/ASCL_ICPR2022.
Open Access
Conference paper
N/A
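The sketch below illustrates how soft inter-sample relations could be built for one sample against a memory bank and adaptively mixed with the hard instance-discrimination target; the temperature, neighbourhood size and confidence heuristic are assumptions, not the ASCL recipe.

```python
import torch
import torch.nn.functional as F

def adaptive_soft_targets(query, bank, self_idx, temperature=0.1, k=10):
    """Soft inter-sample targets for one query against a memory bank (illustrative).

    query:    (D,) embedding of the current sample
    bank:     (N, D) memory bank of embeddings (the query's own entry included)
    self_idx: position of the query inside the bank
    """
    sims = F.normalize(query, dim=0) @ F.normalize(bank, dim=1).T      # (N,) cosine similarities
    sims[self_idx] = float("-inf")                                     # exclude the sample itself
    top = sims.topk(k).indices
    neigh = F.softmax(sims[top] / temperature, dim=0)                  # (k,) soft relations
    confidence = neigh.max().item()                                    # sharpness of the neighbourhood
    target = torch.zeros_like(sims)
    target[self_idx] = 1.0                                             # hard instance-discrimination part
    target[top] = confidence * neigh                                   # adaptively added soft part
    return target / target.sum()                                       # normalised soft target
```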
Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Roberto Caldelli;
ISTI-CNR; Mercatorum University; National Inter-University Consortium for Telecommunications;
The adoption of deep learning-based solutions practically pervades all the diverse areas of our everyday life, showing improved performance with respect to classical systems. Since many applications deal with sensitive data and procedures, a strong demand to know the actual reliability of such technologies is always present. This work analyzes the robustness characteristics of a specific kind of deep neural network, the neural ordinary differential equations (N-ODE) network. It is very interesting for its effectiveness and for a peculiar property based on a test-time tunable parameter that permits obtaining a trade-off between accuracy and efficiency. In addition, adjusting such a tolerance parameter grants robustness against adversarial attacks. Notably, decoupling the values of such a tolerance between training and test time can strongly reduce the attack success rate. On this basis, we show how such tolerance can be adopted, during the prediction phase, to improve the robustness of N-ODE to adversarial attacks. In particular, we demonstrate how we can exploit this property to construct an effective detection strategy and increase the chances of identifying adversarial examples in a non-zero knowledge attack scenario. Our experimental evaluation involved two standard image classification benchmarks and showed that the proposed detection technique provides a high rejection rate of adversarial examples while retaining most of the pristine samples.
Open Access
Journal article
N/A
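A small sketch of how the test-time tolerance could be exploited for detection, assuming the torchdiffeq solver: integrate the same input under two tolerances and use the disagreement of the resulting states as a rejection score. This follows the spirit of the entry above but is not the authors' detector.

```python
import torch
from torchdiffeq import odeint   # assumes the torchdiffeq package is installed

def tolerance_disagreement(ode_func, z0, t_span, tol_fine=1e-3, tol_coarse=1e-1):
    """Per-sample disagreement between two integrations of the same N-ODE block.

    ode_func: learned dynamics f(t, z); z0: (B, D) encoded inputs; t_span: 1-D time grid.
    Adversarial inputs tend to be less stable under a coarser solver tolerance, so a
    large disagreement can serve as a rejection score.
    """
    z_fine = odeint(ode_func, z0, t_span, rtol=tol_fine, atol=tol_fine)[-1]
    z_coarse = odeint(ode_func, z0, t_span, rtol=tol_coarse, atol=tol_coarse)[-1]
    return (z_fine - z_coarse).norm(dim=-1)
```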
Chen Feng; Georgios Tzimiropoulos; Ioannis Patras;
Queen Mary University of London;
Despite the large progress in supervised learning with neural networks, there are significant challenges in obtaining high-quality, large-scale and accurately labelled datasets. In such a context, how to learn in the presence of noisy labels has received more and more attention. As a relatively complex problem, in order to achieve good results, current approaches often integrate components from several fields, such as supervised learning, semi-supervised learning and transfer learning, resulting in complicated methods. Furthermore, they often make multiple assumptions about the type of noise in the data. This affects the model robustness and limits its performance under different noise conditions. In this paper, we consider a novel problem setting, Learning with Unknown Label Noise (LULN), that is, learning when both the degree and the type of noise are unknown. Under this setting, unlike previous methods that often introduce multiple assumptions and lead to complex solutions, we propose a simple, efficient and robust framework named Sample Selection and Relabelling (SSR), which with a minimal number of hyperparameters achieves SOTA results in various conditions. At the heart of our method is a sample selection and relabelling mechanism based on a non-parametric KNN classifier (NPK) $g_q$ and a parametric model classifier (PMC) $g_p$, respectively, to select the clean samples and gradually relabel the noisy samples. Without bells and whistles, such as model co-training, self-supervised pre-training and semi-supervised learning, and with robustness concerning the settings of its few hyper-parameters, our method significantly surpasses previous methods on both CIFAR10/CIFAR100 with synthetic noise and real-world noisy datasets such as WebVision, Clothing1M and ANIMAL-10N. Code is available at https://github.com/MrChenFeng/SSR_BMVC2022.
Open Access
Conference paper
N/A
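A compact sketch in the spirit of the selection/relabelling mechanism described above: a non-parametric KNN vote over the feature space selects clean samples, while confident predictions of the parametric classifier relabel the rest; the thresholds and the exact voting rule are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def select_and_relabel(features, noisy_labels, probs, k=20, threshold=0.8):
    """One selection/relabelling step over the whole training set (illustrative).

    features:     (N, D) embeddings of the training samples
    noisy_labels: (N,)   current (possibly noisy) labels
    probs:        (N, C) softmax predictions of the parametric classifier
    """
    feats = F.normalize(features, dim=1)
    sims = feats @ feats.T                                      # (N, N) cosine similarities
    knn = sims.topk(k + 1, dim=1).indices[:, 1:]                # drop each sample itself
    neigh_labels = noisy_labels[knn]                            # (N, k) neighbour labels
    agreement = (neigh_labels == noisy_labels[:, None]).float().mean(dim=1)
    clean_mask = agreement > 0.5                                # non-parametric selection
    conf, pred = probs.max(dim=1)
    relabel_mask = (~clean_mask) & (conf > threshold)           # parametric relabelling
    new_labels = noisy_labels.clone()
    new_labels[relabel_mask] = pred[relabel_mask]
    return clean_mask, new_labels
```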
Alberto Del Bimbo; Daniele Mugnai; Federico Pernici; Matteo Bruni; Niccolò Biondi
University of Florence;
In this article, we propose a method to partially mimic natural intelligence for the problem of lifelong learning representations that are compatible. We take the perspective of a learning agent that is interested in recognizing object instances in an open dynamic universe in a way in which any update to its internal feature representation does not render the features in the gallery unusable for visual search. We refer to this learning problem as Compatible Lifelong Learning Representations (CL2R), as it considers compatible representation learning within the lifelong learning paradigm. We identify stationarity as the property that the feature representation is required to hold to achieve compatibility and propose a novel training procedure that encourages local and global stationarity on the learned representation. Due to stationarity, the statistical properties of the learned features do not change over time, making them interoperable with previously learned features. Extensive experiments on standard benchmark datasets show that our CL2R training procedure outperforms alternative baselines and state-of-the-art methods. We also provide novel metrics to specifically evaluate compatible representation learning under catastrophic forgetting in various sequential learning tasks. Code is available at https://github.com/NiccoBiondi/CompatibleLifelongRepresentation.
Open Access
Journal article
ACM Multimedia Systems Conference
Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos
CERTH;