VISAPP 2021 Abstracts


Area 1 - Image and Video Formation, Preprocessing and Analysis

Full Papers
Paper Nr: 9
Title:

Stabilizing GANs with Soft Octave Convolutions

Authors:

Ricard Durall, Franz-Josef Pfreundt and Janis Keuper

Abstract: Motivated by recently published methods using frequency decompositions of convolutions (e.g. Octave Convolutions), we propose a novel convolution scheme to stabilize the training and reduce the likelihood of a mode collapse. The basic idea of our approach is to split convolutional filters into additive high and low frequency parts, while shifting weight updates from low to high during the training. Intuitively, this method forces GANs to learn low frequency coarse image structures before descending into fine (high frequency) details. We also show that the use of the proposed soft octave convolutions reduces common artifacts in the frequency domain of generated images. Our approach is orthogonal and complementary to existing stabilization methods and can simply be plugged into any CNN based GAN architecture. Experiments on the CelebA dataset show the effectiveness of the proposed method.
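The core mechanism described in the abstract — an additive split of each filter into low- and high-frequency parts, with emphasis shifting from low to high over training — can be sketched as follows. This is an illustrative toy, not the authors' implementation; the schedule `soft_octave_weights` and the 3x3 kernels are assumptions for demonstration.

```python
import numpy as np

def soft_octave_weights(step, total_steps):
    """Hypothetical linear schedule: the blending factor alpha moves from
    emphasizing the low-frequency filter part (alpha=1) toward the
    high-frequency part (alpha=0) as training progresses."""
    return 1.0 - step / total_steps

def soft_octave_conv_filter(w_low, w_high, alpha):
    """Additive recombination of the low- and high-frequency filter parts,
    weighted by the current schedule value."""
    return alpha * w_low + (1.0 - alpha) * w_high

# Toy 3x3 filter parts: a smoothing (low-frequency) kernel and an
# edge-detecting Laplacian-like (high-frequency) kernel.
w_low = np.full((3, 3), 1.0 / 9.0)
w_high = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)

# Early in training the effective filter is dominated by w_low (coarse
# structure); late in training it is dominated by w_high (fine detail).
early = soft_octave_conv_filter(w_low, w_high, soft_octave_weights(step=0, total_steps=100))
late = soft_octave_conv_filter(w_low, w_high, soft_octave_weights(step=100, total_steps=100))
```

In a real GAN, `w_low` and `w_high` would be learned parameter tensors of each convolutional layer, and the schedule would drive which part receives the effective weight updates.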

Paper Nr: 10
Title:

Latent Space Conditioning on Generative Adversarial Networks

Authors:

Ricard Durall, Kalun Ho, Franz-Josef Pfreundt and Janis Keuper

Abstract: Generative adversarial networks are the state of the art approach towards learned synthetic image generation. Although early successes were mostly unsupervised, bit by bit, this trend has been superseded by approaches based on labelled data. These supervised methods allow a much finer-grained control of the output image, offering more flexibility and stability. Nevertheless, the main drawback of such models is the necessity of annotated data. In this work, we introduce a novel framework that benefits from two popular learning techniques, adversarial training and representation learning, and takes a step towards unsupervised conditional GANs. In particular, our approach exploits the structure of a latent space (learned by the representation learning) and employs it to condition the generative model. In this way, we break the traditional dependency between condition and label, substituting the latter by unsupervised features coming from the latent space. Finally, we show that this new technique is able to produce samples on demand while keeping the quality of its supervised counterpart.

Paper Nr: 15
Title:

Symmetric Skip Connection Wasserstein GAN for High-resolution Facial Image Inpainting

Authors:

Jireh Jam, Connah Kendrick, Vincent Drouard, Kevin Walker, Gee-Sern Hsu and Moi H. Yap

Abstract: The state-of-the-art facial image inpainting methods achieved promising results but face realism preservation remains a challenge. This is due to limitations such as failures in preserving edges and blurry artefacts. To overcome these limitations, we propose a Symmetric Skip Connection Wasserstein Generative Adversarial Network (S-WGAN) for high-resolution facial image inpainting. The architecture is an encoder-decoder with convolutional blocks, linked by skip connections. The encoder is a feature extractor that captures data abstractions of an input image to learn an end-to-end mapping from an input (binary masked image) to the ground-truth. The decoder uses learned abstractions to reconstruct the image. With skip connections, S-WGAN transfers image details to the decoder. Additionally, we propose a Wasserstein-Perceptual loss function to preserve colour and maintain realism on a reconstructed image. We evaluate our method and the state-of-the-art methods on the CelebA-HQ dataset. Our results show S-WGAN produces sharper and more realistic images when visually compared with other methods. The quantitative measures show our proposed S-WGAN achieves the best Structural Similarity Index Measure (SSIM) of 0.94.
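A Wasserstein-Perceptual loss of the kind described would combine a Wasserstein critic term with a feature-space (perceptual) distance. The sketch below is an illustrative assumption — the paper's exact weighting and feature extractor are not given in the abstract — using plain arrays in place of critic scores and deep features.

```python
import numpy as np

def wasserstein_perceptual_loss(critic_real, critic_fake, feat_real, feat_fake, lam=1.0):
    """Illustrative combination (assumed form): a WGAN-style critic gap
    plus a perceptual term measured as L2 distance in feature space."""
    w_term = np.mean(critic_fake) - np.mean(critic_real)  # Wasserstein critic gap
    p_term = np.mean((feat_real - feat_fake) ** 2)        # perceptual (feature) distance
    return w_term + lam * p_term

# Toy values standing in for critic outputs and deep-network features.
loss = wasserstein_perceptual_loss(
    critic_real=np.array([1.0, 0.8]),
    critic_fake=np.array([0.2, 0.4]),
    feat_real=np.ones((2, 8)),
    feat_fake=np.ones((2, 8)),
)
```

With identical features the perceptual term vanishes and only the critic gap remains, so the two terms can be balanced independently via `lam`.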

Paper Nr: 26
Title:

Focus-and-Context Skeleton-based Image Simplification using Saliency Maps

Authors:

Jieying Wang, Leonardo M. Joao, Alexandre Falcão, Jiří Kosinka and Alexandru Telea

Abstract: Medial descriptors offer a promising way for representing, simplifying, manipulating, and compressing images. However, to date, these have been applied in a global manner that is oblivious to salient features. In this paper, we adapt medial descriptors to use the information provided by saliency maps to selectively simplify and encode an image while preserving its salient regions. This allows us to improve the trade-off between compression ratio and image quality as compared to the standard dense-skeleton method while keeping perceptually salient features, in a focus-and-context manner. We show how our method can be combined with JPEG to increase overall compression rates at the cost of a slightly lower image quality. We demonstrate our method on a benchmark composed of a broad set of images.

Paper Nr: 57
Title:

Interpretation of Human Behavior from Multi-modal Brain MRI Images based on Graph Deep Neural Networks and Attention Mechanism

Authors:

Refka Hanachi, Akrem Sellami and Imed R. Farah

Abstract: Interpretation of human behavior by exploiting the complementarity of the information offered by multimodal functional magnetic resonance imaging (fMRI) data is a challenging task. In this paper, we propose to fuse task-fMRI for brain activation and rest-fMRI for functional connectivity with the incorporation of structural MRI (sMRI) as an adjacency matrix to maintain the rich spatial structure between voxels of the brain. We then consider the structural-functional brain connections (3D mesh) as a graph. The aim is to quantify each subject’s performance in voice recognition and identification. More specifically, we propose an advanced multi-view graph auto-encoder based on the attention mechanism, called MGATE, which seeks to learn better representations from both modalities, task- and rest-fMRI, using the Brain Adjacency Graph (BAG), which is constructed based on sMRI. It yields a multi-view representation learned at all vertices of the brain, which is used as input to our trace regression model in order to predict the behavioral score of each subject. Experimental results show that the proposed model achieves better prediction rates and is highly competitive with various existing graph representation learning models in the state of the art.

Paper Nr: 59
Title:

Driver’s Eye Fixation Prediction by Deep Neural Network

Authors:

Mohsen Shirpour, Steven S. Beauchemin and Michael A. Bauer

Abstract: The driving environment is a complex dynamic scene in which a driver’s eye fixation interacts with traffic scene objects to protect the driver from dangerous situations. Prediction of a driver’s eye fixation plays a crucial role in Advanced Driving Assistance Systems (ADAS) and autonomous vehicles. However, currently, no computational framework has been introduced to combine the bottom-up saliency map with the driver’s head pose and gaze direction to estimate a driver’s eye fixation. In this work, we first propose convolutional neural networks to predict the potential saliency regions in the driving environment, and then use the probability of the driver gaze direction, given the head pose, as a top-down factor. We evaluate our model on real data gathered during drives in an urban and suburban environment with an experimental vehicle. Our analyses show promising results.

Paper Nr: 80
Title:

A Retinex Inspired Bilateral Filter for Enhancing Images under Difficult Light Conditions

Authors:

Michela Lecca

Abstract: This paper presents SuPeR-B, a novel, Retinex-inspired spatial color algorithm to enhance images acquired under difficult light conditions, such as pictures containing dark and bright regions caused by backlight and/or local, non-diffused spotlights. SuPeR-B takes as input a color image and improves its readability by processing its color channels independently in accordance with some principles of the Retinex theory. Precisely, SuPeR-B re-works the channel intensity of each pixel, accounting for differences computed both in the spatial and intensity domains. In this way, SuPeR-B acts as a bilateral filter. The experiments, carried out on a real-world dataset, show that SuPeR-B ensures good enhancement results, also in comparison with other state-of-the-art algorithms: SuPeR-B improves the overall content of the image, making the dark regions brighter and more contrasted, while lowering possible chromatic dominants of the light.
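The bilateral behaviour the abstract attributes to SuPeR-B — weighting pixel differences by both spatial and intensity distance — can be illustrated with the classic bilateral filter. This sketch is the textbook form, not the SuPeR-B algorithm itself; the kernel widths `sigma_s` and `sigma_r` are assumed example values.

```python
import numpy as np

def bilateral_weight(p, q, ip, iq, sigma_s, sigma_r):
    """Weight combining a spatial Gaussian on pixel distance with a
    range Gaussian on intensity difference (classic bilateral form)."""
    spatial = np.exp(-np.sum((np.array(p) - np.array(q)) ** 2) / (2 * sigma_s ** 2))
    rng = np.exp(-(ip - iq) ** 2 / (2 * sigma_r ** 2))
    return spatial * rng

def bilateral_filter_channel(img, sigma_s=2.0, sigma_r=25.0, radius=2):
    """Brute-force bilateral filtering of a single channel."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    for y in range(h):
        for x in range(w):
            acc, norm = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        wgt = bilateral_weight((y, x), (ny, nx),
                                               img[y, x], img[ny, nx],
                                               sigma_s, sigma_r)
                        acc += wgt * img[ny, nx]
                        norm += wgt
            out[y, x] = acc / norm
    return out

# A flat region is left unchanged; a sharp intensity edge is preserved
# because the range Gaussian suppresses averaging across it.
flat = np.full((5, 5), 100.0)
smoothed = bilateral_filter_channel(flat)
step = np.zeros((5, 5))
step[:, 2:] = 200.0
edge_preserved = bilateral_filter_channel(step)
```

The range term is what distinguishes this from plain Gaussian smoothing: pixels with very different intensities contribute almost nothing, so edges survive while noise within regions is averaged out.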

Paper Nr: 83
Title:

Using Geometric Graph Matching in Image Registration

Authors:

Giomar S. Olivera, Aura Conci and Leandro F. Fernandes

Abstract: Image registration is a fundamental task in many medical applications, allowing interpreting and analyzing images acquired using different technologies, from different viewpoints, or at different times. The image registration task is particularly challenging when the images have little high-frequency information and when average brightness changes over time, as is the case with infrared breast exams acquired using a dynamic protocol. This paper presents a new method for registering these images, where each one is represented in a compact form by a geometric graph, and the registration is done by comparing graphs. The application of the proposed technique consists of five stages: (i) pre-process the infrared breast image; (ii) extract the internal linear structures that characterize arteries, vascular structures, and other hot regions; (iii) create a geometric graph to represent such structures; (iv) perform structure registration by comparing graphs; and (v) estimate the transformation function. The Dice coefficient, Jaccard index, and total overlap agreement measure are considered to evaluate the results’ quality. The output obtained on a public database of infrared breast images is compared against SURF interest points for image registration and a state-of-the-art approach for infrared breast image registration from the literature. The analyses show that the proposed method outperforms the others.

Paper Nr: 90
Title:

Precise Upright Adjustment of Panoramic Images

Authors:

Nobuyuki Kita

Abstract: An equirectangular format is often used when an omnidirectional image taken by a 360° camera is expanded into a single image, which is referred to as a panoramic image. If an omnidirectional image is expanded into a panoramic image using the upward direction of the inclined camera, the image may look unstable and “wavy.” It is important to correct the inclination using the zenith direction as a reference; this task is called “upright adjustment.” In this paper, we propose a method for upright adjustment by estimating the inclination of a camera in a wavy panoramic image using vertical lines in the environment, e.g., indoors or outdoors near a building. The advantage of the proposed method is that 3D straight lines are detected with high accuracy and high speed directly from a panoramic image without passing through a spherical image. Experimental results of upright adjustment of the wavy panoramic images taken in multiple places show that the proposed method is accurate and can be applied in a wide range of indoor and outdoor scenarios.

Paper Nr: 103
Title:

Convolution Filter based Efficient Multispectral Image Demosaicking for Compact MSFAs

Authors:

Vishwas Rathi and Puneet Goyal

Abstract: Using multispectral filter arrays (MSFAs) and demosaicking, low-cost multispectral imaging systems can be developed that are useful in many applications. However, multispectral image demosaicking is a challenging task because of the very sparse sampling of each spectral band present in the MSFA. The selection of the MSFA is crucial for the applicability and performance of demosaicking methods. Here, we consider widely accepted and preferred MSFAs that are compact and designed using a binary-tree-based approach, and for these compact MSFAs, we propose a new efficient demosaicking method that relies on performing filtering operations and can be used for multispectral images with different numbers of bands. We also present new filters for demosaicking based on the probability of appearance of spectral bands in binary-tree-based MSFAs. Detailed experiments are performed on multispectral images of two different benchmark datasets. Experimental results reveal that the proposed method has wider applicability and is efficient; it consistently outperforms the existing state-of-the-art generic multispectral image demosaicking methods in terms of the different image quality metrics considered.

Paper Nr: 125
Title:

Clustering-based Sequential Feature Selection Approach for High Dimensional Data Classification

Authors:

M. Alimoussa, A. Porebski, N. Vandenbroucke, R. H. Thami and S. El Fkihi

Abstract: Feature selection has become the focus of many research applications, especially when datasets tend to be huge. Recently, approaches that use feature clustering techniques have gained much attention for their ability to improve the selection process. In this paper, we propose a clustering-based sequential feature selection approach based on a three-step filter model. First, irrelevant features are removed. Then, an automatic feature clustering algorithm is applied in order to divide the feature set into a number of clusters in which features are redundant or correlated. Finally, one feature is sequentially selected per group. Two experiments are conducted, the first one using six real-world numerical datasets and the second one using features extracted from three color texture image datasets. Compared to seven feature selection algorithms, the obtained results show the effectiveness and the efficiency of our approach.

Paper Nr: 127
Title:

Dense Open-set Recognition with Synthetic Outliers Generated by Real NVP

Authors:

Matej Grcić, Petra Bevandić and Siniša Segvić

Abstract: Today’s deep models are often unable to detect inputs which do not belong to the training distribution. This gives rise to confident incorrect predictions which could lead to devastating consequences in many important application fields such as healthcare and autonomous driving. Interestingly, both discriminative and generative models appear to be equally affected. Consequently, this vulnerability represents an important research challenge. We consider an outlier detection approach based on discriminative training with jointly learned synthetic outliers. We obtain the synthetic outliers by sampling an RNVP model which is jointly trained to generate datapoints at the border of the training distribution. We show that this approach can be adapted for simultaneous semantic segmentation and dense outlier detection. We present image classification experiments on CIFAR-10, as well as semantic segmentation experiments on three existing datasets (StreetHazards, WD-Pascal, Fishyscapes Lost & Found), and one contributed dataset. Our models perform competitively with respect to the state of the art despite producing predictions with only one forward pass.

Paper Nr: 130
Title:

Shared Information-Based Late Fusion for Four Mammogram Views Retrieval using Data-driven Distance Selection

Authors:

Amira Jouirou, Abir Baâzaoui and Walid Barhoumi

Abstract: Content-Based Mammogram Retrieval (CBMR) represents the most effective method for breast cancer diagnosis, especially CBMR based on the fusion of different mammogram views. In this work, an efficient four-view CBMR method is proposed in order to further improve mammogram retrieval performance. The proposed method consists in combining the retrieval results of the four views provided by screening mammography, which are the Medio-Lateral Oblique (MLO) and Cranio-Caudal (CC) views of the Left (LMLO and LCC) and Right (RMLO and RCC) breasts. In order to personalize each query view in the final result, a classified mammogram dataset has been used to retrieve the mammograms relevant to the query. Indeed, the proposed method takes as input four query views corresponding to the four different views (LMLO, LCC, RMLO and RCC) and displays the most similar mammogram cases for each breast view using dynamic data-driven distance selection and shared information. In particular, we explore the use of random forest machine learning in order to predict the most appropriate similarity measure for each query view, and late fusion at the four-view result level, through the shared-information concept, for the final retrieval. According to their clinical cases, the retrieved mammograms can be analyzed in order to help radiologists make the right decision regarding the four-view mammogram query. The reported experimental results on the challenging Digital Database for Screening Mammography (DDSM) dataset prove the effectiveness of the proposed four-view CBMR method.

Paper Nr: 142
Title:

Inspection of Industrial Coatings based on Multispectral BTF

Authors:

Ryosuke Suzuki, Fumihiko Sakaue, Jun Sato, Ryuichi Fukuta, Taketo Harada and Kazuhisa Ishimaru

Abstract: In this paper, we propose a method to inspect coatings of industrial products in a factory automation system. The coating of industrial products is important because it directly affects the impression of the product, and a large amount of cost is spent on its inspection. Because many colors are used in the coating of industrial products and there are various surface treatments such as matte and mirror finishes, the appearance of these products varies hugely. Therefore, it is difficult to obtain the properties of the surfaces with ordinary camera systems, and thus, in most cases they are inspected manually in the current system. In this paper, we present a method of representing their surface properties, called multispectral BTF, by imaging products under narrow-band light from various directions. We also show a method for inspection using a one-class discriminator based on a deep neural network using the multispectral BTF. Several experimental results show that our proposed BTF and one-class classifier can inspect various kinds of coating.

Paper Nr: 160
Title:

Hallucinating Saliency Maps for Fine-grained Image Classification for Limited Data Domains

Authors:

Carola Figueroa-Flores, Bogdan Raducanu, David Berga and Joost van de Weijer

Abstract: It has been shown that saliency maps can be used to improve the performance of object recognition systems, especially on datasets that have only limited training data. However, a drawback of such an approach is that it requires a pre-trained saliency network. In the current paper, we propose an approach which does not require explicit saliency maps to improve image classification, but they are learned implicitly, during the training of an end-to-end image classification task. We show that our approach obtains similar results as the case when the saliency maps are provided explicitly. We validate our method on several datasets for fine-grained classification tasks (Flowers, Birds and Cars), and show that especially for domains with limited data the proposed method significantly improves the results.

Paper Nr: 200
Title:

An Enhanced Adversarial Network with Combined Latent Features for Spatio-temporal Facial Affect Estimation in the Wild

Authors:

Decky Aspandi, Federico Sukno, Björn Schuller and Xavier Binefa

Abstract: Affective Computing has recently attracted the attention of the research community, due to its numerous applications in diverse areas. In this context, the emergence of video-based data makes it possible to enrich the widely used spatial features with the inclusion of temporal information. However, such spatio-temporal modelling often results in very high-dimensional feature spaces and large volumes of data, making training difficult and time consuming. This paper addresses these shortcomings by proposing a novel model that efficiently extracts both spatial and temporal features of the data by means of its enhanced temporal modelling based on latent features. Our proposed model consists of three major networks, coined Generator, Discriminator, and Combiner, which are trained in an adversarial setting combined with curriculum learning to enable our adaptive attention modules. In our experiments, we show the effectiveness of our approach by reporting our competitive results on both the AFEW-VA and SEWA datasets, suggesting that temporal modelling improves the affect estimates both in qualitative and quantitative terms. Furthermore, we find that the inclusion of attention mechanisms leads to the highest accuracy improvements, as its weights seem to correlate well with the appearance of facial movements, both in terms of temporal localisation and intensity. Finally, we observe a sequence length of around 160 ms to be optimal for temporal modelling, which is consistent with other relevant findings utilising similar lengths.

Paper Nr: 207
Title:

Hybrid-S2S: Video Object Segmentation with Recurrent Networks and Correspondence Matching

Authors:

Fatemeh Azimi, Stanislav Frolov, Federico Raue, Jörn Hees and Andreas Dengel

Abstract: One-shot Video Object Segmentation (VOS) is the task of pixel-wise tracking an object of interest within a video sequence, where the segmentation mask of the first frame is given at inference time. In recent years, Recurrent Neural Networks (RNNs) have been widely used for VOS tasks, but they often suffer from limitations such as drift and error propagation. In this work, we study an RNN-based architecture and address some of these issues by proposing a hybrid sequence-to-sequence architecture named HS2S, utilizing a dual mask propagation strategy that allows incorporating the information obtained from correspondence matching. Our experiments show that augmenting the RNN with correspondence matching is a highly effective solution to reduce the drift problem. The additional information helps the model to predict more accurate masks and makes it robust against error propagation. We evaluate our HS2S model on the DAVIS2017 dataset as well as Youtube-VOS. On the latter, we achieve an improvement of 11.2pp in the overall segmentation accuracy over RNN-based state-of-the-art methods in VOS. We analyze our model’s behavior in challenging cases such as occlusion and long sequences and show that our hybrid architecture significantly enhances the segmentation quality in these difficult scenarios.

Paper Nr: 223
Title:

Enhanced CycleGAN Dehazing Network

Authors:

Zahra Anvari and Vassilis Athitsos

Abstract: Single image dehazing is a challenging problem, and it is far from solved. Most current solutions require paired image datasets that include both hazy images and their corresponding haze-free ground-truth. However, in reality, lighting conditions and other factors can produce a range of haze-free images that can serve as ground truth for a hazy image, and a single ground truth image cannot capture that range. This limits the scalability and practicality of paired methods in real-world applications. In this paper, we focus on unpaired single image dehazing, reduce the image dehazing problem to an unpaired image-to-image translation problem, and propose an Enhanced CycleGAN Dehazing Network (ECDN). We enhance CycleGAN from different angles for the dehazing purpose. We employ a global-local discriminator structure to deal with spatially varying haze. We define a self-regularized color loss and utilize it along with a perceptual loss to generate more realistic and visually pleasing images. We use an encoder-decoder architecture with residual blocks in the generator, with skip connections so that the network better preserves the details. Through an ablation study, we demonstrate the effectiveness of different modules in the performance of the proposed network. Our extensive experiments over two benchmark datasets show that our network outperforms previous work in terms of PSNR and SSIM.

Short Papers
Paper Nr: 4
Title:

Fast Bridgeless Pyramid Segmentation for Organized Point Clouds

Authors:

Martin Madaras, Martin Stuchlík and Matúš Talčík

Abstract: An intelligent automatic robotic system needs to understand the world as fast as possible. A common way to capture the world is to use a depth camera. The depth camera produces an organized point cloud that later needs to be processed to understand the scene. Usually, segmentation is one of the first preprocessing steps for the data processing pipeline. Our proposed pyramid segmentation is a simple, fast and lightweight split-and-merge method designed for depth cameras. The algorithm consists of two steps, edge detection and a hierarchical method for bridgeless labeling of connected components. The pyramid segmentation generates the seeds hierarchically, in a top-down manner, from the largest regions to the smallest ones. The neighboring areas around the seeds are filled in a parallel manner, by rendering axis-aligned line primitives, which makes the performance of the method fast. The hierarchical approach of labeling enables connecting neighboring segments without unnecessary bridges in a parallel way that can be efficiently implemented using CUDA.

Paper Nr: 5
Title:

Combating Mode Collapse in GAN Training: An Empirical Analysis using Hessian Eigenvalues

Authors:

Ricard Durall, Avraam Chatzimichailidis, Peter Labus and Janis Keuper

Abstract: Generative adversarial networks (GANs) provide state-of-the-art results in image generation. However, despite being so powerful, they still remain very challenging to train. This is in particular caused by their highly non-convex optimization space leading to a number of instabilities. Among them, mode collapse stands out as one of the most daunting ones. This undesirable event occurs when the model can only fit a few modes of the data distribution, while ignoring the majority of them. In this work, we combat mode collapse using second-order gradient information. To do so, we analyse the loss surface through its Hessian eigenvalues, and show that mode collapse is related to the convergence towards sharp minima. In particular, we observe how the eigenvalues of the generator are directly correlated with the occurrence of mode collapse. Finally, motivated by these findings, we design a new optimization algorithm called nudged-Adam (NuGAN) that uses spectral information to overcome mode collapse, leading to empirically more stable convergence properties.
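Hessian eigenvalues of a network loss are typically estimated without forming the Hessian, using power iteration on Hessian-vector products. The sketch below illustrates this standard technique on a toy quadratic loss (whose Hessian is known exactly); it is not the paper's NuGAN algorithm, and the quadratic example is an assumption for demonstration.

```python
import numpy as np

def top_hessian_eigenvalue(hvp, dim, iters=100, seed=0):
    """Power iteration using only Hessian-vector products (hvp), as
    commonly done for neural networks where the full Hessian is far
    too large to form explicitly."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)
    # Rayleigh quotient at the converged direction.
    return float(v @ hvp(v))

# Toy example: the loss L(w) = 0.5 * w^T A w has Hessian A, so the
# Hessian-vector product is simply A @ v. The largest eigenvalue of
# this diagonal A is 5, which power iteration recovers.
A = np.diag([5.0, 2.0, 1.0])
lam = top_hessian_eigenvalue(lambda v: A @ v, dim=3)
```

In the GAN setting, the `hvp` callback would be implemented with automatic differentiation (a gradient-of-gradient-dot-vector computation), and large top eigenvalues would indicate the sharp minima the abstract associates with mode collapse.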

Paper Nr: 13
Title:

RGB-D-based Human Detection and Segmentation for Mobile Robot Navigation in Industrial Environments

Authors:

Oguz Kedilioglu, Markus Lieret, Julia Schottenhamml, Tobias Würfl, Andreas Blank, Andreas Maier and Jörg Franke

Abstract: Automated guided vehicles (AGV) are nowadays a common option for the efficient and automated in-house transportation of various cargo and materials. With the additional application of unmanned aerial vehicles (UAV) in the delivery and intralogistics sector, this flow of materials is expected to be extended by the third dimension within the next decade. To ensure a collision-free movement for those vehicles, optical, ultrasonic or capacitive distance sensors are commonly employed. While such systems allow a collision-free navigation, they are not able to distinguish humans from static objects and therefore require the robot to move at a human-safe speed at all times. To overcome these limitations and allow environment-sensitive collision avoidance for UAVs and AGVs, we provide a solution for the depth camera based real-time semantic segmentation of workers in industrial environments. The semantic segmentation is based on an adapted version of the deep convolutional neural network (CNN) architecture FuseNet. After explaining the underlying methodology, we present an automated approach for the generation of weakly annotated training data and evaluate the performance of the trained model compared to other well-known approaches.

Paper Nr: 22
Title:

GeST: A New Image Segmentation Technique based on Graph Embedding

Authors:

Anthony Perez

Abstract: We propose a new framework to develop image segmentation algorithms using graph embedding, a well-studied tool from complex network analysis. So-called embeddings are low-dimensional representations of nodes of the graph that encompass several structural properties such as neighborhoods and community structure. The main idea of our framework is to first consider an image as a set of superpixels, and then compute embeddings for the corresponding undirected weighted Region Adjacency Graph. The resulting segmentation is then obtained by clustering embeddings. To the best of our knowledge, known complex network-based segmentation techniques rely on community detection algorithms. By introducing graph embedding for image segmentation, we combine two nice properties of the aforementioned segmentation techniques, namely working on small graphs with low-dimensional representations. To illustrate the relevance of our approach, we propose GeST, an implementation of this framework using node2vec and agglomerative clustering. We evaluate our algorithm on a publicly available dataset and show that it produces results of comparable quality to state-of-the-art segmentation techniques while requiring low computational complexity and memory.

Paper Nr: 32
Title:

Reinforcement Learning based Video Summarization with Combination of ResNet and Gated Recurrent Unit

Authors:

Muhammad S. Afzal and Muhammad A. Tahir

Abstract: Video cameras are becoming ubiquitous with the passage of time. A huge amount of video data is generated daily that needs to be handled efficiently with limited storage and processing power. Video summarization offers an effective way to quickly review lengthy videos while controlling storage and processing power requirements. The deep reinforcement-based deep summarization network (DR-DSN) is a popular method for video summarization, but its performance is limited and can be enhanced with a better representation of the video data. Most recently, it has been observed that deep residual networks are quite successful in many computer vision applications, including video retrieval and captioning. In this paper, we investigate deep feature representations for video summarization using a deep residual network, where ResNet-152 is used to extract deep video features. To speed up the model, long short-term memory is replaced with a gated recurrent unit (GRU), which gave us the flexibility to add another RNN layer, resulting in a significant improvement in performance. With this combination of ResNet-152 and a two-layer GRU, we performed experiments on the SumMe video dataset and obtained results that are not only better than DR-DSN but also better than several state-of-the-art video summarization methods.
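The LSTM-to-GRU swap mentioned in the abstract trades the LSTM's separate cell state for a single hidden state with two gates, which is cheaper per step. Below is the standard GRU update as a minimal numpy sketch — the unit any stacked-GRU model is built from, not the paper's specific two-layer configuration; the dimensions and random parameters are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, params):
    """One GRU step in its standard formulation."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde         # blend old and candidate state

# Toy dimensions: 4-d input feature (e.g. a frame embedding), 3-d hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
          rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
          rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)))

# Run a short sequence of random "frame features" through the cell.
h = np.zeros(d_h)
for _ in range(5):
    h = gru_cell(rng.normal(size=d_in), h, params)
```

A second RNN layer, as the paper adds, would simply feed each layer-one hidden state `h` as the input `x` of another `gru_cell` with its own parameters.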

Paper Nr: 35
Title:

PIU-Net: Generation of Virtual Panoramic Views from Colored Point Clouds

Authors:

Michael G. Adam and Eckehard Steinbach

Abstract: As VR systems become more and more widespread, the interest in high-quality content increases drastically. One way of generating such data is by digitizing real world environments using SLAM-based mapping devices, which capture both the geometry of the environment (typically as point clouds) and its appearance (typically as a small set of RGB panoramas). However, when creating such digital representations of real-world spaces, artifacts and missing data cannot be fully avoided. Furthermore, free movement is often restricted. In this paper, we introduce a technique which allows for the generation of high quality panoramic views at any position within a captured real world scenario. Our method consists of two steps. First, we render virtual panoramas from the projected point cloud data. Those views exhibit imperfections compared to the real panoramas, which are corrected in the second step by an inpainting neural network. The network itself is trained using a small set of panoramic images captured during the mapping process. In order to take full advantage of the panoramic information, we use a U-Net-like structure with circular convolutions. Further, a custom perceptual panoramic loss is applied. The resulting virtual panoramas show high quality and spatial consistency. Furthermore, the network learns to correct erroneous point cloud data. We evaluate the proposed approach by generating virtual panoramas along novel trajectories where the panorama positions deviate from the originally selected capturing points and observe that the proposed approach is able to generate smooth and temporally consistent walkthrough sequences.

Paper Nr: 67
Title:

Deep Convolutional Second Generation Curvelet Transform-based MR Image for Early Detection of Alzheimer’s Disease

Authors:

Takrouni Wiem and Douik Ali

Abstract: Merging neuroimaging data with machine learning has important potential for the early diagnosis of Alzheimer’s Disease (AD) and Mild Cognitive Impairment (MCI). The applicability of multiclass classification and of prediction to define the progress of the different stages of the disease has been relatively understudied. This paper presents a short review of the history of deep learning and introduces a new solution for delineating the changes in each stage of AD. Our Deep Convolutional Second-Generation Curvelet Transform Network (SGCTN) is divided into two levels: the feature learning level is the first task, which combines a Second-Generation Curvelet (SGC) with autoencoder-trained features. Then, for each hidden layer, pooling is used to obtain our convolutional neural network. This network is used to learn predictive information for binary and multiclass classification. Our experiments use different numbers of Cognitively Normal (CN), AD, Early MCI (EMCI), and Late MCI (LMCI) subjects from the AD Neuroimaging Initiative (ADNI). Magnetic Resonance Imaging (MRI) information modalities are considered as input. The proposed network achieves 98.1% accuracy for delineating early MCI from CN. Furthermore, for detecting the distinctive level of AD, a multiclass classification test measures the global accuracy and, more particularly, differentiates the MCI and AD groups from the CN group with 96% accuracy. Compared to state-of-the-art deep approaches, our results indicate that our architecture can achieve better performance on the same databases. Model analysis based on SGC can improve the classification performance, as shown by comparison experiments.

Paper Nr: 74
Title:

Efficient Multi-stream Temporal Learning and Post-fusion Strategy for 3D Skeleton-based Hand Activity Recognition

Authors:

Yasser Boutaleb, Catherine Soladie, Nam-Duong Duong, Amine Kacete, Jérôme Royan and Renaud Seguier

Abstract: Recognizing first-person hand activity is a challenging task, especially when not enough data are available. In this paper, we tackle this challenge by proposing a new hybrid learning pipeline for skeleton-based hand activity recognition, which is composed of three blocks. First, for a given sequence of the hand’s joint positions, spatial features are extracted using a dedicated combination of local and global spatial hand-crafted features. Then, the temporal dependencies are learned using a multi-stream learning strategy. Finally, a hand activity sequence classifier is learned via our Post-fusion strategy, applied to the previously learned temporal dependencies. The experiments, evaluated on two real-world datasets, show that our approach performs better than the state-of-the-art approaches. As an ablation study, we compared our Post-fusion strategy with three traditional fusion baselines and showed an accuracy improvement of more than 2.4%.

Paper Nr: 101
Title:

Hybrid Feature based Pyramid Network for Nighttime Semantic Segmentation

Authors:

Yuqi Li, Yinan Ma, Jing Wu and Chengnian Long

Abstract: In recent years, considerable progress has been made on semantic segmentation tasks. However, most existing works focus only on daytime images under favorable illumination conditions. In this work, we aim at nighttime semantic segmentation, which remains unsolved due to the over- and under-exposure problems caused by complex lighting conditions and the lack of trainable nighttime datasets, as pixel-level annotation requires extensive time and human effort. We (1) propose a hybrid network combining an image pyramid network and the Gray Level Co-occurrence Matrix (GLCM); the GLCM is a significant descriptor of texture information, whose statistical features compensate for the texture information missing due to over- and under-exposure at night; (2) design an exposure-aware encoder network by fusing hybrid features hierarchically in GLCM fusion layers; (3) elaborately generate a trainable nighttime dataset, the Carla-based Synthesis Nighttime dataset (CSN dataset), with 10027 synthesized images to resolve the problem of large-scale human annotation. To check whether the network trained on synthesized images is effective in the real world, we also collect a real-world dataset called NightCampus, with 500 annotated nighttime images, used as a test set. We show that our network, trained on the synthetic dataset, yields top performance on our real-world dataset.
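
The GLCM used above is a standard texture descriptor: a normalized matrix of how often gray level i co-occurs with gray level j at a fixed pixel offset, from which statistics such as contrast and homogeneity are derived. A small NumPy sketch (the toy image and the choice of statistics are illustrative, not the paper's configuration):

```python
import numpy as np

def glcm(img, levels, dx=1, dy=0):
    """Gray Level Co-occurrence Matrix for one offset, normalized to probabilities."""
    M = np.zeros((levels, levels))
    H, W = img.shape
    for y in range(H - dy):
        for x in range(W - dx):
            M[img[y, x], img[y + dy, x + dx]] += 1
    return M / M.sum()

def glcm_stats(P):
    i, j = np.indices(P.shape)
    contrast = np.sum(P * (i - j) ** 2)            # local intensity variation
    homogeneity = np.sum(P / (1.0 + (i - j) ** 2)) # closeness to the diagonal
    return contrast, homogeneity

img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
P = glcm(img, levels=4)                 # horizontal neighbor offset (dx=1, dy=0)
contrast, homogeneity = glcm_stats(P)
```

Such statistics remain informative even where raw intensities are clipped by over- or under-exposure, which is why they can complement learned features at night.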

Paper Nr: 105
Title:

A New Generic Progressive Approach based on Spectral Difference for Single-sensor Multispectral Imaging System

Authors:

Vishwas Rathi, Medha Gupta and Puneet Goyal

Abstract: Single-sensor RGB cameras use a color filter array to capture the initial image and a demosaicking technique to reproduce a full-color image. A similar concept can be extended from the color filter array (CFA) to a multispectral filter array (MSFA), which allows us to capture a multispectral image using a single sensor at low cost via MSFA demosaicking. Binary-tree-based MSFAs can be designed for any k-band multispectral image and are preferred; however, the existing demosaicking methods are either not generic or of limited efficacy. In this paper, we propose a new generic demosaicking method applicable to any k-band MSFA designed using the preferred binary-tree-based approach. The proposed method involves applying bilinear interpolation and estimating the spectral correlation differences appropriately and progressively. Experimental results on two different multispectral image datasets consistently show that our method outperforms the existing state-of-the-art methods, both visually and quantitatively, as per the different metrics.
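
The bilinear interpolation step in such demosaicking pipelines is commonly implemented as a convolution of the sparse band samples with a fixed kernel. A hedged NumPy sketch for one band sampled on a 2x2 lattice (the sampling pattern and kernel are the textbook case, not necessarily the paper's MSFA layout):

```python
import numpy as np

# Classic bilinear kernel: a sample keeps weight 1; missing pixels are averaged
# from their nearest samples (weights 0.5 axial, 0.25 diagonal).
BILINEAR_K = np.array([[0.25, 0.5, 0.25],
                       [0.5,  1.0, 0.5 ],
                       [0.25, 0.5, 0.25]])

def conv2d_same(img, k):
    """'Same' correlation with zero padding (so image borders are attenuated)."""
    p = k.shape[0] // 2
    padded = np.pad(img, p)
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def bilinear_band(mosaic, mask):
    """Bilinear estimate of one band from its sparse MSFA samples."""
    return conv2d_same(mosaic * mask, BILINEAR_K)

H, W = 6, 6
mask = np.zeros((H, W)); mask[::2, ::2] = 1   # this band sampled at every 2nd pixel
band = np.full((H, W), 5.0)                   # constant toy scene
est = bilinear_band(band, mask)               # interior is reconstructed exactly
```

The spectral-difference step would then refine each band by interpolating differences between bands rather than raw intensities; that refinement is method-specific and omitted here.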

Paper Nr: 106
Title:

Curve based Fast Detail Enhancement for Biomedical Images

Authors:

Ran Fei, Ying Weng, Yiming Zhang and Jonathan Lund

Abstract: Biomedical images are widely collected from various applications and are used for patient screening, diagnosis and treatment. The dark regions of biomedical images may play as important a role as the bright regions. Enhancing the details in the dark regions, while maintaining the quality of the rest of the image, reveals more information for doctors and surgeons in medical procedures. This paper proposes a fast method to adaptively enhance the details in the dark regions of biomedical images, including X-rays and video frames of laparoscopy in minimally invasive surgery (MIS).

Paper Nr: 123
Title:

A Summarized Semantic Structure to Represent Manipulation Actions

Authors:

Tobias Strübing, Fatemeh Ziaeetabar and Florentin Wörgötter

Abstract: To represent human manipulation actions in a simple and understandable way, we previously proposed a framework called enriched semantic event chains (eSEC), which creates a temporal sequence of static and dynamic spatial relations between the objects in a manipulation. The eSEC framework has so far only been used for manipulation actions performed with one hand; here, we want to extend it to interactions that involve more hands. As eSEC descriptors take the form of huge matrices, we need a concise version of them. Therefore, we applied statistical and semantic analyses to summarize the current eSEC while preserving its important features, introducing an enhanced eSEC (e2SEC). This summarization is done by reducing the number of rows in an eSEC matrix and merging semantic spatial relations between manipulated objects. The resulting e2SEC framework has 20% fewer rows, 16.7% fewer static and 11.1% fewer dynamic spatial relations, while still maintaining the eSEC's efficiency in recognizing and differentiating manipulation actions. This simplification paves the way for simpler recognition and prediction of complex actions and interactions in a shorter time, and is beneficial in real-time applications such as human-robot interaction.

Paper Nr: 128
Title:

An Effective 3D ResNet Architecture for Stereo Image Retrieval

Authors:

E. Ghodhbani, M. Kaaniche and A. Benazza-Benyahia

Abstract: While recent stereo image retrieval techniques have been developed mainly on the basis of statistical approaches, this work aims to investigate deep learning ones. More precisely, our contribution consists in designing a two-branch neural network to extract deep features from the stereo pair. In this respect, a 3D residual network architecture is first employed to exploit the high correlation existing within the stereo pair. This 3D model is then combined with a 2D one applied to the disparity maps, resulting in deep feature representations of the texture information as well as the depth information. Our experiments, carried out on a large-scale stereo image dataset, have shown the good performance of the proposed approach compared to state-of-the-art methods.

Paper Nr: 141
Title:

Real-time Multispectral Image Processing and Registration on 3D Point Cloud for Vineyard Analysis

Authors:

Thibault Clamens, Georgios Alexakis, Raphaël Duverne, Ralph Seulin, Eric Fauvet and David Fofi

Abstract: Nowadays, precision agriculture and precision viticulture are under strong development. In order to accomplish effective actions, robots require robust perception of the culture and the surrounding environment. Computer vision systems have to identify plant parts (branches, stems, leaves, flowers, fruits, vegetables, etc.) and their respective health status. Moreover, they must merge various types of plant information, measure agronomic indices, classify them and finally extract data that enable the agriculturist or expert to make a relevant decision. We propose a real-time method to acquire, process and register multispectral images fused with 3D data. The sensor system, consisting of a multispectral camera and an RGB-D sensor, can be embedded on a ground robot or other terrestrial vehicles. Experiments conducted in the vineyard demonstrate that agronomic analyses are possible.

Paper Nr: 149
Title:

Upsampling Attention Network for Single Image Super-resolution

Authors:

Zhijie Zheng, Yuhang Jiao and Guangyou Fang

Abstract: Recently, convolutional neural networks (CNNs) have been widely used in single image super-resolution (SISR) and have made significant advances. However, most existing CNN-based SISR models do not fully utilize the extracted features during upsampling, causing information bottlenecks that hinder the expressive ability of the networks. To resolve these problems, we propose an upsampling attention network (UAN) for richer feature extraction and reconstruction. Specifically, we present a structure based on residual attention groups (RAGs) to extract structural and frequency information, composed of several residual feature attention blocks (RFABs) with a non-local skip connection. Each RFAB adaptively rescales spatial- and channel-wise features by paying attention to correlations among them. Furthermore, we propose an upsampling attention block (UAB), which not only applies parallel upsampling processes to obtain richer feature representations, but also combines them to obtain better reconstruction results. Experiments on standard benchmarks show the advantage of our UAN over state-of-the-art methods, both in objective metrics and in visual quality.
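
The channel-wise rescaling performed inside attention blocks of this kind generally follows the squeeze-and-excitation pattern: pool each channel to a scalar, pass the vector through a small bottleneck MLP, and multiply the result back onto the feature map. A minimal NumPy sketch under those assumptions (shapes and weights are illustrative, not the UAN's):

```python
import numpy as np

def channel_attention(feat, W1, W2):
    """Squeeze-and-excitation style channel attention on (C, H, W) activations."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    s = feat.mean(axis=(1, 2))                  # squeeze: global average pool -> (C,)
    w = sigmoid(W2 @ np.maximum(0.0, W1 @ s))   # excitation: FC-ReLU-FC-sigmoid -> (C,)
    return feat * w[:, None, None]              # rescale each channel by its weight

rng = np.random.default_rng(1)
C, r = 8, 2                                     # channels, reduction ratio (illustrative)
feat = rng.normal(size=(C, 5, 5))
W1 = rng.normal(size=(C // r, C))               # bottleneck down-projection
W2 = rng.normal(size=(C, C // r))               # up-projection back to C channels
out = channel_attention(feat, W1, W2)
```

Because each weight lies in (0, 1), informative channels are preserved while uninformative ones are suppressed, which is the rescaling the RFAB description refers to.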

Paper Nr: 156
Title:

Deepfake Detection using Capsule Networks and Long Short-Term Memory Networks

Authors:

Akul Mehra, Luuk Spreeuwers and Nicola Strisciuglio

Abstract: With the recent advancements of technology, and in particular of graphics processing and artificial intelligence algorithms, fake media generation has become easier. Using deep learning techniques like Deepfakes and FaceSwap, anyone can generate fake videos by manipulating the face or voice of target subjects in videos. These AI-synthesized videos are a big threat to the authenticity and trustworthiness of online information and can be used for malicious purposes. Detecting face tampering in videos is therefore of utmost importance. We propose a spatio-temporal hybrid model of Capsule Networks integrated with Long Short-Term Memory (LSTM) networks, which exploits the inconsistencies in videos to distinguish real from fake videos. We use three different frame selection techniques and show that frame selection has a significant impact on the performance of the models. The combined Capsule and LSTM network has performance comparable to state-of-the-art models with about one fifth the number of parameters, resulting in reduced computational cost.

Paper Nr: 162
Title:

Characteristics of Minimum Variance Beamformer for Frequency and Plane-wave Compounding

Authors:

Ryoya Kozai, Norio Tagawa, Masasumi Yoshizawa and Takasuke Irie

Abstract: Recently, coherent plane-wave compounding (CPWC), which achieves high spatiotemporal resolution, has been studied actively as a spatial compounding beamformer. Furthermore, various frequency compounding methods have been proposed for reducing speckle noise. We previously proposed a method called frequency and plane-wave compounding minimum variance distortionless response (FPWC-MVDR), which achieves high spatial resolution imaging by simultaneously optimizing frequency and spatial compounding based on a minimum variance scheme. In the algorithm of this method, the data-compounded-on-receive MVDR (DCR-MVDR) principle developed for CPWC is extended and applied. In this study, through experiments, we analyze the features and characteristics of FPWC-MVDR and the weaknesses to be addressed in future work.

Paper Nr: 164
Title:

Single-image Background Removal with Entropy Filtering

Authors:

Chang-Chieh Cheng

Abstract: Background removal is often used for segmentation of the main subject in a photograph. This paper proposes a new method of background removal for a single image. The proposed method uses Shannon entropy to quantify the texture complexity of the background and foreground areas. A normalized entropy filter is applied to compute the entropy of each pixel. The pixels can be classified effectively if the entropy distributions of the background and foreground can be distinguished. To optimize performance, the proposed method constructs an image pyramid such that most background pixels can be labeled in a low-resolution image; thus, the computational cost of entropy calculation in the original-resolution image is reduced. Connected component labeling is also adopted for denoising, to retain the main subject area completely.
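
A per-pixel entropy filter of the kind described can be sketched directly: for each pixel, histogram the gray levels in its neighborhood and compute the Shannon entropy of that histogram. The window size, bin count, and threshold below are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def entropy_filter(img, win=3, bins=8):
    """Per-pixel Shannon entropy of gray levels in a (win x win) neighborhood."""
    p = win // 2
    padded = np.pad(img, p, mode="edge")
    out = np.zeros(img.shape)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = padded[i:i + win, j:j + win]
            hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
            q = hist[hist > 0] / hist.sum()
            out[i, j] = -np.sum(q * np.log2(q))   # 0 for perfectly flat patches
    return out

img = np.zeros((8, 8), dtype=np.uint8)                           # flat background
img[2:6, 2:6] = np.arange(16, dtype=np.uint8).reshape(4, 4) * 16 # textured "subject"
ent = entropy_filter(img)
mask = ent > 0.5   # textured foreground has higher entropy than the flat background
```

On a flat background the entropy is exactly zero, so thresholding the entropy map cleanly separates the textured subject, which is the classification step the abstract describes.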

Paper Nr: 174
Title:

CAR-DCGAN: A Deep Convolutional Generative Adversarial Network for Compression Artifact Removal in Video Surveillance Systems

Authors:

Miloud Aqqa and Shishir K. Shah

Abstract: Video compression algorithms degrade frame quality due to their lossy approach to decreasing the required bandwidth, thereby reducing the quality of video available for automatic video analysis. The resulting artifacts may introduce undesired noise and complex structures, removing textures and high-frequency details in video frames. Moreover, they may degrade the performance of core applications in video surveillance systems such as object detectors. To remedy these quality distortions, high-quality videos must be restored from their low-quality counterparts, without any changes to the existing compression pipelines, through a complicated nonlinear 2D transformation. To this end, we devise a fully convolutional residual network for compression artifact removal (CAR-DCGAN), optimized in a patch-based generative adversarial network (GAN) approach. We show that our model is capable of restoring frames corrupted with complex and unknown distortions with more realistic details than existing methods. Furthermore, we show that CAR-DCGAN can be applied as a pre-processing step for the object detection task in video surveillance systems.

Paper Nr: 184
Title:

Data-set for Event-based Optical Flow Evaluation in Robotics Applications

Authors:

Mahmoud Z. Khairallah, Fabien Bonardi, David Roussel and Samia Bouchafa

Abstract: Event-based cameras (also known as Dynamic Vision Sensors, DVS) have been used extensively in robotics during the last ten years and have proved able to solve many problems encountered in this domain. Their technology is very different from that of conventional cameras, which requires rethinking the existing paradigms and reviewing all the classical image processing and computer vision algorithms. We show in this paper how event-based cameras are naturally suited to estimating scene gradients, and hence the visual flow, on the fly. Our work starts with a complete study of the existing event-based optical flow algorithms that are suitable for integration into real-time robotics applications. Then, we provide a dataset that includes different scenarios along with visual flow ground truth. Finally, we propose an evaluation of existing event-based visual flow algorithms using the proposed ground-truth dataset.

Paper Nr: 186
Title:

Video Action Classification through Graph Convolutional Networks

Authors:

Felipe F. Costa, Priscila M. Saito and Pedro H. Bugatti

Abstract: Video classification methods have been evolving through proposals based on end-to-end deep learning architectures. Several works have shown that end-to-end models are effective for learning intrinsic video features, especially when compared to handcrafted ones. In general, convolutional neural networks are used for deep learning on videos. However, when applied in such contexts, these vanilla deep learning networks cannot identify variations based on temporal information. To do so, memory-based cells (e.g. long short-term memory) or even optical flow techniques are used in conjunction with the convolutional process. However, despite their effectiveness, those methods neglect global analysis, processing only a small number of frames in each batch during the learning and inference process. Moreover, they completely ignore the semantic relationship between different videos that belong to the same context. The present work aims to fill these gaps by using information grouping concepts and contextual detection through graph-based convolutional neural networks. The experiments show that our method achieves up to 87% accuracy on a well-known public video dataset.

Paper Nr: 188
Title:

Grocery Recognition in the Wild: A New Mining Strategy for Metric Learning

Authors:

Marco Filax, Tim Gonschorek and Frank Ortmeier

Abstract: Recognizing grocery products at scale is an open issue for computer vision systems due to the products' subtle visual differences. Typically, the problem is addressed as a classification problem, e.g., by learning a CNN, for which all classes to be distinguished need to be known at training time. We observe, however, that the products within stores change over time: sometimes new products are put on shelves, or the appearances of existing products change. In this work, we demonstrate the use of deep metric learning for grocery recognition, whereby classes encountered during inference are unknown at training time. We also propose a new triplet mining strategy that uses all known classes during training while preserving the ability to perform cross-folded validation. We demonstrate the applicability of the proposed mining strategy using different, publicly available real-world grocery datasets. The proposed approach preserves the ability to distinguish previously unseen groceries while increasing the precision by up to 5 percent.

Paper Nr: 208
Title:

Embedding Anatomical Characteristics in 3D Models of Lower-limb Sockets through Statistical Shape Modelling

Authors:

Ana Costa, Daniel Rodrigues, Marina Castro, Sofia Assis and Hélder P. Oliveira

Abstract: Lower limb amputation is a condition affecting millions of people worldwide. Patients are often prescribed lower limb prostheses to aid their mobility, but these prostheses require frequent adjustments through an iterative and manual process, which depends heavily on patient feedback and on the prosthetist’s experience. New computer-aided design and manufacturing technologies have been emerging as a way to improve the fitting process by creating virtual socket models. Statistical shape modelling was used to create 3D models of transtibial (TT) and transfemoral (TF) sockets. Their generalization errors were, respectively, 6.8 ± 1.8 mm and 10.5 ± 1.6 mm, while the specificity errors were 9.7 ± 0.6 mm and 9.8 ± 0.2 mm. In both models, a visual analysis showed that biomechanically meaningful features were captured: the largest variations found for both socket types were in the length of the residual limb and in the perimeter variation along the limb. The results obtained show that statistical shape modelling methods can be applied to TF and TT sockets, with several potential applications in the orthoprosthetic field: generation of new plausible shapes and on-demand socket design adjustments.
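
Statistical shape models of this kind are typically built by PCA on aligned shape vectors: the retained eigenvectors are the modes of variation, and new plausible shapes are generated as the mean plus bounded combinations of those modes. A NumPy sketch on synthetic data (the shape dimensionality and mode count are illustrative, not the socket models'):

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy training set: n aligned shapes, each flattened to a 3*m coordinate vector.
n, m = 30, 50
base = rng.normal(size=3 * m)
modes_true = rng.normal(size=(2, 3 * m))     # two hidden modes of variation
coeffs = rng.normal(size=(n, 2))
shapes = base + coeffs @ modes_true + 0.01 * rng.normal(size=(n, 3 * m))

mean = shapes.mean(axis=0)
U, S, Vt = np.linalg.svd(shapes - mean, full_matrices=False)
eigvals = S ** 2 / (n - 1)                   # variance explained by each mode
k = 2
P = Vt[:k]                                   # retained modes of variation

# Generate a new plausible shape: mean + b @ P, with |b_i| <= 3*sqrt(lambda_i)
b = rng.uniform(-1, 1, size=k) * 3 * np.sqrt(eigvals[:k])
new_shape = mean + b @ P
```

Generalization and specificity errors, as reported in the abstract, would then be measured by reconstructing held-out shapes with the model and by comparing generated shapes against the training set, respectively.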

Paper Nr: 209
Title:

Embedded Features for 1D CNN-based Action Recognition on Depth Maps

Authors:

Jacek Trelinski and Bogdan Kwolek

Abstract: In this paper we present an algorithm for human action recognition using only depth maps. A convolutional autoencoder and a Siamese neural network are trained to learn embedded features encapsulating the content of single depth maps. Afterwards, statistical features and multichannel 1D CNN features are extracted from multivariate time-series of such embedded features to represent actions in depth map sequences. Action recognition is achieved by voting in an ensemble of one-vs-all weak classifiers. We demonstrate experimentally that the proposed algorithm achieves competitive results on the UTD-MHAD dataset and outperforms the best algorithms by a large margin on the 3D Human-Object Interaction Set (SYSU 3DHOI).

Paper Nr: 222
Title:

3D Reconstruction of Deformable Objects from RGB-D Cameras: An Omnidirectional Inward-facing Multi-camera System

Authors:

Eva Curto and Helder Araujo

Abstract: This paper describes a system made up of several inward-facing cameras able to reconstruct deformable objects through synchronous acquisition of RGB-D data. The configuration of the camera system allows the acquisition of 3D omnidirectional images of the objects. The paper describes the structure of the system as well as an approach for the extrinsic calibration, which allows the estimation of the coordinate transformations between the cameras. Reconstruction results are also presented.

Paper Nr: 228
Title:

Processing Attribute Profiles as Scale-series for Remote Sensing Image Classification

Authors:

Melike Ilteralp and Erchan Aptoula

Abstract: Attribute profiles (APs) are among the most prominent “shallow” spatial-spectral pixel description methods, providing multi-scale, flexible and efficient pixel descriptions, even with modest amounts of training data. In this paper, we investigate their collaboration with long short-term memory networks (LSTMs). Our hypothesis is that a profile can be viewed as a “scale-series” whose sequential nature LSTMs can exploit, akin to temporal series. Furthermore, feeding a deep network with input that already has strong descriptive potential (such as APs) can help it produce advanced features more efficiently than training from scratch. Moreover, contrary to the state of the art, we report the results of experiments conducted with non-overlapping training and testing sets, highlighting a significant performance boost from the combined use of APs with LSTMs.

Paper Nr: 236
Title:

Rapid Light Flash Localization in SWIR using Compressed Sensing

Authors:

Andreas Brorsson, Carl Brännlund, David Bergström and David Gustafsson

Abstract: A high-resolution single pixel camera for long range imaging in the short-wave infrared has been evaluated for the detection and localization of transient light flashes. The single pixel camera is based on an InGaAs photodiode with a digital micromirror device operating as a coded aperture. Images are reconstructed using compressed sensing theory, with Walsh-Hadamard pseudo-random measurement matrices and fast Walsh-Hadamard transform for localization. Our results from experiments with light flashes are presented and the potential use of the camera for muzzle flash detection and localization is discussed.

Paper Nr: 238
Title:

Multi-layer Feature Fusion and Selection from Convolutional Neural Networks for Texture Classification

Authors:

Hajer Fradi, Anis Fradi and Jean-Luc Dugelay

Abstract: Deep feature representations in Convolutional Neural Networks (CNNs) can act as a set of feature extractors. However, since CNN architectures embed different representations at different abstraction levels, it is not trivial to choose the most relevant layers for a given classification task. For instance, for texture classification, low-level patterns and fine details from intermediate layers can be more relevant than the high-level semantic information from top layers (commonly used for generic classification). In this paper, we address this problem by aggregating CNN activations from different convolutional layers and encoding them into a single feature vector after applying a pooling operation. The proposed approach also involves a feature selection step. This process is favorable for classification accuracy, since the influence of irrelevant features is minimized and the final dimension is reduced. The extracted and selected features from multiple layers are also more manageable for a classifier. The proposed approach is evaluated on three challenging datasets, and the results demonstrate the effectiveness of selecting and fusing multi-layer features for the texture classification problem. Furthermore, through comparisons to other existing methods, we demonstrate that the proposed approach outperforms the state-of-the-art methods by a significant margin.
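
The fuse-then-select pipeline described above can be sketched in NumPy: pool each layer's activations into a vector, concatenate, then keep a subset of dimensions by some relevance criterion. The layer shapes, the variance-based selection criterion, and the kept dimension count are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical activations from three convolutional layers, each (C, H, W).
layers = [rng.normal(size=(16, 32, 32)),
          rng.normal(size=(32, 16, 16)),
          rng.normal(size=(64, 8, 8))]

# Fusion: global average pool each layer, then concatenate into one vector.
fused = np.concatenate([a.mean(axis=(1, 2)) for a in layers])  # (16+32+64,) = (112,)

# Selection (illustrative variance criterion over a batch of such vectors):
batch = np.stack([fused + 0.1 * rng.normal(size=fused.shape) for _ in range(20)])
var = batch.var(axis=0)
keep = np.argsort(var)[-64:]        # retain the 64 highest-variance dimensions
selected = batch[:, keep]           # compact descriptors, one row per image
```

The resulting compact vectors are what a downstream classifier (e.g. an SVM) would consume; any supervised selection criterion could replace the variance ranking used here.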

Paper Nr: 19
Title:

Image-based Road Marking Classification and Vector Data Derivation from Mobile Mapping 3D Point Clouds

Authors:

Johannes Wolf, Tobias Pietz, Rico Richter, Sören Discher and Jürgen Döllner

Abstract: Capturing urban areas and infrastructure for automated analysis processes becomes ever more important. Laser scanning and photogrammetry are used for scanning the environment at highly detailed resolution. In this work, we present techniques for the semantic classification of 3D point clouds from mobile mapping scans of road environments and for the detection of road markings. The approach renders the 3D point cloud input data into images, for which U-Net, an established convolutional neural network for image recognition, is used for the semantic classification. The results of the classification are projected back into the 3D point cloud. Automated extraction of vector data is applied to the detected road markings, generating detailed road marking maps. Different approaches to vector data generation are used depending on the type of road marking, such as arrows or dashed lines. The shape files automatically created by the presented process can be further used in various GIS applications. Our results with the implemented out-of-core techniques show that the approach can be applied efficiently to large datasets covering entire cities.

Paper Nr: 20
Title:

Generic User-guided Interaction Paradigm for Precise Post-slice-wise Processing of Tomographic Deep Learning Segmentations Utilizing Graph Cut and Graph Segmentation

Authors:

Gerald A. Zwettler, Werner Backfrieder, Ronald A. Karwoski and David H. Iii

Abstract: State-of-the-art deep learning (DL) has manifested in image processing as an accurate segmentation method. Nevertheless, its black-box nature hardly allows for user intervention. In this paper, we present a generic graph cut (GC) and graph segmentation (GS) approach for user-guided interactive post-processing of segmentations resulting from DL. The GC fitness function incorporates both the original image characteristics and the DL segmentation results, combining them with weights optimized by evolution strategy optimization. To allow for accurate user-guided processing, the fore- and background seeds of the graph cut are automatically selected from the DL segmentations, while effective features are implemented for expert adaptation of position and topology. The seamless integration of DL with GC/GS leads to a marginal trade-off in quality, namely a Jaccard index (JI) decrease of 1.3% for automated GC and of 0.46% for GS only. Yet, in specific areas where even a well-trained DL model may fail, precise adaptations at a low demand for user interaction become feasible, thus even outperforming the original DL results. The potential of GC/GS is shown by running on ground-truth seeds, thereby outperforming DL by 0.44% JI for GC and even by 1.16% JI for GS. Iterative slice-by-slice progression of the post-processed and improved results keeps the demand for user interaction low.

Paper Nr: 23
Title:

Relocation with Coverage and Intersection over Union Loss for Target Matching

Authors:

Zejin Lu, Jinqi Liao, Jiyang Lv and Fengjun Chen

Abstract: Target matching is a common task in the field of computer vision, with a wide range of applications in target tracking, medical image analysis, robot navigation, etc. The tasks in these scenarios have high requirements for locating accuracy, reliability and robustness, which the existing methods cannot meet. To improve algorithm performance in these respects, a novel practical target matching framework is proposed in this paper. We first present a new bounding box regression metric called Coverage-Intersection over Union (Co-IoU) to obtain higher positioning accuracy than previous bounding box regression strategies. In addition, a region validation and filtering strategy is proposed to reduce false positive matches, and a Region of Interest (ROI) adjustment and relocation matching strategy is introduced to acquire higher locating accuracy. Our experiments show that the proposed framework is more robust, accurate and reliable than previous relevant algorithms. Besides, the Coverage-Intersection over Union loss and relocation strategy proposed in this paper can also significantly improve the performance of a general object detector.
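
The abstract does not give the Co-IoU formula, but the standard IoU it extends is worth stating, since all IoU-family regression metrics build on the same overlap-over-union ratio. A plain-Python sketch of that base metric:

```python
def iou(box_a, box_b):
    """Standard Intersection over Union for axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

score = iou((0, 0, 4, 4), (2, 2, 6, 6))   # overlap 2x2 = 4, union 16 + 16 - 4 = 28
```

Co-IoU augments this ratio with a coverage term; its exact definition is given in the paper itself and is not reproduced here.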

Paper Nr: 40
Title:

Modular Facial Expression Recognition using Double Channel DNNs

Authors:

Sujata and Suman K. Mitra

Abstract: Recognizing human expressions is an important task for machines seeking to understand emotional changes in humans. However, the accurate features that are closely linked to changes in expression are difficult to extract, due to the influence of individual differences and variations in emotional intensity. The modular approach presented here imitates the human ability to identify a person from a limited part of the face. In this article, we demonstrate experimentally that certain parts of the face, such as the eyes, nose, lips and forehead, contribute more to the recognition of expressions. A combination of two deep neural networks is also proposed to extract the characteristics of the given facial images. Two preprocessing approaches are implemented, histogram equalization (to handle illumination) and data augmentation (to increase the number of facial images), and the regions used for recognition of the facial expression are restricted. A two-channel architecture is used for implementation: one channel accepts as input a grayscale face image, processed by VGG16_ft (fine-tuned VGG16), and the other channel accepts as input the histogram of second-order gradients (HSOG) of the face image, processed by the proposed CNN model, and extracts the corresponding characteristics. The characteristics from the two channels are then concatenated. The final recognition result is calculated using SVM and KNN classifiers. Experimental results indicate that the proposed algorithm is able to recognize the six basic facial expressions (happiness, sadness, anger, disgust, fear and surprise) with high precision. Fine-tuning with a well-trained model is effective for FER tasks when not enough samples have been collected.

Paper Nr: 87
Title:

Early Defect Detection in Conveyor Belts using Machine Vision

Authors:

Guilherme G. Netto, Bruno N. Coelho, Saul E. Delabrida, Amilton Sinatora, Héctor Azpúrua, Gustavo Pessin, Ricardo R. Oliveira and Andrea C. Bianchi

Abstract: Continuous belt monitoring is of utmost importance, since wear on the belt surface can develop into tears and even rupture. This can cause interruption of the conveyor and, consequently, loss of capital or, even worse, serious or fatal accidents. This paper proposes a laser-based machine vision method for detecting defects in conveyor belts to solve the monitoring problem. The approach transforms an image of a laser line into a one-dimensional signal, then analyzes it to detect defects, considering that variations in this signal are caused by defects/imperfections on the belt surface. Differently from previous works, the proposed method can identify a defect through a 2D reconstruction of it. The results reveal that the proposed method was capable of detecting superficial imperfections in simulated conveyor belt experiments, achieving high values in metrics such as precision and recall.
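The reduction of a laser-line image to a one-dimensional signal can be illustrated with per-column peak extraction followed by a moving-average deviation test. This is a sketch of the general idea only, not the authors' exact pipeline; the window and threshold values are placeholder assumptions:

```python
import numpy as np

def laser_line_profile(image):
    """Reduce a grayscale laser-line image to a 1-D signal:
    for each column, take the row index of the brightest pixel."""
    return np.argmax(image, axis=0).astype(float)

def flag_defects(profile, window=3, threshold=1.0):
    """Flag columns where the profile deviates from a moving-average baseline,
    i.e. where the belt surface is locally uneven."""
    baseline = np.convolve(profile, np.ones(window) / window, mode="same")
    return np.abs(profile - baseline) > threshold
```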

Paper Nr: 91
Title:

New Challenges of Face Detection in Paintings based on Deep Learning

Authors:

Siwar Bengamra, Olfa Mzoughi, André Bigand and Ezzeddine Zagrouba

Abstract: In this work, we address the problem of face detection in painting images in the Tenebrism style, a particular painting style characterized by the use of extreme contrast between light and dark. We use Convolutional Neural Networks (CNNs) to tackle this task. In this article, we show that face detection in paintings presents additional challenges compared to classic face detection in natural images. For this, we present a performance analysis of three CNN architectures, namely VGG16, ResNet50 and ResNet101, as backbone networks of one of the most popular CNN-based object detectors, Faster RCNN, to boost the face detection performance. This paper describes the collection and annotation of a benchmark dataset of Tenebrism paintings. In order to reduce the impact of dataset bias, we propose to evaluate the effect of several data augmentation techniques used to increase variability. Experimental results reveal a detection average precision of 44.19% with ResNet101, while better performances of 79.48% and 83.94% have been achieved with VGG16 and ResNet50, respectively.

Paper Nr: 109
Title:

Generation of Privacy-friendly Datasets of Latent Fingerprint Images using Generative Adversarial Networks

Authors:

Stefan Seidlitz, Kris Jürgens, Andrey Makrushin, Christian Kraetzer and Jana Dittmann

Abstract: The restrictions posed by the recent trans-border regulations on the usage of biometric data force researchers in the fields of digitized forensics and biometrics to use synthetic data for the development and evaluation of new algorithms. For digitized forensics, we introduce a technique for the conversion of privacy-sensitive datasets of real latent fingerprints into "privacy-friendly" datasets of synthesized fingerprints. Privacy-friendly means in our context that the generated fingerprint images cannot be linked to a particular person who provided fingerprints to the original dataset. In contrast to the standard fingerprint generation approach that makes use of mathematical modeling for drawing ridge-line patterns, we propose applying a data-driven approach making use of generative adversarial networks (GANs). In our synthesis experiments, the performance of three established GAN architectures is examined. The NIST Special Database 27 is used as an exemplary data source of real latent fingerprints. The set of training images is augmented by applying filters from the StirTrace benchmarking tool. The suitability of the generated fingerprint images is checked with the NIST fingerprint image quality tool (NFIQ2). The unlinkability to any original fingerprint is established by evaluating outcomes of the NIST fingerprint matching tool.

Paper Nr: 112
Title:

An Adversarial Training based Framework for Depth Domain Adaptation

Authors:

Jigyasa S. Katrolia, Lars Krämer, Jason Rambach, Bruno Mirbach and Didier Stricker

Abstract: In the absence of sufficient labeled training data, it is common practice to resort to synthetic data with readily available annotations. However, some performance gap still exists between deep learning models trained on synthetic versus real data. Using adversarial training based generative models, it is possible to translate images from the synthetic to the real domain and to train on them models that generalize easily to real-world datasets. However, the efficiency of this method is limited in the presence of large domain shifts, such as between synthetic and real depth images, the latter characterized by depth-sensor and scene-dependent artifacts. In this paper, we present an adversarial training based framework for adapting depth images from the synthetic to the real domain. We use a cyclic loss together with an adversarial loss to bring the two domains of synthetic and real depth images closer by translating synthetic images into the real domain, and demonstrate the usefulness of synthetic images modified this way for training deep neural networks that can perform well on real images. We demonstrate our method for the application of person detection and segmentation in real depth images captured in a car for in-cabin person monitoring. We also show through experiments the effect of using target domain image sets captured with different types of depth sensors on this domain adaptation approach.

Paper Nr: 116
Title:

Intel RealSense SR305, D415 and L515: Experimental Evaluation and Comparison of Depth Estimation

Authors:

Francisco Lourenço and Helder Araujo

Abstract: In the last few years, Intel has launched several low-cost RGB-D cameras. Three of these cameras are the SR305, the D415, and the L515. These three cameras are based on different operating principles. The SR305 is based on structured light projection, the D415 is based on stereo, also using the projection of random dots, and the L515 is based on LIDAR. In addition, they all provide RGB images. In this paper, we perform an experimental analysis and comparison of the depth estimation by the three cameras.

Paper Nr: 157
Title:

Imaging Reality and Abstraction: An Exploration of Natural and Symbolic Patterns

Authors:

Alexandra B. Albu and George Nagy

Abstract: Understanding visual symbols is a strictly human skill, as opposed to comprehending natural scenes—which is an essential survival skill, common to many species. As an illustration of the natural vs. symbolic dichotomy, selective features are computed for differentiating a satellite photograph from a map of the same geographical region. Images of physical scenes/objects are currently captured in all parts of the electromagnetic spectrum. Symbols, whether produced by man or machine, are almost always imaged in the visible range. Although natural and symbolic images differ in many ways, there is no universal set of differentiating characteristics. With respect to the traditional branches of pattern recognition, it is tempting to suggest that statistical, neural network and genetic/evolutionary pattern recognition methods are eminently suitable for images of scenes and simple symbols, whereas structural and syntactic approaches are best for more complex, composite graphical symbols.

Paper Nr: 171
Title:

Unsupervised Segmentation of Leukocytes Images using Particle Swarm

Authors:

Jocival D. Dias Júnior and André R. Backes

Abstract: Blood smear image analysis is an essential task for many health-related issues. Among the many blood structures present in these images, leukocytes play an important role in the detection of many diseases (such as leukemias), which can be detected by the amount, or abnormal aspect, of the leukocytes. To address this problem, this paper presents an unsupervised segmentation method for the nuclear structures in leukocytes. Our method uses color deconvolution to separate the dyes into different channels and a PSO algorithm to estimate an optimal kernel filter that combines local features from the different stain channels to emphasize the leukocyte structures, so that simple thresholding techniques are able to perform the image segmentation. We also used a postprocessing approach based on morphological operators to refine the borders of detected structures, thus improving our performance. We performed a comparison with different approaches found in the literature using 367 images containing leukocytes and other blood structures, and the results demonstrated the superiority of our approach in terms of the Jaccard index.
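The "simple thresholding techniques" mentioned above can be illustrated with Otsu's method, which picks the threshold maximizing between-class variance. This is a generic sketch, independent of the paper's PSO-optimized kernel:

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's threshold for an 8-bit grayscale image:
    choose t maximizing the between-class variance of the two classes."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = float(np.dot(np.arange(256), hist))
    w0 = cum = 0.0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]
        cum += t * hist[t]
        if w0 == 0 or w0 == total:
            continue
        mu0 = cum / w0                        # mean of class below/at t
        mu1 = (sum_all - cum) / (total - w0)  # mean of class above t
        var = w0 * (total - w0) * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```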

Paper Nr: 172
Title:

Non-linear Distortion Recognition in UAVs’ Images using Deep Learning

Authors:

Leandro P. Silva, Jocival D. Dias Jr, Jean B. Santos, João F. Mari, Maurício C. Escarpinati and André R. Backes

Abstract: Unmanned Aerial Vehicles (UAVs) have increasingly been used as tools in many tasks of Precision Agriculture (PA). Due to the particular characteristics of the flight and the UAV equipment, several challenges need to be addressed, such as the presence of non-linear deformations in the captured images. These deformations impair the image registration process, so they must be identified to be properly corrected. In this paper, we propose a Convolutional Neural Network (CNN) architecture to classify whether or not a given image has non-linear deformation. We compared our approach with 4 traditional CNNs, and the results show that our model achieves an accuracy similar to the compared CNNs, but with a considerably lower computational cost, which could enable its use at flight time in a system embedded in the UAV.

Paper Nr: 179
Title:

Asset Detection in Railroad Environments using Deep Learning-based Scanline Analysis

Authors:

Johannes Wolf, Rico Richter and Jürgen Döllner

Abstract: This work presents an approach for the automated detection of railroad assets in 3D point clouds from mobile mapping LiDAR scans using established convolutional neural networks for image analysis. It describes how images of individual scan lines can be generated from 3D point clouds. In these scan lines, objects such as tracks, signal posts, and axle counters can be detected using artificial neural networks for image analysis, previously trained on ground-truth data. The recognition results can then be transferred back to the 3D point cloud as a semantic classification result, or they can be used to generate geometry or map data for further processing in GIS applications. Using this approach, trained objects can be found with a high degree of automation. Challenges such as varying point density, different data characteristics of scanning devices, and the massive amount of data can be overcome with this approach.

Paper Nr: 180
Title:

As Plain as the Nose on Your Face?

Authors:

Peter C. Varley, Stefania Cristina, Alexandra Bonnici and Kenneth P. Camilleri

Abstract: We present an investigation into locating nose tips in 2D images of human faces. Our objective is conference-room gaze-tracking, in which a presenter can control a presentation or demonstration by gaze from a distance in the range 2m to 10m. As a first step towards this, we here consider faces in the range 150cm to 300cm. Head pose is the major contributing component of gaze direction, and nose tip position within the image of the face is a strong clue to head pose. To facilitate detection of nose tips, we have implemented a combination of two Haar cascades (one for frontal noses and one for profile noses) with a lower failure rate than existing cascades, and we have examined a number of "hand-crafted ferns" for their potential to locate the nose tip within the nose-like regions returned by our Haar cascades.

Paper Nr: 194
Title:

Segmentation of Agricultural Images using Vegetation Indices

Authors:

Jean B. Santos, Jocival D. Dias Junior, André R. Backes and Maurício C. Escarpinati

Abstract: Identifying and segmenting plants from the background in agricultural images is of great importance for precision agriculture. It serves as a basis for several tasks such as identification of planting lines, identification of weed plants, agricultural automation, among others. Given this importance, in this paper, we evaluated the application of five vegetation indices for RGB images together with two binarization techniques for the plant/background segmentation process. The results showed promising performance in all evaluated indices. It was also possible to identify a relationship between the performance obtained in each index and the capture conditions in each dataset.
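The five indices evaluated are not named in the abstract; one widely used RGB vegetation index, Excess Green (ExG = 2g − r − b on chromaticity-normalised channels), can be sketched like this:

```python
import numpy as np

def excess_green(rgb):
    """Excess Green index: positive values indicate likely vegetation pixels."""
    rgb = rgb.astype(float)
    total = rgb.sum(axis=2)
    total[total == 0] = 1.0                # avoid division by zero on black pixels
    r, g, b = (rgb[..., i] / total for i in range(3))
    return 2 * g - r - b                   # threshold (e.g. > 0) for a plant mask
```

Binarizing the index map, e.g. with Otsu's method or a fixed threshold at zero, then yields the plant/background segmentation described above.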

Paper Nr: 195
Title:

Single Image Super-resolution using Vectorization and Texture Synthesis

Authors:

Kaoning Hu, Dongeun Lee and Tianyang Wang

Abstract: Image super-resolution is a very useful tool in science and art. In this paper, we propose a novel method for single image super-resolution that combines image vectorization and texture synthesis. Image vectorization is the conversion from a raster image to a vector image. While image vectorization algorithms can trace the fine edges of images, they sacrifice color and texture information. In contrast, texture synthesis techniques, which have previously been used in image super-resolution, can reasonably create high-resolution color and texture information, except that they sometimes fail to trace the edges of images correctly. In this work, we apply image vectorization to the edges of the original image, and texture synthesis based on the Kolmogorov–Smirnov test (KS test) to the non-edge regions. The goal is to generate a plausible, visually pleasing, detailed higher-resolution version of the original image. In particular, our method works very well on images of natural animals.
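The two-sample KS statistic underlying such a texture-synthesis step compares empirical distributions. A minimal NumPy sketch follows; selecting a patch by smallest KS distance is our illustrative assumption, not the paper's full algorithm:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic:
    the maximum gap between the two empirical CDFs."""
    a, b = np.sort(a.ravel()), np.sort(b.ravel())
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

def best_matching_patch(target, candidates):
    """Index of the candidate patch whose intensity distribution
    is closest to the target's under the KS statistic."""
    return int(np.argmin([ks_statistic(target, c) for c in candidates]))
```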

Paper Nr: 196
Title:

Cell Image Segmentation by Feature Random Enhancement Module

Authors:

Takamasa Ando and Kazuhiro Hotta

Abstract: It is important to extract good features using an encoder to realize semantic segmentation with high accuracy. Although the loss function is optimized when training a deep neural network, layers far from those computing the loss function are difficult to train. Skip connections are effective against this problem, but some layers still remain far from the loss function. In this paper, we propose the Feature Random Enhancement Module, which enhances features randomly during training only. By emphasizing the features at layers far from the loss function, we can train those layers well, and the accuracy is improved. In experiments, we evaluated the proposed module on two kinds of cell image datasets, and our module improved the segmentation accuracy without increasing the computational cost in the test phase.
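The abstract does not give the exact enhancement rule; the idea of a training-only random feature boost (identity at test time, hence no extra inference cost) could be sketched as follows, with the scale range as a placeholder assumption:

```python
import numpy as np

def random_feature_enhancement(features, low=1.0, high=2.0, training=True, rng=None):
    """Randomly scale each feature channel during training only;
    at test time the module is the identity, adding no inference cost."""
    if not training:
        return features
    rng = rng or np.random.default_rng()
    # One random amplification factor per channel (features: C x H x W).
    scale = rng.uniform(low, high, size=(features.shape[0], 1, 1))
    return features * scale
```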

Paper Nr: 227
Title:

OFFSED: Off-Road Semantic Segmentation Dataset

Authors:

Peter Neigel, Jason Rambach and Didier Stricker

Abstract: Over the last decade, improvements in neural networks have facilitated substantial advancements in automated driver assistance systems. In order to navigate their surroundings reliably and autonomously, self-driving vehicles need to be able to infer semantic information about the environment. Large parts of the research corpus focus on private passenger cars and cargo trucks, which share the common environment of paved roads, highways and cities. Industrial vehicles like tractors or excavators, however, make up a substantial share of the total number of motorized vehicles globally while operating in fundamentally different environments. In this paper, we present an extension to our previous Off-Road Pedestrian Detection Dataset (OPEDD) that extends the ground truth data of 203 images to full image semantic segmentation masks which assign one of 19 classes to every pixel. The selection of images was done in a way that captures the whole range of environments and human poses depicted in the original dataset. In addition to pixel labels, a few selected countable classes also come with instance identifiers. This allows for the use of the dataset in instance and panoptic segmentation tasks.

Area 2 - Mobile and Egocentric Vision for Humans and Robots

Full Papers
Paper Nr: 7
Title:

A Surface and Appearance-based Next Best View System for Active Object Recognition

Authors:

Pourya Hoseini, Shuvo K. Paul, Mircea Nicolescu and Monica Nicolescu

Abstract: Active vision represents a set of techniques that attempt to incorporate new visual data by employing camera motion. Object recognition is one of the main areas where active vision can be particularly beneficial. In cases where recognition is uncertain, new perspectives of an object can help in improving the quality of observation and potentially the recognition. A key question, however, is from where to look at the object. Current approaches mostly consider creating an occupancy grid of known object voxels or imagining the entire object shape and appearance to determine the next camera pose. Another current trend is to show every possible object view to the vision system during the training time. These methods typically require multiple observations or considerable training data and time to effectively function. In this paper, a next best view system is proposed that takes into account only the initial surface shape and appearance of the object, and subsequently determines the next camera pose. Therefore, it is a single-shot method without the need to have any specifically made dataset for the training. Experimental validations prove the feasibility of the proposed method in finding good viewpoints while showing significant improvements in recognition performance.

Paper Nr: 51
Title:

Multi-view Planarity Constraints for Skyline Estimation from UAV Images in City Scale Urban Environments

Authors:

Ayyappa S. Thatavarthy, Tanu Sharma, Harshit Sankhla, Mukul Khanna and K. M. Krishna

Abstract: It is critical for aerial robots flying in city-scale urban environments to make very quick estimates of a building's depth with respect to themselves. This should be done in a matter of a few views, so that the robot can navigate while avoiding collisions with such towering structures. To the best of our knowledge, this problem has not been attacked before. We bring together several modules combining deep learning and 3D vision to showcase a quick reconstruction in a few views. We exploit the inherent planar structure in the buildings (facades, windows) for this purpose. We evaluate the efficacy of our pipeline with various constraints and errors from multi-view geometry using ablation studies. We then retrieve the skyline of the buildings in synthetic as well as real-world scenes.
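Exploiting planar structure typically starts from a least-squares plane fit; a standard SVD-based sketch (not the authors' specific multi-view constraint formulation) is:

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through an (N, 3) point set.
    Returns (unit normal, centroid); the normal is the right singular
    vector of the centred points with the smallest singular value."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return vt[-1], centroid
```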

Paper Nr: 56
Title:

Boosting Self-localization with Graph Convolutional Neural Networks

Authors:

Takeda Koji and Tanaka Kanji

Abstract: Scene graph representation has recently merited attention for being flexible and descriptive where visual robot self-localization is concerned. In a typical self-localization application, the objects, object features and object relationships of the environment map are projected as nodes, node features and edges, respectively, onto the scene graph and subsequently mapped to a query scene graph using a graph matching engine. However, the computational, storage, and communication overhead costs of such a system are directly proportional to the number of feature dimensionalities of the graph nodes, often significant in large-scale applications. In this study, we demonstrate the feasibility of a graph convolutional neural network (GCN) to train and predict alongside a graph matching engine. However, visual features do not often translate well into graph features in modern graph convolution models, thereby affecting their performance. Therefore, we developed a novel knowledge transfer framework that introduces an arbitrary self-localization model as the teacher to train the GCN-based self-localization system, i.e., the student. The framework additionally facilitated lightweight storage and communication by formulating the compact output signals from the teacher model as training data. Results on the Oxford RobotCar datasets reveal that the proposed method outperforms existing comparative methods and teacher self-localization systems.

Paper Nr: 119
Title:

Practical Auto-calibration for Spatial Scene-understanding from Crowdsourced Dashcamera Videos

Authors:

Hemang Chawla, Matti Jukola, Shabbir Marzban, Elahe Arani and Bahram Zonooz

Abstract: Spatial scene-understanding, including dense depth and ego-motion estimation, is an important problem in computer vision for autonomous vehicles and advanced driver assistance systems. Thus, it is beneficial to design perception modules that can utilize crowdsourced videos collected from arbitrary vehicular onboard or dashboard cameras. However, the intrinsic parameters corresponding to such cameras are often unknown or change over time. Typical manual calibration approaches require objects such as a chessboard or additional scene-specific information. On the other hand, automatic camera calibration does not have such requirements. Yet, the automatic calibration of dashboard cameras is challenging, as forward and planar navigation results in critical motion sequences with reconstruction ambiguities. Structure reconstruction of complete visual sequences that may contain tens of thousands of images is also computationally untenable. Here, we propose a system for practical monocular onboard camera auto-calibration from crowdsourced videos. We show the effectiveness of our proposed system on the KITTI raw, Oxford RobotCar, and the crowdsourced D2-City datasets in varying conditions. Finally, we demonstrate its application for accurate monocular dense depth and ego-motion estimation on uncalibrated videos.

Paper Nr: 234
Title:

Dual CNN-based Face Tracking Algorithm for an Automated Infant Monitoring System

Authors:

Cheng Li, Genyu Song, A. Pourtaherian and P. H. N. de With

Abstract: Face tracking is important for designing a surveillance system when facial features are used as main descriptors. In this paper, we propose an on-line updating face tracking method, which is not only suitable for specific tasks, such as infant monitoring, but also a generic human-machine interaction application where face recognition is required. The tracking method is based on combining the architecture of the GOTURN and YOLO tiny face detector, which enables the tracking model to be updated over time. Tracking of objects is realized by analyzing two neighboring frames through a deep neural network. On-line updating is achieved by comparing the tracking result and face detection obtained from the YOLO tiny face detector. The experimental results have shown that our proposed tracker achieves an AUC of 97.9% for precision plot and an AUC of 91.8% for success plot, which outperforms other state-of-the-art tracking methods when used in the infant monitoring application.

Short Papers
Paper Nr: 50
Title:

Sewer Defect Classification using Synthetic Point Clouds

Authors:

Joakim B. Haurum, Moaaz J. Allahham, Mathias S. Lynge, Kasper S. Henriksen, Ivan A. Nikolov and Thomas B. Moeslund

Abstract: Sewer pipes are currently manually inspected by trained inspectors, making the process prone to human errors, which can be potentially critical. There is therefore great research and industry interest in automating the sewer inspection process. Previous research has focused on working with 2D image data, similar to how inspections are currently conducted. There is, however, a clear potential for utilizing recent advances within 3D computer vision for this task. In this paper, we investigate the feasibility of applying two modern deep learning methods, DGCNN and PointNet, on a new publicly available sewer point cloud dataset. As point cloud data from real sewers is scarce, we investigate using synthetic data to bootstrap the training process. We investigate four data scenarios, and find that training on synthetic data and fine-tuning on real data gives the best results, increasing the metrics by 6-10 percentage points for the best model. Data and code are available at https://bitbucket.org/aauvap/sewer3dclassification.

Paper Nr: 64
Title:

Learning to Correct Reconstructions from Multiple Views

Authors:

Ștefan Săftescu and Paul Newman

Abstract: This paper is about reducing the cost of building good large-scale reconstructions post-hoc. This is an important consideration for survey vehicles which are equipped with sensors that offer mixed fidelity or are restricted by road rules to high-speed traversals. We render 2D views of an existing, lower-quality reconstruction and train a convolutional neural network (CNN) that refines inverse-depth to match a higher-quality reconstruction. Since the views that we correct are rendered from the same reconstruction, they share the same geometry, so overlapping views complement each other. We impose a loss during training which guides predictions on neighbouring views to have the same geometry, which has been shown to improve performance. In contrast to previous work, which corrects each view independently, we also make predictions on sets of neighbouring views jointly. This is achieved by warping feature maps between views, thus bypassing memory-intensive computation. We make the observation that features in the feature maps are viewpoint-dependent, and propose a method for transforming features with dynamic filters generated by a multi-layer perceptron from the relative poses between views. In our experiments we show that this last step is necessary for successfully fusing feature maps between views.

Paper Nr: 66
Title:

Integration of Multiple RGB-D Data of a Deformed Clothing Item into Its Canonical Shape

Authors:

Yasuyo Kita, Ichiro Matsuda and Nobuyuki Kita

Abstract: To recognize a clothing item so that it can be handled automatically, we propose a method that integrates multiple partial views of the item into its canonical shape, that is, the shape when it is flattened on a planar table. When a clothing item is held by a robot hand, only part of the deformed item can be seen in one observation, which makes recognition of the item very difficult. To remove the effect of deformation, we first virtually flatten the deformed clothing surface based on the geodesic distances between surface points, which equal their two-dimensional distances when the surface is flattened on a plane. The integration of multiple views is performed on this flattened image plane by aligning flattened views obtained from different observations. Appropriate view directions for efficient integration are also determined automatically. Experimental results using both synthetic and real data demonstrate the effectiveness of the proposed method.

Paper Nr: 100
Title:

Visual-based Global Localization from Ceiling Images using Convolutional Neural Networks

Authors:

Philip Scales, Mykhailo Rimel and Olivier Aycard

Abstract: The problem of global localization consists in determining the position of a mobile robot inside its environment without any prior knowledge of its position. Existing approaches for indoor localization present drawbacks such as the need to prepare the environment, dependency on specific features of the environment, and high quality sensor and computing hardware requirements. We focus on ceiling-based localization that is usable in crowded areas and does not require expensive hardware. While the global goal of our research is to develop a complete robust global indoor localization framework for a wheeled mobile robot, in this paper we focus on one part of this framework – being able to determine a robot’s pose (2-DoF position plus orientation) from a single ceiling image. We use convolutional neural networks to learn the correspondence between a single image of the ceiling of the room, and the mobile robot’s pose. We conduct experiments in real-world indoor environments that are significantly larger than those used in state of the art learning-based 6-DoF pose estimation methods. In spite of the difference in environment size, our method yields comparable accuracy.

Paper Nr: 154
Title:

Unsupervised Gaze Prediction in Egocentric Videos by Energy-based Surprise Modeling

Authors:

Sathyanarayanan N. Aakur and Arunkumar Bagavathi

Abstract: Egocentric perception has grown rapidly with the advent of immersive computing devices. Human gaze prediction is an important problem in analyzing egocentric videos and has primarily been tackled through either saliency-based modeling or highly supervised learning. We quantitatively analyze the generalization capabilities of supervised, deep learning models on the egocentric gaze prediction task on unseen, out-of-domain data. We find that their performance is highly dependent on the training data and is restricted to the domains specified in the training annotations. In this work, we tackle the problem of jointly predicting human gaze points and temporal segmentation of egocentric videos without using any training data. We introduce an unsupervised computational model that draws inspiration from cognitive psychology models of event perception. We use Grenander's pattern theory formalism to represent spatial-temporal features and model surprise as a mechanism to predict gaze fixation points. Extensive evaluation on two publicly available datasets, GTEA and GTEA+, shows that the proposed model can significantly outperform all unsupervised baselines and some supervised gaze prediction baselines. Finally, we show that the model can also temporally segment egocentric videos with a performance comparable to more complex, fully supervised deep learning baselines.

Paper Nr: 176
Title:

Detecting Anomalies from Human Activities by an Autonomous Mobile Robot based on “Fast and Slow” Thinking

Authors:

Muhammad F. Fadjrimiratno, Yusuke Hatae, Tetsu Matsukawa and Einoshin Suzuki

Abstract: In this paper, we propose an anomaly detection method from human activities by an autonomous mobile robot which is based on "Fast and Slow Thinking". Our previous method employs deep captioning and detects anomalous image regions based on image visual features, caption features, and coordinate features. However, detecting anomalous image region pairs is a more challenging problem due to the larger number of candidates. Moreover, realizing reminiscence, which represents re-checking past, similar examples to cope with overlooking, is another challenge for a robot operating in real-time. Inspired by "Fast and Slow Thinking" from the dual process theory, we achieve detection of these kinds of anomalies in real-time onboard an autonomous mobile robot. Our method consists of a fast module which models caption-coordinate features to detect single-region anomalies, and a slow module which models image visual features and overlapping image regions to also detect neighboring-region anomalies. The reminiscence is triggered by the fast module as a result of its anomaly detection, and the slow module then searches for single-region anomalies in recent images. Experiments with a real robot platform show the superiority of our method over the baseline methods in terms of recall, precision, and AUC.

Paper Nr: 82
Title:

Fiducial Points-supported Object Pose Tracking on RGB Images via Particle Filtering with Heuristic Optimization

Authors:

Mateusz Majcher and Bogdan Kwolek

Abstract: We present an algorithm for tracking the 6D pose of an object in a sequence of RGB images. The images are acquired by a calibrated camera. A particle filter is utilized to estimate the posterior probability distribution of the object poses. The probabilistic observation model is built by projecting the 3D model onto the image and then matching the rendered object with the segmented object. It is determined using object silhouette and distance transform-based edge scores. A hypothesis about the 6D object pose that is calculated on the basis of object keypoints and the PnP algorithm is included in the probability distribution. A k-means++ algorithm is then executed on the multi-modal probability distribution to determine modes. A multi-swarm particle swarm optimization is executed afterwards to find the finest modes in the probability distribution together with the best pose. The object of interest is segmented by a U-Net neural network. Eight fiducial points of the object are determined by a neural network. A data generator employing 3D object models has been developed to synthesize photorealistic images with ground-truth data for training neural networks both for object segmentation and estimation of keypoints. The 6D object pose tracker has been evaluated on both synthetic and real images. We demonstrate experimentally that object pose hypotheses calculated on the basis of fiducial points and the PnP algorithm lead to considerable improvements in tracking accuracy.
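The particle filter machinery referenced above includes a resampling step; a generic systematic-resampling sketch (standard particle filter practice, not necessarily the authors' exact variant) is:

```python
import numpy as np

def systematic_resampling(weights, rng=None):
    """Draw len(weights) particle indices with probability proportional
    to weight, using a single random offset (systematic resampling)."""
    rng = rng or np.random.default_rng()
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n      # evenly spaced in [0, 1)
    cdf = np.cumsum(weights) / np.sum(weights)
    return np.searchsorted(cdf, positions, side="right")
```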

Paper Nr: 201
Title:

Unsupervised Domain Adaptation for 6DOF Indoor Localization

Authors:

Daniele Di Mauro, Antonino Furnari, Giovanni Signorello and Giovanni M. Farinella

Abstract: Visual Localization is gathering more and more attention in computer vision due to the spread of wearable cameras (e.g. smart glasses) and to the increase of general interest in autonomous vehicles and robots. Unfortunately, current localization algorithms rely on large amounts of labeled training data collected in the specific target environment in which the system needs to work. Data collection and labeling in this context is difficult and time-consuming. Moreover, the process has to be repeated when the system is adapted to a new environment. In this work, we consider a scenario in which the target environment has been scanned to obtain a 3D model of the scene suitable to generate large quantities of synthetic data automatically paired with localization labels. We hence investigate the use of Unsupervised Domain Adaptation techniques exploiting labeled synthetic data and unlabeled real data to train localization algorithms. To carry out the study, we introduce a new dataset composed of synthetic and real images labeled with their 6-DOF poses collected in four different indoor rooms which is available at https://iplab.dmi.unict.it/EGO-CH-LOC-UDA. A new method based on self-supervision and attention modules is hence proposed and tested on the proposed dataset. Results show that our method improves over baselines and state-of-the-art algorithms tackling similar domain adaptation tasks.

Area 3 - Image and Video Understanding

Full Papers
Paper Nr: 8
Title:

Clothing Parsing using Extended U-Net

Authors:

Gabriela Vozáriková, Richard Staňa and Gabriel Semanišin

Abstract: This paper focuses on the task of clothing parsing, a special case of the more general object segmentation task well known in the field of computer vision: each pixel is to be assigned to one of the clothing categories or to the background. Due to the complexity of the problem and the (until recently) lack of data, the performance of modern state-of-the-art clothing parsing models, expressed in terms of the mean Intersection over Union (IoU) metric, does not exceed 55%. In this paper, we propose a novel multitask network that extends the fully-convolutional neural network U-Net with two side branches – one solves a multilabel classification task and the other predicts bounding boxes of clothing instances. We trained this network using the large-scale iMaterialist dataset (Visipedia, 2019), which we refined. Compared to the well-performing segmentation architectures FPN, DeepLabV3, DeepLabV3+ and plain U-Net, our model achieves the best experimental results.

Paper Nr: 14
Title:

Lightweight SSD: Real-time Lightweight Single Shot Detector for Mobile Devices

Authors:

Shi Guo, Yang Liu, Yong Ni and Wei Ni

Abstract: Computer vision has a wide range of applications, and the demand for intelligent embedded terminals is increasing. However, most research on CNN (Convolutional Neural Network) detectors does not consider mobile devices' limited computational resources or specifically design networks for mobile devices. To achieve an efficient object detector for mobile devices, we propose a lightweight detector named Lightweight SSD. In the backbone part, we design our MBlitenet backbone based on the Attentive linear inverted residual bottleneck to enhance the backbone's feature extraction capability while meeting the lightweight requirements. In the detection neck part, we propose an efficient feature fusion network, CFPN. Two innovative and useful "bag of freebies" techniques, named BLL loss (Both Localization Loss) and GrayMixRGB, are applied to the Lightweight SSD’s training procedure. They further improve detector capability and efficiency without increasing the inference computation. As a result, Lightweight SSD achieves 74.4 mAP (mean Average Precision) with only 4.86M parameters on PASCAL VOC, having only 0.2× the parameters of, yet being 3.5 mAP more accurate than, the previous best lightweight detector. To our knowledge, Lightweight SSD is the state-of-the-art real-time lightweight detector for mobile devices with an edge Application-Specific Integrated Circuit (ASIC). Source code will be released after paper publication.

Paper Nr: 17
Title:

Reduction in Communication via Image Selection for Homomorphic Encryption-based Privacy-protected Person Re-identification

Authors:

Shogo Fukuda, Masashi Nishiyama and Yoshio Iwai

Abstract: We propose a method for reducing the amount of communication by selecting pedestrian images for privacy-protected person re-identification. Recently, it has become necessary to pay attention to how features corresponding to personal information can be protected. Owing to homomorphic encryption, which enables the distance between features to be computed without decryption, our method can use a cloud server on a public network while protecting personal information. However, we must consider the problem of the large amount of communication that occurs between camera clients and the server when homomorphic encryption is used. Our method aims to reduce this communication by selecting appropriate pedestrian images using reference leg postures. In our experiment, we confirmed that the amount of communication is dynamically reduced without significant degradation in the accuracy of privacy-protected person re-identification with homomorphic encryption.

Paper Nr: 27
Title:

Automatically Generating Websites from Hand-drawn Mockups

Authors:

João S. Ferreira, André Restivo and Hugo S. Ferreira

Abstract: Designers often use physical hand-drawn mockups to convey their ideas to stakeholders. Unfortunately, these sketches do not depict the exact final look and feel of web pages, and communication errors will often occur, resulting in prototypes that do not reflect the stakeholder’s vision. Multiple suggestions exist to tackle this problem, mainly in the translation of visual mockups to prototypes. Some authors propose end-to-end solutions that directly generate the final code from a single (black-box) Deep Neural Network. Others propose the use of object detectors, providing more control over the acquired elements but missing out on the mockup’s layout. Our approach provides a real-time solution that explores: (1) how to achieve a large variety of sketches that look indistinguishable from something a human would draw, (2) a pipeline that clearly separates the different responsibilities of extracting and constructing the hierarchical structure of a web mockup, (3) a methodology to segment and extract containers from mockups, (4) the usage of in-sketch annotations to provide more flexibility and control over the generated artifacts, and (5) an assessment of the synthetic dataset's impact on the ability to recognize diagrams actually drawn by humans. We start by presenting an algorithm that is capable of generating synthetic mockups. We trained our model (N=8400, Epochs=400) and subsequently fine-tuned it (N=74, Epochs=100) using real human-made diagrams. We accomplished a mAP of 95.37%, with 90% of the tests taking less than 430ms on modest commodity hardware (≈ 2.3fps). We further provide an ablation study with well-known object detectors to evaluate the synthetic dataset in isolation, showing that the generator achieves a mAP score of 95%, ≈1.5× higher than training using hand-drawn mockups alone.

Paper Nr: 30
Title:

On the Transferability of Winning Tickets in Non-natural Image Datasets

Authors:

Matthia Sabatelli, Mike Kestemont and Pierre Geurts

Abstract: We study the generalization properties of pruned models that are winners of the lottery ticket hypothesis on photorealistic datasets. We analyse their potential under conditions in which training data is scarce and comes from a non-photorealistic domain. More specifically, we investigate whether pruned models found on the popular CIFAR-10/100 and Fashion-MNIST datasets generalize to seven different datasets from the fields of digital pathology and digital heritage. Our results show that there are significant benefits in training sparse architectures over larger parametrized models, since in all of our experiments pruned networks significantly outperform their larger unpruned counterparts. These results suggest that winning initializations do contain inductive biases that are generic to neural networks, although, as our experiments on the biomedical datasets show, their generalization properties can be more limited than what has so far been observed in the literature.

Paper Nr: 54
Title:

Point Cloud Upsampling and Normal Estimation using Deep Learning for Robust Surface Reconstruction

Authors:

Rajat Sharma, Tobias Schwandt, Christian Kunert, Steffen Urban and Wolfgang Broll

Abstract: The reconstruction of real-world surfaces is in high demand in various applications. Most existing reconstruction approaches apply 3D scanners to create point clouds, which are generally sparse and of low density. These point clouds are triangulated and used for visualization, in combination with surface normals estimated by geometrical approaches. However, the quality of the reconstruction depends on the density of the point cloud and the estimation of the surface normals. In this paper, we present a novel deep learning architecture for point cloud upsampling that enables subsequent stable and smooth surface reconstruction. A noisy, low-density point cloud with corresponding point normals is used to estimate a point cloud of higher density with corresponding point normals. To this end, we propose a compound loss function that encourages the network to estimate points that lie on a surface, with normals that accurately predict the orientation of the surface. Our results show the benefit of estimating normals together with point positions. The resulting point cloud is smoother and more complete, and the final surface reconstruction is much closer to the ground truth.

Paper Nr: 72
Title:

Building Synthetic Simulated Environments for Configuring and Training Multi-camera Systems for Surveillance Applications

Authors:

Nerea Aranjuelo, Jorge García, Luis Unzueta, Sara García, Unai Elordi and Oihana Otaegui

Abstract: Synthetic simulated environments are gaining popularity in the Deep Learning era, as they can alleviate the effort and cost of two critical tasks in building multi-camera systems for surveillance applications: setting up the camera system to cover the use cases, and generating the labeled dataset to train the required Deep Neural Networks (DNNs). However, there are no simulated environments ready to solve these tasks for all kinds of scenarios and use cases. Typically, ‘ad hoc’ environments are built, which cannot easily be applied to other contexts. In this work we present a methodology to build, with little effort, synthetic simulated environments with sufficient generality to be usable in different contexts. Our methodology tackles the challenges of appropriately parameterizing scene configurations, of devising strategies to randomly generate a wide and balanced range of situations of interest for training DNNs with synthetic data, and of quickly capturing images from virtual cameras considering the rendering bottlenecks. We show a practical implementation example for the detection of incorrectly placed luggage in aircraft cabins, including a qualitative and quantitative analysis of the data generation process and its influence on DNN training, as well as the modifications required to adapt it to other surveillance contexts.

Paper Nr: 84
Title:

CUPR: Contrastive Unsupervised Learning for Person Re-identification

Authors:

Khadija Khaldi and Shishir K. Shah

Abstract: Most current person re-identification (Re-ID) algorithms require a large labeled training dataset to obtain good results. For example, domain adaptation-based approaches rely heavily on limited real-world data to alleviate the problem of domain shift. However, such assumptions are impractical and rarely hold, since the data is not freely accessible and requires expensive annotation. To address this problem, we propose a novel purely unsupervised learning approach using contrastive learning (CUPR). Our framework is a simple iterative approach that learns strong high-level features from raw pixels using contrastive learning and then performs clustering to generate pseudo-labels. We demonstrate that CUPR outperforms the unsupervised and semi-supervised state-of-the-art methods on the Market-1501 and DukeMTMC-reID datasets.
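The "learn features, then cluster for pseudo-labels" loop the abstract describes can be caricatured in a few lines. The sketch below is our illustration, not the CUPR code: plain k-means over frozen embeddings stands in for whatever clustering the authors actually use, producing the pseudo-labels that would supervise the next contrastive training round:

```python
import numpy as np

def assign_pseudo_labels(features, k, iters=20, seed=0):
    """Toy clustering step of an iterative 'contrast, then cluster' loop.

    features: (N, D) embeddings from the encoder.
    Returns integer pseudo-labels in [0, k) to be used as supervision
    for the next training round.
    """
    rng = np.random.default_rng(seed)
    # initialise centers from randomly chosen data points
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # assign each sample to its nearest center
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned samples
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(axis=0)
    return labels
```

In the full method, these labels would be fed back into feature learning, and the cluster/train cycle repeated until the labels stabilise.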

Paper Nr: 89
Title:

Latent Video Transformer

Authors:

Ruslan Rakhimov, Denis Volkhonskiy, Alexey Artemov, Denis Zorin and Evgeny Burnaev

Abstract: The video generation task can be formulated as a prediction of future video frames given some past frames. Recent generative models for videos face the problem of high computational requirements; some models require up to 512 Tensor Processing Units for parallel training. In this work, we address this problem by modeling the dynamics in a latent space. After the transformation of frames into the latent space, our model predicts the latent representation for the next frames in an autoregressive manner. We demonstrate the performance of our approach on the BAIR Robot Pushing and Kinetics-600 datasets. The approach reduces the training requirements to 8 Graphics Processing Units while maintaining comparable generation quality.

Paper Nr: 95
Title:

Feature Map Upscaling to Improve Scale Invariance in Convolutional Neural Networks

Authors:

Dinesh Kumar and Dharmendra Sharma

Abstract: Efforts by computer scientists to model the visual system have resulted in various techniques, of which the most notable is the Convolutional Neural Network (CNN). Whilst recognising an object at various scales is a trivial task for the human visual system, it remains a challenge for CNNs to achieve the same behaviour. Recent physiological studies reveal that the visual system uses a global-first response strategy in its recognition function, that is, it processes a wider area of a scene for recognition. This theory offers the potential of using global features to solve transformation invariance problems in CNNs. In this paper, we build on this theory to propose a global-first feature extraction model called Stacked Filter CNN (SFCNN) to improve scale-invariant classification of images. In SFCNN, to extract features from spatially larger areas of the target image, we develop a trainable feature extraction layer called Stacked Filter Convolutions (SFC). We achieve this by creating a convolution layer with a pyramid of stacked filters of different sizes. When convolved with an input image, the outputs are feature maps of different scales, which are then upsampled and used as global features. Our results show that by integrating the SFC layer within a CNN structure, the network outperforms a traditional CNN on the classification of scaled color images. Experiments using benchmark datasets indicate the potential effectiveness of our model for improving scale invariance in CNN networks.

Paper Nr: 102
Title:

Image Synthesisation and Data Augmentation for Safe Object Detection in Aircraft Auto-landing System

Authors:

Najda Vidimlic, Alexandra Levin, Mohammad Loni and Masoud Daneshtalab

Abstract: The feasibility of deploying object detection to interpret the environment is questioned in several mission-critical applications, raising concerns about the ability of object detectors to provide reliable and safe predictions of the operational environment regardless of weather and light conditions. The lack of a comprehensive dataset, which causes class imbalance and difficulties in detecting hard examples, is one of the main reasons for accuracy loss in safe object detection. Data augmentation, as an implicit regularisation technique, has been shown to significantly improve object detection by increasing both the diversity and the size of the training dataset. Despite the success of data augmentation in various computer vision tasks, applying data augmentation techniques to improve safety has not been sufficiently addressed in the literature. In this paper, we leverage a set of data augmentation techniques to improve the safety of object detection. Aircraft in-flight image data is used to evaluate the feasibility of our proposed solution in real-world safety-critical scenarios. To achieve our goal, we first generate a training dataset by synthesising images collected from in-flight recordings. Next, we augment the generated dataset to cover real weather and lighting changes. Introducing artificially produced distortions, also known as corruptions, has recently become an approach to enrich datasets; we introduce corruptions, as augmentations of weather and luminance, in combination with artificial artefacts, to achieve a comprehensive representation of an aircraft’s operational environment. Finally, we evaluate the impact of data augmentation on the studied dataset. Faster R-CNN with ResNet-50-FPN was used as the object detector for the experiments.
An AP@[IoU=.5:.95] score of 50.327% was achieved with the initial setup, while exposure to altered weather and lighting conditions yielded an 18.1% decrease. Introducing these conditions into the training set led to a 15.6% increase compared to the score achieved when exposed to the conditions without augmented training.

Paper Nr: 104
Title:

A Human Ear Reconstruction Autoencoder

Authors:

Hao Sun, Nick Pears and Hang Dai

Abstract: The ear, as an important part of the human head, has received much less attention than the human face in the area of computer vision. Inspired by previous work on monocular 3D face reconstruction using an autoencoder structure to achieve self-supervised learning, we aim to utilise such a framework to tackle the 3D ear reconstruction task, where more subtle and difficult curves and features are present in the 2D ear input images. Our Human Ear Reconstruction Autoencoder (HERA) system predicts 3D ear poses and shape parameters for 3D ear meshes, without any supervision of these parameters. To make our approach cover the variance of in-the-wild images, even grayscale images, we propose an in-the-wild ear colour model. The constructed end-to-end self-supervised model is then evaluated both on 2D landmark localisation performance and on the appearance of the reconstructed 3D ears.

Paper Nr: 129
Title:

Few-shot Linguistic Grounding of Visual Attributes and Relations using Gaussian Kernels

Authors:

Daniel Koudouna and Kasim Terzić

Abstract: Understanding complex visual scenes is one of the fundamental problems in computer vision, but learning in this domain is challenging due to the inherent richness of the visual world and the vast number of possible scene configurations. Current state-of-the-art approaches to scene understanding often employ deep networks which require large and densely annotated datasets. This goes against the seemingly intuitive learning abilities of humans and our ability to generalise from few examples to unseen situations. In this paper, we propose a unified framework for learning visual representations of words denoting attributes such as “blue” and relations such as “left of”, based on Gaussian models operating in a simple, unified feature space. The strength of our model is that it requires only a small number of weak annotations and generalizes easily to unseen situations, such as recognizing object relations in unusual configurations. We demonstrate the effectiveness of our model on the predicate detection task, where it outperforms the state of the art in both the normal and zero-shot scenarios while training on a dataset an order of magnitude smaller.
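A bare-bones version of the few-shot Gaussian idea the abstract describes (fit a Gaussian per word from a handful of feature vectors, then score new observations by log-density) could look like the following. This is a minimal sketch of the general technique over generic feature vectors, not the authors' model or feature space:

```python
import numpy as np

def fit_gaussian(examples, eps=1e-6):
    """Fit a full-covariance Gaussian to a handful of feature vectors.

    examples: (N, D) feature vectors for one word, e.g. "blue".
    A small ridge eps keeps the covariance invertible with few samples.
    """
    mu = examples.mean(axis=0)
    cov = np.cov(examples, rowvar=False) + eps * np.eye(examples.shape[1])
    return mu, cov

def log_score(x, mu, cov):
    """Log-density of feature vector x under the word's Gaussian."""
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d)
                   + logdet + len(x) * np.log(2 * np.pi))
```

At prediction time, each candidate attribute or relation word would be ranked by its `log_score` on the observed features, which is how a handful of weak annotations can already yield a usable ranking.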

Paper Nr: 134
Title:

Detecting Object Defects with Fusioning Convolutional Siamese Neural Networks

Authors:

Amr M. Nagy and László Czúni

Abstract: Recently, the combination of deep learning algorithms with visual inspection technology has made it possible to detect anomalies in objects, mimicking human visual inspection. While it offers precise and persistent monitoring with a minimal amount of human activity, applying the same solution to a wide variety of defect types is challenging. In this paper, a new convolutional siamese neural model is presented to recognize different types of defects. One advantage of the proposed convolutional siamese neural network is that it can be used for new object types without re-training, with much better performance than other siamese networks: it can generalize the knowledge of defect types and apply it to new object classes. The proposed approach is tested, with good results, on two different data sets: one contains traffic signs of different types with different distortions; the other is a set of metal disk-shaped castings with and without defects.

Paper Nr: 140
Title:

Towards Real-time Object Recognition and Pose Estimation in Point Clouds

Authors:

Marlon Marcon, Olga P. Bellon and Luciano Silva

Abstract: Object recognition and 6DoF pose estimation are quite challenging tasks in computer vision applications. Despite their effectiveness in such tasks, standard methods deliver far from real-time processing rates. This paper presents a novel pipeline to estimate a fine 6DoF pose of objects, applied to realistic scenarios in real time. Our proposal is split into three main parts. First, a Color feature classification module leverages pre-trained CNN color features, trained on ImageNet, for object detection. A Feature-based registration module then conducts a coarse pose estimation, and finally, a Fine-adjustment step performs an ICP-based dense registration. Our proposal achieves, in the best case, an accuracy of almost 83% on the RGB-D Scenes dataset. Regarding processing time, object detection runs at up to 90 FPS, and pose estimation at almost 14 FPS in a full execution strategy. We discuss how, thanks to the proposal’s modularity, full execution can occur only when necessary, with a scheduled execution that unlocks real-time processing even in multitask situations.
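The Fine-adjustment step above is ICP-based. Once point correspondences are fixed, the core of each ICP iteration is a closed-form rigid alignment; a generic Kabsch/Procrustes sketch of that sub-step (an illustration of the standard technique, not the paper's implementation) is:

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation R and translation t with R @ src_i + t ≈ dst_i.

    src, dst: (N, 3) corresponding points. Inside ICP, correspondences
    come from nearest-neighbour matching and this is re-solved per iteration.
    """
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    # cross-covariance of the centered point sets
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_dst - R @ c_src
    return R, t
```

Iterating "match nearest neighbours, re-solve `rigid_align`, transform" until the pose stops changing is what makes the dense registration converge from the coarse estimate.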

Paper Nr: 144
Title:

A Neural Network with Adversarial Loss for Light Field Synthesis from a Single Image

Authors:

Simon Evain and Christine Guillemot

Abstract: This paper describes a lightweight neural network architecture with an adversarial loss for generating a full light field from one single image. The method is able to estimate disparity maps and automatically identify occluded regions from one single image thanks to a disparity confidence map based on forward-backward consistency checks. The disparity confidence map also controls the use of an adversarial loss for occlusion handling. The approach outperforms reference methods when trained and tested on light field data. Besides, we also designed the method so that it can efficiently generate a full light field from one single image, even when trained only on stereo data. This allows us to generalize our approach for view synthesis to more diverse data and semantics.

Paper Nr: 167
Title:

Domain Adaptation for Traffic Density Estimation

Authors:

Luca Ciampi, Carlos Santiago, Joao P. Costeira, Claudio Gennaro and Giuseppe Amato

Abstract: Convolutional Neural Networks have produced state-of-the-art results for a multitude of computer vision tasks under supervised learning. However, the crux of these methods is the need for a massive amount of labeled data to guarantee that they generalize well to diverse testing scenarios. In many real-world applications, there is indeed a large domain shift between the distributions of the train (source) and test (target) domains, leading to a significant drop in performance at inference time. Unsupervised Domain Adaptation (UDA) is a class of techniques that aims to mitigate this drawback without the need for labeled data in the target domain. This makes it particularly useful for tasks in which acquiring new labeled data is very expensive, such as semantic and instance segmentation. In this work, we propose an end-to-end CNN-based UDA algorithm for traffic density estimation and counting, based on adversarial learning in the output space. Density estimation is one of those tasks requiring per-pixel annotated labels and, therefore, a lot of human effort. We conduct experiments considering different types of domain shifts, and we make publicly available two new datasets for the vehicle counting task that were also used for our tests. One of them, the Grand Traffic Auto dataset, is a synthetic collection of images, obtained using the graphical engine of the Grand Theft Auto video game, automatically annotated with precise per-pixel labels. Experiments show a significant improvement using our UDA algorithm compared to the model’s performance without domain adaptation. The code, the models and the datasets are freely available at https://ciampluca.github.io/unsupervised counting.

Paper Nr: 178
Title:

Contextualise, Attend, Modulate and Tell: Visual Storytelling

Authors:

Zainy M. Malakan, Nayyer Aafaq, Ghulam M. Hassan and Ajmal Mian

Abstract: Automatic natural language description of visual content is an emerging and fast-growing topic that has attracted extensive research attention recently. However, different from typical ‘image captioning’ or ‘video captioning’, coherent story generation from a sequence of images is a relatively less studied problem. Story generation poses the challenges of diverse language style, context modeling, coherence and latent concepts that are not even visible in the visual content. Contemporary methods fall short of modeling the context and visual variance, and generate stories devoid of language coherence among multiple sentences. To this end, we propose a novel framework Contextualize, Attend, Modulate and Tell (CAMT) that models the temporal relationship among the image sequence in forward as well as backward direction. The contextual information and the regional image features are then projected into a joint space and then subjected to an attention mechanism that captures the spatio-temporal relationships among the images. Before feeding the attentive representations of the input images into a language model, gated modulation between the attentive representation and the input word embeddings is performed to capture the interaction between the inputs and their context. To the best of our knowledge, this is the first method that exploits such a modulation technique for story generation. We evaluate our model on the Visual Storytelling Dataset (VIST) employing both automatic and human evaluation measures and demonstrate that our CAMT model achieves better performance than existing baselines.

Paper Nr: 192
Title:

Fairer Evaluation of Zero Shot Action Recognition in Videos

Authors:

Kaiqiang Huang, Sarah J. Delany and Susan Mckeever

Abstract: Zero-shot learning (ZSL) for human action recognition (HAR) aims to recognise video action classes that have never been seen during model training. This is achieved by building mappings between visual and semantic embeddings. These visual embeddings are typically provided by a pre-trained deep neural network (DNN). The premise of ZSL is that the training and testing classes should be disjoint. In the parallel domain of ZSL for image input, the widespread poor evaluation protocol of pre-training on ZSL test classes has been highlighted. This is akin to providing a sneak preview of the evaluation classes. In this work, we investigate the extent to which this evaluation protocol has been used in ZSL research for human action recognition. We show that in the field of ZSL for HAR, accuracies for overlapping classes are being boosted by between 5.75% and 51.94%, depending on the visual and semantic features used, as a result of this flawed evaluation protocol. To assist other researchers in avoiding this problem in the future, we provide annotated versions of the relevant benchmark ZSL test datasets in the HAR field, UCF101 and HMDB51, highlighting overlaps with pre-training datasets in the field.

Paper Nr: 211
Title:

Global Point Cloud Descriptor for Place Recognition in Indoor Environments

Authors:

Jacek Komorowski, Grzegorz Kurzejamski, Monika Wysoczańska and Tomasz Trzcinski

Abstract: This paper presents an approach for learning a discriminative 3D point cloud descriptor from RGB-D images for place recognition in indoor environments. Existing methods, such as PointNetVLAD, PCAN or LPD-Net, are aimed at outdoor environments and operate on 3D point clouds from LiDAR. They are based on the PointNet architecture and are designed to process only the scene geometry, without considering appearance (the RGB component). In this paper we present a place recognition method based on a sparse volumetric representation that processes scene appearance in addition to geometry. We also investigate whether using two modalities, appearance (RGB data) and geometry (3D structure), improves the discriminativity of the resultant global descriptor.

Paper Nr: 212
Title:

Towards Combined Open Set Recognition and Out-of-Distribution Detection for Fine-grained Classification

Authors:

Alexander Gillert and Uwe F. von Lukas

Abstract: We analyze the two very similar problems of Out-of-Distribution (OOD) Detection and Open Set Recognition (OSR) in the context of fine-grained classification. Both problems are about detecting object classes that a classifier was not trained on, but while the former aims to reject invalid inputs, the latter aims to detect valid but unknown classes. Previous OOD detection and OSR methods have been evaluated mostly on very simple datasets or datasets with large inter-class variance, and they perform poorly in the fine-grained setting. In our experiments, we show that object detection works well for recognizing invalid inputs, and that techniques from the field of fine-grained classification, like individual part detection or zooming into discriminative local regions, are helpful for fine-grained OSR.

Paper Nr: 214
Title:

Analysis of Recent Re-Identification Architectures for Tracking-by-Detection Paradigm in Multi-Object Tracking

Authors:

Haruya Ishikawa, Masaki Hayashi, Trong H. Phan, Kazuma Yamamoto, Makoto Masuda and Yoshimitsu Aoki

Abstract: Person re-identification is a vital module of the tracking-by-detection framework for online multi-object tracking. Despite recent advances in multi-object tracking and person re-identification, inadequate attention has been given to integrating these technologies into a robust multi-object tracker. In this work, we combine modern state-of-the-art re-identification models and modeling techniques within the basic tracking-by-detection framework and benchmark them on heavily occluded scenes to understand their effect. We hypothesize that temporal modeling is crucial for training robust re-identification models, since they are conditioned on sequences containing occlusions. Along with traditional image-based re-identification methods, we analyze temporal modeling methods used in video-based re-identification tasks. We also train re-identification models with different embedding methods, including the triplet loss, and analyze their effect. We benchmark the re-identification models on the challenging MOT20 dataset, containing crowded scenes with various occlusions. We provide a thorough assessment and investigation of modern re-identification modeling methods and show that these methods are, in fact, effective for multi-object tracking. Compared to baseline methods, results show that these models provide robust re-identification, as evidenced by improvements in the number of identity switches, MOTA, IDF1, and other metrics.

Short Papers
Paper Nr: 2
Title:

Facial Exposure Quality Estimation for Aesthetic Evaluation

Authors:

Mathias Gudiksen, Sebastian Falk, Lasse N. Hansen, Frederik B. Jensen and Andreas Møgelmose

Abstract: In recent years, computer vision systems have excelled in detection and classification problems. Many vision tasks, however, are not easily reduced to such a problem. Often, more subjective measures must be taken into account. Such problems have seen significantly less research. In this paper, we tackle the problem of aesthetic evaluation of photographs, particularly with respect to exposure. We propose and compare three methods for estimating the exposure value of a photograph using regression: an SVM on handcrafted features, an NN using image histograms, and the VGG19 CNN. A dataset containing 844 images with different exposure values was created. The methods were tested on both the full photographs and a cropped version of the dataset. Our methods estimate the exposure value of our test set with an MAE of 0.496 using the SVM, an MAE of 0.498 using the NN, and an MAE of 0.566 using VGG19, on the cropped dataset. Without a face detector we achieve an MAE of 0.702 for the SVM, 0.766 for the NN, and 1.560 for VGG19. The models based on handcrafted features or histograms both outperform the CNN in the case of simpler scenes, with the histogram slightly outperforming the handcrafted features. However, on more complicated scenes, the CNN shows promise. In most cases, handcrafted features seem to be the better option; despite this, the use of CNNs cannot be ruled out entirely.

Paper Nr: 11
Title:

Learning Unsupervised Cross-domain Image-to-Image Translation using a Shared Discriminator

Authors:

Rajiv Kumar, Rishabh Dabral and G. Sivakumar

Abstract: Unsupervised image-to-image translation is used to transform images from a source domain into images in a target domain without using source-target image pairs. Promising results have been obtained for this problem in an adversarial setting using two independent GANs and attention mechanisms. We propose a new method that uses a single discriminator shared between the two GANs, which improves the overall efficacy. We assess the qualitative and quantitative results on image transfiguration, a cross-domain translation task, in a setting where the target domain shares similar semantics with the source domain. Our results indicate that even without attention mechanisms, our method performs on par with attention-based methods and generates images of comparable quality.

Paper Nr: 25
Title:

Crowd Behavior Analysis based on Convolutional Neural Network: Social Distancing Control COVID-19

Authors:

Fatma Bouhlel, Hazar Mliki and Mohamed Hammami

Abstract: The outbreak of COVID-19 and the lack of pharmaceutical interventions have increased the spread of the disease. Since no vaccine or treatment is yet available, social distancing represents a good strategy to control the propagation of this pandemic and learn to live with it. In this context, we introduce a new approach for crowd behavior analysis from UAV-captured video sequences in order to monitor social distancing. The proposed approach involves two methods: a macroscopic method and a microscopic method. The macroscopic method estimates crowd density by classifying aerial frame patches into four categories: Dense, Sparse, Medium, and None. The microscopic method, in contrast, detects and tracks humans and then computes the distance between them. The quantitative and qualitative results validate the performance of our methods compared to state-of-the-art references.
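The final step of the microscopic method, checking pairwise distances between detected people, can be sketched as follows (a minimal illustration; the ground-plane coordinates and the 2-metre threshold are assumptions, not values from the paper):

```python
import itertools
import math

def distancing_violations(centroids, min_dist=2.0):
    """Return index pairs of people closer than `min_dist`, in the same
    units as the centroids (e.g. metres on the estimated ground plane)."""
    return [
        (i, j)
        for (i, a), (j, b) in itertools.combinations(enumerate(centroids), 2)
        if math.dist(a, b) < min_dist
    ]

# Three tracked people: the first two stand 1.5 m apart, the third far away.
people = [(0.0, 0.0), (1.5, 0.0), (10.0, 10.0)]
print(distancing_violations(people))  # [(0, 1)]
```

A real system would first project image-space detections onto the ground plane before measuring distances.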

Paper Nr: 28
Title:

Three-step Alignment Approach for Fitting a Normalized Mask of a Person Rotating in A-Pose or T-Pose Essential for 3D Reconstruction based on 2D Images and CGI Derived Reference Target Pose

Authors:

Gerald A. Zwettler, Christoph Praschl, David Baumgartner, Tobias Zucali, Dora Turk, Martin Hanreich and Andreas Schuler

Abstract: The 3D silhouette reconstruction of a human body rotating in front of a monocular camera system is a very challenging task due to elastic deformation and positional mismatch from body motion. Nevertheless, knowledge of the 3D body shape is key information for the precise determination of one’s clothing sizes, e.g. for precise shopping to reduce the number of return shipments in online retail. In this paper, a novel three-step alignment process is presented, utilizing As-Rigid-As-Possible (ARAP) transformations to normalize the body joint skeleton derived from OpenPose with a CGI-rendered reference model in A- or T-pose. With further distance-map-accelerated registration steps, positional mismatches and inaccuracies from the OpenPose joint estimation are compensated, thus allowing for 3D silhouette reconstruction of a moving and elastic object without the need for sophisticated statistical shape models. Tests on both artificial and real-world data generally prove the practicability of this approach, with all three alignment/registration steps essential and adequate for 3D silhouette reconstruction data normalization.

Paper Nr: 31
Title:

Weakly-supervised Human-object Interaction Detection

Authors:

Masaki Sugimoto, Ryosuke Furuta and Yukinobu Taniguchi

Abstract: Human-Object Interaction detection is the image recognition task of detecting pairs (a person and an object) in an image and estimating the relationships between them, such as “holding” or “riding”. Existing methods based on supervised learning require a lot of effort to create training data because they need the supervision provided as Bounding Boxes (BBs) of people and objects and verb labels that represent the relationships. In this paper, we extend Proposal Cluster Learning (PCL), a weakly-supervised object detection method, for a new task called weakly-supervised human-object interaction detection, where only the verb labels are assigned to the entire images (i.e., no BBs are given) during the training. Experiments show that the proposed method can successfully learn to detect the BBs of people and objects and the verb labels between them without instance-level supervision.

Paper Nr: 41
Title:

ChartSight: An Automated Scheme for Assisting Visually Impaired in Understanding Scientific Charts

Authors:

Mandhatya Singh and Puneet Goyal

Abstract: Visual or non-textual components like charts, graphs, and plots are frequently used to represent latent information in digital documents. These components support better comprehension of the underlying complex information. However, these data visualization techniques are of little use to the visually impaired. Visually impaired people, especially in developing countries, rely on braille, tactile, or other conventional tools for reading purposes. Through these approaches, the understanding of non-textual components is a burdensome process with serious limitations. In this paper, we present ChartSight, an automated and interactive chart understanding system. ChartSight extracts and classifies document images into different chart categories, and then uses heuristics-based content extraction methods optimized for line and bar charts. It finally presents the summarized content in audio format to visually impaired users. We present a densely connected convolutional network-based data-driven scheme for the chart classification problem, which shows comparatively better performance than the baseline models. Multiple datasets of chart images are used for the performance analysis. A comparative analysis of supporting features has also been performed against other existing approaches.

Paper Nr: 44
Title:

No Need for a Lab: Towards Multi-sensory Fusion for Ambient Assisted Living in Real-world Living Homes

Authors:

Alessandro Masullo, Toby Perrett, Dima Damen, Tilo Burghardt and Majid Mirmehdi

Abstract: The majority of the Ambient Assisted Living (AAL) systems, designed for home or lab settings, monitor one participant at a time – this is to avoid the complexities of pre-fusion correspondence of different sensors since carers, guests, and visitors may be involved in real world scenarios. Previous work from (Masullo et al., 2020) presented a solution to this problem that involves matching video sequences of silhouettes to accelerations from wearable sensors to identify members of a household while respecting their privacy. In this work, we elevate this approach to the next stage by improving its architecture and combining it with a tracking functionality that makes it possible to be deployed in real-world homes. We present experiments on a new dataset recorded in participants’ own houses, which includes multiple participants visited by guests, and show an auROC score of 90.2%. We also show a novel first example of subject-tailored health monitoring measurement by applying our methodology to a sit-to-stand detector to generate clinically relevant rehabilitation trends.

Paper Nr: 49
Title:

Improving Car Model Classification through Vehicle Keypoint Localization

Authors:

Alessandro Simoni, Andrea D’Eusanio, Stefano Pini, Guido Borghi and Roberto Vezzani

Abstract: In this paper, we present a novel multi-task framework which aims to improve the performance of car model classification leveraging visual features and pose information extracted from single RGB images. In particular, we merge the visual features obtained through an image classification network and the features computed by a model able to predict the pose in terms of 2D car keypoints. We show how this approach considerably improves the performance on the model classification task testing our framework on a subset of the Pascal3D+ dataset containing the car classes. Finally, we conduct an ablation study to demonstrate the performance improvement obtained with respect to a single visual classifier network.

Paper Nr: 58
Title:

Long-term Behaviour Recognition in Videos with Actor-focused Region Attention

Authors:

Luca Ballan, Ombretta Strafforello and Klamer Schutte

Abstract: Long-term activities involve humans performing complex, minutes-long actions. Unlike traditional action recognition, complex activities are normally composed of a set of sub-actions that can appear in different order, duration, and quantity. These aspects introduce a large intra-class variability that can be hard to model. Our approach aims to adaptively capture and learn the importance of spatial and temporal video regions for minutes-long activity classification. Inspired by previous work on Region Attention, our architecture embeds the spatio-temporal features from multiple video regions into a compact fixed-length representation. These features are extracted with a specially fine-tuned 3D convolutional backbone. Additionally, driven by the prior assumption that the most discriminative locations in the videos are centered around the human carrying out the activity, we introduce an Actor Focus mechanism to enhance the feature extraction in both the training and inference phases. Our experiments show that the multi-regional fine-tuned 3D-CNN, topped with Actor Focus and Region Attention, largely improves the performance of baseline 3D architectures, achieving state-of-the-art results on Breakfast, a well-known long-term activity recognition benchmark.

Paper Nr: 60
Title:

A Multi-level Rank Correlation Measure for Image Retrieval

Authors:

Nikolas Gomes de Sá, Lucas P. Valem and Daniel G. Pedronette

Abstract: Accurately ranking the most relevant elements in a given scenario often represents a central challenge in many applications, composing the core of retrieval systems. Since ranking structures encode relevant similarity information, measuring how correlated two ranked lists are represents a fundamental task with diversified applications. In this work, we propose a new rank correlation measure called the Multi-Level Rank Correlation Measure (MLCM), which employs a novel approach based on a multi-level analysis for estimating the correlation between ranked lists. While traditional weighted measures assign more relevance to top positions, our proposed approach goes beyond by considering the positions at different levels in the ranked lists. The effectiveness of the proposed measure was assessed in unsupervised and weakly supervised learning tasks for image retrieval. The experimental evaluation considered 6 correlation measures as baselines, 3 different image datasets, and multiple features. The results are competitive or, in most cases, superior to the baselines, achieving significant effectiveness gains.
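A simple member of the family of top-weighted measures MLCM belongs to can be sketched as the average Jaccard overlap of top-k prefixes, so agreement near the top of the lists weighs more than agreement further down (an illustrative baseline only, not the authors' MLCM):

```python
def prefix_jaccard_correlation(rank_a, rank_b, depth=None):
    """Top-weighted correlation between two ranked lists: the mean Jaccard
    overlap of the top-k prefixes for k = 1..depth. Items appearing early
    contribute to every prefix, so top positions dominate the score."""
    depth = depth or min(len(rank_a), len(rank_b))
    total = 0.0
    for k in range(1, depth + 1):
        a, b = set(rank_a[:k]), set(rank_b[:k])
        total += len(a & b) / len(a | b)
    return total / depth

print(prefix_jaccard_correlation([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(prefix_jaccard_correlation([1, 2, 3, 4], [4, 3, 2, 1]))  # 0.375
```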

Paper Nr: 63
Title:

Multimodal Neural Network for Sentiment Analysis in Embedded Systems

Authors:

Quentin Portes, José M. Carvalho, Julien Pinquier and Frédéric Lerasle

Abstract: Multimodal neural networks for sentiment analysis use video, text, and audio. Processing these three modalities tends to create computationally heavy models. In the embedded context, all resources, and specifically computational resources, are restricted. In this paper, we design models dealing with these two antagonistic issues. We focused our work on reducing the number of model input features and the size of the different neural network architectures. The major contribution of this paper is the design of a specific 3D residual network instead of a basic 3D convolution. Our experiments are focused on the well-known MOSI dataset (Multimodal Corpus of Sentiment Intensity). The objective is to achieve results similar to the state of the art. Our best multimodal approach achieves an F1 score of 80% with the number of parameters reduced by a factor of 2.2 and the memory load reduced by a factor of 13.8, compared to the state of the art. We designed five models, one for each modality (i.e. video, audio, and text) and one for each fusion technique. The two high-level multimodal fusions presented in this paper are based on evidence theory and on a neural network approach.

Paper Nr: 69
Title:

A Snapshot-based Approach for Self-supervised Feature Learning and Weakly-supervised Classification on Point Cloud Data

Authors:

Xingye Li and Zhigang Zhu

Abstract: Manually annotating complex scene point cloud datasets is both costly and error-prone. To reduce the reliance on labeled data, we propose a snapshot-based self-supervised method to enable direct feature learning on the unlabeled point cloud of a complex 3D scene. A snapshot is defined as a collection of points sampled from the point cloud scene. It could be a real view of a local 3D scan directly captured from the real scene, or a virtual view of such from a large 3D point cloud dataset. First, the snapshots go through a self-supervised pipeline including both part contrasting and snapshot clustering for feature learning. Then a weakly-supervised approach is implemented by training a standard SVM classifier on the learned features with a small fraction of labeled data. We evaluate the weakly-supervised approach for point cloud classification by using varying numbers of labeled data and study the minimal number of labeled data needed for a successful classification. Experiments are conducted on three public point cloud datasets, and the results show that our method is capable of learning effective features from the complex scene data without any labels.

Paper Nr: 70
Title:

Early Bird: Loop Closures from Opposing Viewpoints for Perceptually-aliased Indoor Environments

Authors:

Satyajit Tourani, Dhagash Desai, Udit S. Parihar, Sourav Garg, Ravi K. Sarvadevabhatla, Michael Milford and K. M. Krishna

Abstract: Significant recent advances have been made in Visual Place Recognition (VPR), feature correspondence and localization due to deep-learning-based methods. However, existing approaches tend to address, partially or fully, only one of two key challenges: viewpoint change and perceptual aliasing. In this paper, we present novel research that simultaneously addresses both challenges by combining deep-learnt features with geometric transformations based on domain knowledge about navigation on a ground-plane, without specialized hardware (e.g. downwards facing cameras, etc.). In particular, our integration of VPR with SLAM by leveraging the robustness of deep-learnt features and our homography-based extreme viewpoint invariance significantly boosts the performance of VPR, feature correspondence and pose graph sub-modules of the SLAM pipeline. We demonstrate a localization system capable of state-of-the-art performance despite perceptual aliasing and extreme 180-degree-rotated viewpoint change in a range of real-world and simulated experiments. Our system is able to achieve early loop closures that prevent significant drifts in SLAM trajectories.

Paper Nr: 73
Title:

3D Object Classification via Part Graphs

Authors:

Florian Teich, Timo Lüddecke and Florentin Wörgötter

Abstract: 3D object classification often requires extraction of a global shape descriptor in order to predict the object class. In this work, we propose an alternative part-based approach. This involves automatically decomposing objects into semantic parts, creating part graphs, and employing graph kernels on these graphs to classify objects based on the similarity of the part graphs. By employing this bottom-up approach, common substructures across objects from training and testing sets should be easily identifiable and may be used to compute similarities between objects. We compare our approach to state-of-the-art methods relying on global shape description and obtain superior performance through the use of part graphs.

Paper Nr: 76
Title:

Graph Convolutional Networks Skeleton-based Action Recognition for Continuous Data Stream: A Sliding Window Approach

Authors:

Mickael Delamare, Cyril Laville, Adnane Cabani and Houcine Chafouk

Abstract: This paper introduces a novel deep learning-based approach to human action recognition. The method consists of a Spatio-Temporal Graph Convolutional Network operating in real time thanks to a sliding window approach. The proposed architecture applies a fixed-size window throughout the training, validation, and testing process with a Spatio-Temporal Graph Convolutional Network for skeleton-based action recognition. We evaluate our architecture on two available continuous-stream action recognition datasets, the Online Action Detection dataset and the UOW Online Action 3D dataset. The method is used for temporal detection and classification of the performed actions in real time.
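The sliding-window mechanism over a continuous skeleton stream can be sketched as a generator (a minimal illustration; the window size and stride below are hypothetical, not the paper's settings):

```python
from collections import deque

def sliding_windows(frame_stream, window_size=30, stride=1):
    """Yield fixed-size windows of consecutive skeleton frames from a
    continuous stream; each window would be classified by the ST-GCN."""
    buf = deque(maxlen=window_size)
    for t, frame in enumerate(frame_stream):
        buf.append(frame)
        # Emit a window once the buffer is full, every `stride` frames.
        if len(buf) == window_size and (t - window_size + 1) % stride == 0:
            yield list(buf)

# Toy stream of 5 "frames" with window 3 and stride 1 -> 3 windows.
wins = list(sliding_windows(range(5), window_size=3, stride=1))
print(wins)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```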

Paper Nr: 85
Title:

Non-Maximum Suppression for Unknown Class Objects using Image Similarity

Authors:

Yoshiaki Homma, Toshiki Kikuchi and Yuko Ozasa

Abstract: As a post-processing step for object detection, non-maximum suppression (NMS) has been widely used for many years. Greedy-NMS, one of the most widely used NMS methods, is effective if the class of objects is known but not if it is unknown. To overcome this drawback, we propose an NMS method using an image similarity index that is independent of learning. Even if two bounding boxes overlap heavily, they are considered to locate different objects if the similarity of the image regions inside the boxes is low. In order to evaluate the proposed method, we built a new dataset containing unknown-class objects. Our experimental results show that the proposed method can reduce the rate of undetected unknown-class objects compared with greedy-NMS.
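The core idea, suppressing an overlapping box only when its image content is also similar, can be sketched as a variant of greedy-NMS (an illustrative sketch; the similarity function, thresholds, and box format are assumptions, not the paper's implementation):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def similarity_nms(boxes, scores, crop_similarity, iou_thr=0.5, sim_thr=0.7):
    """Greedy NMS variant: a high-overlap box is suppressed only when the
    image content inside the two boxes is also similar, so overlapping but
    dissimilar (possibly unknown-class) objects survive."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) < iou_thr
               or crop_similarity(boxes[i], boxes[k]) < sim_thr
               for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
always_similar = lambda a, b: 1.0  # stand-in: a real index compares crop contents
print(similarity_nms(boxes, scores, always_similar))  # [0, 2] -> plain greedy-NMS
```

With a similarity function that always returns 1.0 this reduces to ordinary greedy-NMS; a real similarity index would compare the pixel content of the two crops.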

Paper Nr: 92
Title:

Detection of Distraction-related Actions on DMD: An Image and a Video-based Approach Comparison

Authors:

Paola N. Cañas, Juan D. Ortega, Marcos Nieto and Oihana Otaegui

Abstract: The recently presented Driver Monitoring Dataset (DMD) extends research lines for driver monitoring systems. We intend to explore this dataset and apply commonly used action recognition methods to this specific context, from image-based to video-based analysis. Specifically, we aim to detect driver distraction by applying action recognition techniques to classify a list of distraction-related activities. This is now possible thanks to the DMD, which offers recordings of distracted drivers in video format. A comparison between different state-of-the-art models for image and video classification is reviewed. We also discuss the feasibility of implementing image-based or video-based models in a real-world driver monitoring system. Preliminary results are presented in this article as a point of reference for future work on the DMD.

Paper Nr: 93
Title:

Lightweight Filtering of Noisy Web Data: Augmenting Fine-grained Datasets with Selected Internet Images

Authors:

Julia Böhlke, Dimitri Korsch, Paul Bodesheim and Joachim Denzler

Abstract: Despite the availability of huge annotated benchmark datasets and the potential of transfer learning, i.e., fine-tuning a pre-trained neural network to a specific task, deep learning struggles in applications where no labeled datasets of sufficient size exist. This issue affects fine-grained recognition tasks the most since correct image data annotations are expensive and require expert knowledge. Nevertheless, the Internet offers a lot of weakly annotated images. In contrast to existing work, we suggest a new lightweight filtering strategy to exploit this source of information without supervision and minimal additional costs. Our main contributions are specific filter operations that allow the selection of downloaded images to augment a training set. We filter test duplicates to avoid a biased evaluation of the methods, and two types of label noise: cross-domain noise, i.e., images outside any class in the dataset, and cross-class noise, a form of label-swapping noise. We evaluate our suggested filter operations in a controlled environment and demonstrate our methods’ effectiveness with two small annotated seed datasets for moth species recognition. While noisy web images consistently improve classification accuracies, our filtering methods retain a fraction of the data such that high accuracies are achieved with a significantly smaller training dataset.
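One simple instance of the test-duplicate filtering idea is perceptual hashing, shown here purely as an illustration (the paper's actual filter operations are richer than this difference-hash sketch, and the toy 2x3 "images" are hypothetical):

```python
def dhash(pixels):
    """Difference hash of a grayscale image given as rows of pixel values:
    one bit per horizontally adjacent pixel pair (1 if brightness rises)."""
    bits = []
    for row in pixels:
        bits.extend(1 if row[i] < row[i + 1] else 0 for i in range(len(row) - 1))
    return tuple(bits)

def filter_test_duplicates(web_images, test_images, max_hamming=2):
    """Drop downloaded images whose hash lies within `max_hamming` bits of
    any test-set image, avoiding a biased evaluation."""
    test_hashes = [dhash(t) for t in test_images]
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return [
        img for img in web_images
        if all(hamming(dhash(img), h) > max_hamming for h in test_hashes)
    ]

test_set = [[[10, 20, 30], [30, 20, 10]]]                       # one test image
web = [[[10, 20, 30], [30, 20, 10]], [[30, 20, 10], [10, 20, 30]]]
print(len(filter_test_duplicates(web, test_set)))  # 1: the duplicate is dropped
```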

Paper Nr: 98
Title:

Automatic Annotation and Segmentation of Sign Language Videos: Base-level Features and Lexical Signs Classification

Authors:

Hussein Chaaban, Michèle Gouiffès and Annelies Braffort

Abstract: The automatic recognition of sign languages is the main focus of most works in the field, which explains the growing demand for annotated data to train the dedicated models. In this paper, we present a semi-automatic annotation system for sign languages. Such automation will not only help to create training data but will also reduce the processing time and the subjectivity of manual annotations done by linguists in order to study sign language. The system analyses hand shapes, hand speed variations, and face landmarks to annotate base-level features and to separate the different signs. In a second stage, signs are classified into two types, lexical (i.e. present in a dictionary) or iconic (illustrative), using a probabilistic model. The results show that our system is partially capable of automatically annotating the video sequence, with an F1 score of 0.68 for lexical sign annotation and an error of 3.8 frames for sign segmentation. An expert validation of the annotations is still needed.

Paper Nr: 110
Title:

Collaborative Learning of Generative Adversarial Networks

Authors:

Takuya Tsukahara, Tsubasa Hirakawa, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: Generative adversarial networks (GANs) adversarially train generative and discriminative models to generate nonexistent images. Common GANs use only a single generative model and a single discriminative model, which is considered to maximize their performance. On the other hand, in image classification tasks, recognition accuracy improves through collaborative learning, in which knowledge transfer is conducted among several neural networks. Therefore, we propose a method that uses GANs with multiple generative models and one discriminative model to conduct collaborative learning while transferring information among the generative models. We conducted experiments to evaluate the proposed method, and the results indicate that the images produced by the proposed method are improved in both quality and diversity.

Paper Nr: 117
Title:

Occluded Iris Recognition using SURF Features

Authors:

Anca Ignat and Ioan Păvăloi

Abstract: In this paper we study the problem of recognizing iris images with missing information. Our approach uses keypoint-related features for solving this problem. We present recognition results obtained using SURF (Speeded-Up Robust Features) features extracted from occluded iris images. We tested the influence on the recognition rate of two threshold parameters, one linked with the SURF extraction process and the other with the keypoint matching scheme. The proposed method was tested on the UPOL iris database using eleven levels of occlusion. The experiments show that the method described in this paper produces better results than the Daugman procedure on all considered datasets, as well as better results than those we previously obtained using SIFT features. Comparisons were also performed with iris recognition results that use colour for iris characterization, computed on the same databases of irises with different levels of missing information.
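A common thresholded keypoint matching scheme is Lowe's ratio test; a minimal, descriptor-agnostic sketch follows (the toy 2D descriptors and the 0.75 ratio are illustrative assumptions, not the paper's settings):

```python
import math

def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Lowe-style ratio test: keep a match only when the nearest descriptor
    in the other set is clearly closer than the second nearest, discarding
    ambiguous correspondences."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = sorted((math.dist(d, e), j) for j, e in enumerate(desc_b))
        if len(dists) > 1 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches

# Descriptor 0 has one clear nearest neighbour; descriptor 1 has two
# near-equidistant candidates and is dropped as ambiguous.
a = [(0.0, 0.0), (5.0, 5.0)]
b = [(0.1, 0.0), (9.0, 9.0), (5.0, 5.2), (5.2, 5.0)]
print(ratio_test_matches(a, b))  # [(0, 0)]
```

In a recognition setting, the number of surviving matches between a probe iris and each gallery iris would serve as the similarity score.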

Paper Nr: 158
Title:

Multi-Task Architecture with Attention for Imaging Atmospheric Cherenkov Telescope Data Analysis

Authors:

Mikaël Jacquemont, Thomas Vuillaume, Alexandre Benoit, Gilles Maurin and Patrick Lambert

Abstract: Gamma-ray reconstruction from Cherenkov telescope data is multi-task by nature in astrophysics. The image recorded in the Cherenkov camera pixels relates to the type, energy, incoming direction and distance of a particle from a telescope observation. We propose γ-PhysNet, a physically inspired multi-task deep neural network for gamma/proton particle classification, and gamma energy and direction reconstruction. We compare its performance with single task networks on Monte Carlo simulated data and demonstrate the interest of reconstructing the impact point as an auxiliary task. We also show that γ-PhysNet outperforms a widespread analysis method for gamma-ray reconstruction. Finally, we study attention methods to solve relevant use cases. All the experiments are conducted in the context of single telescope analysis for the Cherenkov Telescope Array data analysis.

Paper Nr: 159
Title:

GAPF: Curve Text Detection based on Generative Adversarial Networks and Pixel Fluctuations

Authors:

Jun Yang, Zhaogong Zhang and Xuexia Wang

Abstract: Scene text detection has witnessed rapid progress, especially with the recent development of convolutional neural networks. However, curved text detection is still a difficult problem that has not been addressed sufficiently. Presently, the most advanced methods detect curved text based on segmentation. However, most segmentation algorithms based on convolutional neural networks suffer from inaccurate segmentation results. In order to improve the quality of image segmentation, we propose a semantic segmentation network model based on generative adversarial networks and pixel fluctuations, denoted GAPF, which effectively improves the accuracy of text segmentation. The model consists of two parts: a generative model and a discriminative model. The main function of the generative model is to generate a semantic segmentation map; the discriminative model and generative model then perform adversarial learning, which optimizes the generative model to make the generated image closer to the ground truth. In this paper, pixel fluctuation counts are fed into the generative network as the segmentation condition to enhance invariance to translation and rotation. Finally, a text boundary generation algorithm is designed, and the final detection result is obtained from the segmentation result. Experimental results on CTW1500, Total-Text, ICDAR 2015 and MSRA-TD500 demonstrate the effectiveness of our work.

Paper Nr: 161
Title:

Automated Infant Monitoring based on R-CNN and HMM

Authors:

Cheng Li, A. Pourtaherian, L. van Onzenoort and P. H. N. de With

Abstract: Manual monitoring of young infants suffering from reflux is a significant effort, since infants can hardly articulate their feelings. This work proposes a near real-time video-based infant monitoring system for the analysis of infant expressions. The discomfort moments can be correlated with a reflux measurement for gastroesophageal reflux disease diagnosis. The system consists of two components: expression classification and expression state stabilization. The expression classification is realized by Faster R-CNN and the state stabilization is implemented with a Hidden Markov Model. The experimental results show a mean average precision of 82.3% and 83.4% for 7 different expression classes, and up to 90% for discomfort detection, evaluated on both clinical and daily datasets. Moreover, when adopting temporal analysis, the false expression changes between frames can be reduced by up to 65%, which significantly enhances the consistency of the system output.
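The state stabilization component can be illustrated with a small Viterbi decoder over per-frame class posteriors using a "sticky" transition prior (a stand-in sketch; the paper's HMM parameters are not given in the abstract, so the values below are assumptions):

```python
import math

def viterbi_smooth(frame_probs, stay=0.9):
    """Smooth per-frame class posteriors with a sticky HMM prior: staying in
    the current state costs less than switching, so spurious single-frame
    expression flips are removed from the decoded sequence."""
    n_states = len(frame_probs[0])
    switch = (1.0 - stay) / (n_states - 1)
    log = lambda p: math.log(max(p, 1e-12))
    score = [log(p) for p in frame_probs[0]]
    back = []
    for probs in frame_probs[1:]:
        step, new_score = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log(stay if r == s else switch))
            step.append(best_prev)
            new_score.append(score[best_prev]
                             + log(stay if best_prev == s else switch)
                             + log(probs[s]))
        back.append(step)
        score = new_score
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for step in reversed(back):
        state = step[state]
        path.append(state)
    return path[::-1]

# A single-frame flip toward class 1 in an otherwise class-0 sequence
# is smoothed away by the sticky prior.
probs = [[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]
print(viterbi_smooth(probs))  # [0, 0, 0, 0]
```

Per-frame argmax on the same sequence would flip to class 1 at the third frame; the HMM decoding keeps the output consistent.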

Paper Nr: 168
Title:

On Power Jaccard Losses for Semantic Segmentation

Authors:

David Duque-Arias, Santiago Velasco-Forero, Jean-Emmanuel Deschaud, François Goulette, Andres Serna, Etienne Decencière and Beatriz Marcotegui

Abstract: In this work, a new generalized loss function, called power Jaccard, is proposed to perform semantic segmentation tasks. It is compared with classical loss functions in different scenarios, including gray-level and color image segmentation, as well as 3D point cloud segmentation. The results show improved performance, stability, and convergence. We make the code for our proposal available, together with a demonstrative example.
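The abstract does not spell out the loss; one plausible formulation, shown purely as an assumption rather than the paper's definition, raises the denominator terms of the soft Jaccard loss to a power p (with p = 1 recovering the usual soft Jaccard):

```python
def power_jaccard_loss(y_true, y_pred, p=2.0, eps=1e-7):
    """Assumed power Jaccard form: soft Jaccard loss with the denominator
    terms raised to the power p. For p = 1 this is the classic soft
    Jaccard (IoU) loss."""
    inter = sum(t * q for t, q in zip(y_true, y_pred))
    denom = sum(t ** p + q ** p for t, q in zip(y_true, y_pred)) - inter
    return 1.0 - inter / (denom + eps)

# A perfect binary prediction gives (near-)zero loss for any p.
print(round(power_jaccard_loss([1, 0, 1], [1.0, 0.0, 1.0], p=2.0), 6))  # 0.0
```

In a segmentation network the inputs would be flattened ground-truth masks and sigmoid/softmax outputs, summed per batch.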

Paper Nr: 169
Title:

3D Object Detection with Normal-map on Point Clouds

Authors:

Jishu Miao, Tsubasa Hirakawa, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: In this paper, we propose a novel point-cloud-based 3D object detection method for achieving higher accuracy in autonomous driving. Different types of objects on the road have different shapes. A LiDAR sensor can provide a point cloud of more than ten thousand points reflected from object surfaces in one frame. Recent studies show that hand-crafted features directly extracted from point clouds can achieve good detection accuracy. The proposed method employs YOLOv4 as the feature extractor and adds a normal map as an additional input. Our normal map is a three-channel bird’s-eye-view image retaining detailed object surface normals. It enriches the input with spatial shape information and can easily be combined with other hand-crafted features. In an experiment on the KITTI 3D object detection dataset, it performs better than conventional methods. Our method achieves higher-precision 3D object detection and is less affected by distance. It has excellent yaw angle predictability, especially for cylindrical objects like pedestrians, even though it omits the intensity information.

Paper Nr: 175
Title:

Feature Sharing Cooperative Network for Semantic Segmentation

Authors:

Ryota Ikedo and Kazuhiro Hotta

Abstract: In recent years, deep neural networks have achieved high accuracy in the field of image recognition. Inspired by human learning methods, we propose a semantic segmentation method using cooperative learning, which shares information in a manner resembling group learning. We use two identical networks with paths for sending feature maps between them, and train the two networks simultaneously. By sharing feature maps, each network can obtain information that cannot be obtained by a single network. In addition, in order to enhance the degree of cooperation, we propose two connection schemes: one connecting only corresponding layers and one connecting multiple layers. We evaluated the proposed idea on two kinds of networks: the Dual Attention Network (DANet) and DeepLabv3+. The proposed method achieved better segmentation accuracy than a conventional single network and an ensemble of networks.

Paper Nr: 177
Title:

Analysing Adversarial Examples for Deep Learning

Authors:

Jason Jung, Naveed Akhtar and Ghulam M. Hassan

Abstract: The aim of this work is to investigate adversarial examples and look for commonalities and disparities between different adversarial attacks and the behaviours of attacked classifier models. The research focuses on untargeted, gradient-based attacks. The experiment uses 16 attacks on 4 models and 1,000 images, resulting in 64,000 adversarial examples. The resulting classification predictions of the adversarial examples (adversarial labels) are analysed. It is found that light-weight neural network classifiers are more susceptible to attacks than models with a larger or more complex architecture. It is also observed that similar adversarial attacks against a light-weight model often result in the same adversarial label. Moreover, the attacked models have more influence over the resulting adversarial label than the adversarial attack algorithm itself. These findings are helpful in understanding the intriguing vulnerability of deep learning to adversarial examples.

Paper Nr: 185
Title:

SCAN: Sequence-character Aware Network for Text Recognition

Authors:

Heba Hassan, Marwan Torki and Mohamed E. Hussein

Abstract: Text recognition continues to be a challenging problem in the context of text reading in natural scenes. Bearing in mind the sequential nature of text, the problem is usually posed as a sequence prediction problem from a whole-word image. Alternatively, it can also be posed as a character prediction problem. The latter approach is typically more robust to challenging word shapes. Attempting to find the sweet spot that attains the best of the two approaches, we propose the Sequence-Character Aware Network (SCAN). SCAN starts by locating and recognizing the characters, and then generates the word using a sequence-based approach. It comprises two modules: a semantic-segmentation-based character prediction module, and an encoder-decoder network for word generation. The training is done over two stages. In the first stage, we adopt a multi-task training technique with both character-level and word-level losses and trainable loss weighting. In the second stage, the character-level loss is removed, enabling the use of data with only word-level annotations. Experiments are conducted on several datasets for both regular and irregular text, showing state-of-the-art performance of the proposed approach. They also show that the proposed approach is robust against noisy word detection.

Paper Nr: 190
Title:

Supervised versus Self-supervised Assistant for Surveillance of Harbor Fronts

Authors:

Jinsong Liu, Mark P. Philipsen and Thomas B. Moeslund

Abstract: Drowning in harbors and along waterfronts is a serious problem, worsened by the challenge of achieving timely rescue efforts. To address this problem, we propose a privacy-friendly assistant surveillance system for identifying potentially hazardous situations (human activities near the water’s edge) in order to give early warning. This will allow lifeguards and first responders to react proactively with a basis in accurate information. In order to achieve this, we develop and compare two vision-based solutions. One is a supervised approach based on a popular object detection framework, which allows us to detect humans in a defined area near the water’s edge. The other is a self-supervised approach where anomalies are detected based on the reconstruction error of an autoencoder. To best comply with privacy requirements, both solutions rely on thermal imaging captured in an active harbor environment. Using a dataset containing both safe and risky scenes, the two solutions are evaluated and compared, showing that the detector-based method performs better, while the autoencoder-based method has the benefit of not requiring expensive annotations.

Paper Nr: 191
Title:

Segment My Object: A Pipeline to Extract Segmented Objects in Images based on Labels or Bounding Boxes

Authors:

Robin Deléarde, Camille Kurtz, Philippe Dejean and Laurent Wendling

Abstract: We propose a pipeline (SegMyO – Segment my object) to automatically extract segmented objects in images based on given labels and/or bounding boxes. When the expected label is provided, our system looks for the closest label in the list of outputs, using a measure of semantic similarity. When the bounding box is provided, it looks for the output object with the best coverage, based on several geometric criteria. Associated with a semantic segmentation model trained on a similar dataset, or a good region proposal algorithm, this pipeline provides a simple solution for efficiently segmenting a dataset without requiring specific training, and also addresses the problem of weakly supervised segmentation. This is particularly useful for segmenting public datasets available with weak object annotations (e.g., bounding boxes and labels from a detector, labels from a caption) coming from an algorithm or from manual annotation. An experimental study conducted on the PASCAL VOC 2012 dataset shows that these simple criteria embedded in SegMyO allow selecting the proposal with the best IoU score in most cases, and thus getting the best of the pre-segmentation.

Paper Nr: 202
Title:

Embedding Human Knowledge into Deep Neural Network via Attention Map

Authors:

Masahiro Mitsuhara, Hiroshi Fukui, Yusuke Sakashita, Takanori Ogata, Tsubasa Hirakawa, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: Conventional methods for embedding human knowledge have been applied to non-deep machine learning. Meanwhile, applying them to deep learning models is challenging due to the enormous number of model parameters. In this paper, we propose a novel framework for optimizing networks while embedding human knowledge. The crucial factors are an attention map for visual explanation and an attention mechanism. A manually edited attention map, in which human knowledge is embedded, has the potential to adjust recognition results. The proposed method updates network parameters so that the output attention map corresponds to the edited one. As a result, the trained network can output an attention map that takes human knowledge into account. Experimental results on ImageNet, CUB-200-2010, and IDRiD demonstrate that it is possible to obtain a clear attention map for visual explanation and to improve the classification performance.

Paper Nr: 215
Title:

Player Identification in Different Sports

Authors:

Ahmed Nady and Elsayed E. Hemayed

Abstract: Identifying players through jersey numbers in sports videos is a challenging task. Jersey numbers can be distorted and deformed due to variations in the player’s posture and the camera’s view. Moreover, they vary in font and size across different sports fields. In this paper, we present a deep learning-based framework to address these challenges of jersey number recognition. Our framework has three main parts. First, it detects players on the court using the state-of-the-art object detector YOLOv4. Second, the jersey number is localized within each detected player bounding box. Then, a four-stage scene text recognition network is employed to recognize the detected number regions. A benchmark dataset consisting of three subsets is collected: two subsets include player images from different basketball fields, and the third includes player images from ice hockey. Experiments show that the proposed approach is effective compared to state-of-the-art jersey number recognition methods. This research makes the automation of player identification applicable across several sports.

Paper Nr: 226
Title:

Performance Benchmarking of YOLO Architectures for Vehicle License Plate Detection from Real-time Videos Captured by a Mobile Robot

Authors:

Amir Ismail, Maroua Mehri, Anis Sahbani and Najoua E. Ben Amara

Abstract: In this paper, we address the issue of vehicle license plate (LP) detection for a mobile robotic application. Specifically, we tackle the dynamic scenario of a robot interacting in the physical world based on its cameras. The robot is dedicated essentially to patrolling and securing unconstrained environments. In contrast to most recent works on LP detection, which assume a controlled deployment scenario, the mobile platform requires a more robust system suitable for various complex scenarios. To this end, we propose an end-to-end detection module capable of localizing LPs in either images or live-streaming videos. The proposed system is based on deep learning detectors, particularly the most recent YOLOv4-tiny. To evaluate the proposed system, we introduce the first-ever public Tunisian dataset for LP detection, called PGTLP, which contains 3,000 annotated images. This dataset was gathered using the security robot during its patrolling and surveillance of parking stations and high-risk areas. For the detection, a comparative study of the different YOLO variants has been carried out in order to select the best detector. Our experiments are performed on the PGTLP images following the same experimental protocol. Among the selected models, YOLOv4-tiny offers the best compromise between detection performance and complexity. Further experiments conducted on the AOLP benchmark dataset show that the proposed system achieves satisfying results.

Paper Nr: 243
Title:

Audience Shot Detection for Automatic Analysis of Soccer Sports Videos

Authors:

Kazimierz Choroś

Abstract: Automatic categorization remains a great challenge in content-based indexing of sports videos. Video archives, portals, and Web databases contain a huge amount of sports video data. One of the most significant processes in the analysis of sports news videos is the automatic recognition of the sports discipline reported in a video. Different strategies are applied: pattern frame comparison, line detection in playing fields, player detection, sports equipment detection, detection of superimposed text, and others. Usually, audience shots are processed like other non-player shots and considered not useful for video content analysis. This paper presents an approach for the automatic detection of audience shots, which are in fact useful for the automatic categorization of sports videos. Audience shots in sports videos can be considered very informative parts, helping to detect and recognize not only sports disciplines, but also nationality or club membership, as well as the emotions of supporters. The method is based on integrating the analysis of segment color histograms of video frames, shot detection, and face detection. Color histograms are applied to detect audience frames and shots. Because the dominant color alone is not an efficient criterion for audience detection, the procedure has been improved by analyzing not only single frames but sequences of frames belonging to one shot. Then, a face detection method is introduced to find the most suitable audience shots for the content analysis of sports videos. The tests have been performed on soccer sports videos.

Paper Nr: 12
Title:

Approaching the Semantic Segmentation in Medical Problems: A Solution for Pneumothorax Detection

Authors:

Călin Timbus, Vlad Miclea and Camelia Lemnaru

Abstract: We present a method for detecting and delineating pneumothorax from X-Ray medical images using a three-step processing pipeline: a deep learning classification module, responsible for detecting the possible existence of a collapsed lung within an image, followed by a segmentation model applied on the positive samples (as detected by the classification module). The last module attempts to eliminate possible artefacts based on their size. We demonstrate that the pipeline significantly improves the results, increasing the mean-Dice coefficient metric by 0.13 in comparison with the performance of a single segmentation module. In addition, we demonstrate that combining specific state-of-the-art techniques leads to improved results, without employing techniques such as dataset enrichment from external sources, semi-supervised learning or pretraining on much larger medical datasets.

Paper Nr: 36
Title:

Online Point Cloud Object Recognition System using Local Descriptors for Real-time Applications

Authors:

Yacine Yaddaden, Sylvie Daniel and Denis Laurendeau

Abstract: In the context of vehicle localization based on point cloud data collected using LiDAR sensors, several 3D descriptors might be employed to highlight the relevant information about the vehicle’s environment. However, it is still a challenging task to assess which one is the most suitable with respect to the constraint of real-time processing. In this paper, we propose a system based on classical machine learning techniques that performs recognition from point cloud data after several preprocessing steps. We compare the performance of two distinct state-of-the-art local 3D descriptors, namely Unique Shape Context and Signature of Histograms of Orientations, when combined with online learning algorithms. The proposed system also includes two distinct modes, normal and cluster, to deal with the point cloud data size, and the performance of each is evaluated. To measure the performance of the proposed system, we used a benchmark RGB-D object dataset from which we randomly selected three stratified subsets. The obtained results are promising and suggest further experimentation involving real data collected from LiDAR sensors on vehicles.

Paper Nr: 43
Title:

Unsupervised Domain Adaptation from Synthetic to Real Images for Anchorless Object Detection

Authors:

Tobias Scheck, Ana P. Grassi and Gangolf Hirtz

Abstract: Synthetic images are one of the most promising solutions to avoid high costs associated with generating annotated datasets to train supervised convolutional neural networks (CNN). However, to allow networks to generalize knowledge from synthetic to real images, domain adaptation methods are necessary. This paper implements unsupervised domain adaptation (UDA) methods on an anchorless object detector. Given their good performance, anchorless detectors are increasingly attracting attention in the field of object detection. While their results are comparable to the well-established anchor-based methods, anchorless detectors are considerably faster. In our work, we use CenterNet, one of the most recent anchorless architectures, for a domain adaptation problem involving synthetic images. Taking advantage of the architecture of anchorless detectors, we propose to adjust two UDA methods, viz., entropy minimization and maximum squares loss, originally developed for segmentation, to object detection. Our results show that the proposed UDA methods can increase the mAP from 61% to 69% with respect to direct transfer on the considered anchorless detector. The code is available: https://github.com/scheckmedia/centernet-uda.
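Of the two UDA losses the abstract mentions, entropy minimization is simple to sketch: the network is pushed towards confident predictions on unlabeled target-domain images by penalizing the entropy of its output maps. The snippet below is a generic numpy illustration of that loss, not the authors' CenterNet-specific code:

```python
import numpy as np

def entropy_minimization_loss(probs, eps=1e-8):
    """Mean per-pixel Shannon entropy of class probability maps.

    probs: array of shape (H, W, C) with class probabilities summing to 1
    along the last axis. Minimizing this term on unlabeled target-domain
    images encourages confident (low-entropy) predictions.
    """
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)  # (H, W)
    return float(entropy.mean())

# A uniform map has maximal entropy; a one-hot map has (near) zero entropy.
uniform = np.full((4, 4, 3), 1.0 / 3.0)
confident = np.zeros((4, 4, 3))
confident[..., 0] = 1.0
```

Minimizing this loss alongside the supervised source-domain loss is what "entropy minimization" refers to in the segmentation UDA literature the paper adapts.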

Paper Nr: 45
Title:

Tropical Skin Disease Classification using Connected Attribute Filters

Authors:

Fred N. Kiwanuka, Omar E. Abuelmaatti, Anang M. Amin and Brian J. Mukwaya

Abstract: Morphological connected filters operate on an image through flat zones, which comprise the largest connected components with a constant signal. These filters identify and ultimately extract whole connected components in an image without altering their boundaries and are thus shape-preserving. This is a desirable property in many image processing and analysis applications. However, due to the variability of the number of connected components, even in images of the same resolution and size, their application in classification tasks has been limited. In this study, we propose an approach that computes the shape and size features of connected components and uses these features for the classification of bacterial and viral tropical skin infections. We demonstrate the performance of the approach using gradient boosting machines and compare the results to deep learning approaches. Results show that the performance of our approach is comparable to that of a Convolutional Neural Network (CNN) based approach when trained on 1,460 images, even though the CNN was pre-trained and required augmentation to achieve that performance. Moreover, our approach is at least 56% faster than the CNN.

Paper Nr: 47
Title:

Are Image Patches Beneficial for Initializing Convolutional Neural Network Models?

Authors:

Daniel Lehmann and Marc Ebner

Abstract: Before a neural network can be trained, the network weights have to be initialized somehow. If a model is trained from scratch, current approaches for weight initialization are based on random values. In this work we examine another approach to initializing the weights of convolutional neural network models for image classification. Our approach relies on presetting the weights of convolutional layers based on information given in the training images: we use small patches extracted from the training images to preset the filters of the convolutional layers. Experiments conducted on the MNIST, CIFAR-10 and CIFAR-100 datasets show that using image patches for network initialization performs similarly to state-of-the-art initialization approaches. The advantage is that our approach is more robust with respect to the learning rate. When a suboptimal value for the learning rate is used for training, our approach performs slightly better than current approaches. As a result, information given in the training images seems to be useful for network initialization, resulting in a more robust training process.
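The patch-based presetting described above can be sketched as follows (a minimal illustration of the general idea, not the authors' code; the zero-mean/unit-norm normalization of each patch is an assumption):

```python
import numpy as np

def patches_as_filters(images, num_filters, kernel_size, rng=None):
    """Preset convolutional filters with random patches from training images.

    images: array of shape (N, H, W, C). Returns filters of shape
    (num_filters, kernel_size, kernel_size, C), each a zero-mean,
    unit-norm image patch that could replace a randomly initialized
    first-layer filter.
    """
    rng = rng or np.random.default_rng(0)
    n, h, w, c = images.shape
    filters = np.empty((num_filters, kernel_size, kernel_size, c))
    for i in range(num_filters):
        img = images[rng.integers(n)]
        y = rng.integers(h - kernel_size + 1)
        x = rng.integers(w - kernel_size + 1)
        patch = img[y:y + kernel_size, x:x + kernel_size].astype(float)
        patch -= patch.mean()                 # zero-center the patch
        norm = np.linalg.norm(patch)
        filters[i] = patch / norm if norm > 0 else patch
    return filters

imgs = np.random.default_rng(1).random((10, 28, 28, 1))
f = patches_as_filters(imgs, num_filters=8, kernel_size=5)
```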

Paper Nr: 61
Title:

Learning Joint Twist Rotation for 3D Human Pose Estimation from a Single Image

Authors:

Chihiro Nakatsuka, Jianfeng Xu and Kazuyuki Tasaka

Abstract: We consider monocular 3D human pose estimation with joint rotation representations because they are more expressive than joint location representations and better suited to some applications such as CG character control and human motion analysis in sports or surveillance. Previous approaches have encountered difficulties when estimating joint rotations with actual twist rotation around limbs. We present a novel approach to estimating joint rotations with actual twist rotations from a single image, by decomposing joint rotations into swing and twist components and handling them separately. To extract twist rotations from an image, we emphasize the joint appearances and use them effectively in our model. Our model estimates the twist angles with an average radian error of 0.14, and we show that estimating twist rotations achieves a more precise 3D human pose.

Paper Nr: 81
Title:

Multi-Stage Dynamic Batching and On-Demand I-Vector Clustering for Cost-effective Video Surveillance

Authors:

David Montero, Luis Unzueta, Jon Goenetxea, Nerea Aranjuelo, Estibaliz Loyo, Oihana Otaegui and Marcos Nieto

Abstract: In this paper, we present a cost-effective Video-Surveillance System (VSS) for face recognition and online clustering of unknown individuals at large scale. We aim to obtain Performance Indicators (PIs) for people flow monitoring in large infrastructures, without storing any biometric information. For this purpose, we focus on how to take advantage of a central GPU-enabled computing server, connected to a set of video-surveillance cameras, to automatically register new identities and update their descriptive data as they are re-identified. The proposed method comprises two main procedures executed in parallel. A Multi-Stage Dynamic Batching (MSDB) procedure efficiently extracts facial identity vectors (i-vectors) from captured images. At the same time, an On-Demand I-Vector Clustering (ODIVC) procedure clusters the i-vectors into identities. This clustering algorithm is designed to progressively adapt to the increasing data scale, with a lower decrease in its effectiveness compared to other alternatives. Experimental results show that ODIVC achieves state-of-the-art results in well-known large scale datasets and that our VSS can detect, recognize and cluster in real time faces coming from up to 40 cameras with a central off-the-shelf GPU-enabled computing server.

Paper Nr: 88
Title:

Road Lane Detection and Classification in Urban and Suburban Areas based on CNNs

Authors:

Nima Khairdoost, Steven S. Beauchemin and Michael A. Bauer

Abstract: Road lane detection systems play a crucial role in the context of Advanced Driver Assistance Systems (ADASs) and autonomous driving. Such systems can lessen road accidents and increase driving safety by alerting the driver in risky traffic situations. Additionally, the detection of ego lanes with their left and right boundaries, along with the recognition of their types, is of great importance as they provide contextual information. Lane detection is a challenging problem since road conditions and illumination vary while driving. In this contribution, we investigate the use of a CNN-based regression method for detecting ego lane boundaries. After the lane detection stage, following a projective transformation, the classification stage is performed with a ResNet101 network to verify the detected lanes or a possible road boundary. We applied our framework to real images collected during drives in an urban area with the RoadLAB instrumented vehicle. Our experimental results show that our approach achieves promising results in the detection stage and an accuracy of 94.52% in the lane classification stage.

Paper Nr: 94
Title:

Line2depth: Indoor Depth Estimation from Line Drawings

Authors:

Pavlov Sergey, Kanamori Yoshihiro and Endo Yuki

Abstract: Depth estimation from scenery line drawings has a number of applications, such as in painting software and 3D modeling. However, it has not received much attention because of the inherent ambiguity of line drawings. This paper proposes the first CNN-based method for estimating depth from single line drawings of indoor scenes. First, to combat the ambiguity of line drawings, we enrich the input line drawings by hallucinating colors, rough depth, and normal maps using a conditional GAN. Next, we obtain the final depth maps from the hallucinated data and input line drawings using a CNN for depth estimation. Our qualitative and quantitative evaluations demonstrate that our method works significantly better than conventional photo-aimed methods trained only with line drawings. Additionally, we confirmed that our results with hand-drawn indoor scenes are promising for use in practical applications.

Paper Nr: 115
Title:

LAMV: Learning to Predict Where Spectators Look in Live Music Performances

Authors:

Arturo Fuentes, F. J. Sánchez, Thomas Voncina and Jorge Bernal

Abstract: The advent of artificial intelligence has brought an evolution in how many daily work tasks are performed. The analysis of cultural content has seen a huge boost from the development of computer-assisted methods that allow easy and transparent data access. In our case, we deal with the automation of the production of live shows, like music concerts, aiming to develop a system that can indicate to the producer which camera to show based on what each camera is capturing. In this context, we consider it essential to understand where spectators look and what they are interested in, so that the computational method can learn from this information. The work presented here shows the results of a first preliminary study in which we compare areas of interest defined by human beings with those indicated by an automatic system. Our system is based on the extraction of motion textures from dynamic Spatio-Temporal Volumes (STV), whose patterns are then analyzed by means of texture analysis techniques. We validate our approach over several video sequences that have been labeled by 16 different experts. Our method is able to match the relevant areas identified by the experts, achieving recall scores higher than 80% when a distance of 80 pixels between method and ground truth is allowed. Current performance shows promise for detecting abnormal peaks and movement trends.

Paper Nr: 153
Title:

Convolutional Neural Networks with Fixed Weights

Authors:

Tyler C. Folsom

Abstract: Improved computational power has enabled artificial neural networks to achieve great success through deep learning. However, visual classification is brittle; networks can be easily confused when a small amount of noise is added to an image. This position paper raises the hypothesis that using all the pixels of an image is wasteful of resources and unstable. Biological neural networks achieve greater success, and the outline of their architecture is well understood and reviewed in this paper. It would behove deep learning network architectures to take additional inspiration from biology to reduce the dimensionality of images and video. Pixels strike the retina, but are convolved before they get to the brain. It has been demonstrated that a set of five filters retains key visual information while achieving compression by an order of magnitude. This paper presents those filters. We propose that images should be pre-processed with a fixed weight convolution that mimics the filtering performed in the retina and primary visual cortex. Deep learning would then be applied to the smaller filtered image.

Paper Nr: 155
Title:

Extracting Accurate Long-term Behavior Changes from a Large Pig Dataset

Authors:

Luca Bergamini, Stefano Pini, Alessandro Simoni, Roberto Vezzani, Simone Calderara, Rick B. D’Eath and Robert B. Fisher

Abstract: Visual observation of uncontrolled real-world behavior leads to noisy observations, complicated by occlusions, ambiguity, variable motion rates, detection and tracking errors, slow transitions between behaviors, etc. We show in this paper that reliable estimates of long-term trends can be extracted given enough data, even though estimates from individual frames may be noisy. We validate this concept using a new public dataset of more than 20 million daytime pig observations over 6 weeks of their main growth stage, and we provide annotations for various tasks including 5 individual behaviors. Our pipeline chains detection, tracking and behavior classification, combining deep and shallow computer vision techniques. While individual detections may be noisy, we show that long-term behavior changes can still be extracted reliably, and we validate these results qualitatively on the full dataset. Ultimately, starting from raw RGB video data, we are able to tell both what the pigs’ main daily activities are and how these change over time.

Paper Nr: 183
Title:

U-Net based Zero-hour Defect Inspection of Electronic Components and Semiconductors

Authors:

Florian Kälber, Okan Köpüklü, Nicolas Lehment and Gerhard Rigoll

Abstract: Automated visual inspection is a popular way of detecting many kinds of defects in PCBs and electronic components without intervening in the manufacturing process. In this work, we present a novel approach for anomaly detection in PCBs, where a U-Net architecture performs binary anomalous region segmentation and the DBSCAN algorithm detects and localizes individual defects. At training time, reference images are needed to create annotations of anomalous regions, whereas at test time reference images are no longer needed. The proposed approach is validated on the DeepPCB dataset and our internal chip defect dataset. We have achieved mean Intersection over Union (mIoU) scores of 0.80 and 0.75 on the DeepPCB and chip defect datasets, respectively, which demonstrates the effectiveness of the proposed approach. Moreover, optimized and reduced models with computational costs lower than one giga FLOP achieve mIoU scores of 0.65 and above, justifying the suitability of the proposed approach for embedded and potentially real-time applications.

Paper Nr: 203
Title:

Temporal Bilinear Encoding Network of Audio-visual Features at Low Sampling Rates

Authors:

Feiyan Hu, Eva Mohedano, Noel O’Connor and Kevin Mcguinness

Abstract: Current deep learning based video classification architectures are typically trained end-to-end on large volumes of data and require extensive computational resources. This paper aims to exploit audio-visual information in video classification with a 1 frame per second sampling rate. We propose Temporal Bilinear Encoding Networks (TBEN) for encoding both audio and visual long range temporal information using bilinear pooling, and demonstrate that bilinear pooling is better than average pooling on the temporal dimension for videos with a low sampling rate. We also embed the label hierarchy in TBEN to further improve the robustness of the classifier. Experiments on the FGA240 fine-grained classification dataset using TBEN achieve a new state-of-the-art (hit@1=47.95%). We also explore the possibility of combining TBEN with multiple decoupled modalities like visual semantic and motion features: experiments on UCF101 sampled at 1 FPS achieve close to state-of-the-art accuracy (hit@1=91.03%) while requiring significantly less computational resources than competing approaches for both training and prediction.
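Bilinear pooling over the temporal dimension, as contrasted with average pooling above, can be sketched as follows (a generic illustration of the technique, not the authors' TBEN code; the signed square root and L2 normalization are common post-processing steps and are assumed here):

```python
import numpy as np

def temporal_bilinear_pooling(feats):
    """Bilinear-pool a sequence of frame features over time.

    feats: (T, D) array of per-frame features (e.g. sampled at 1 fps).
    Returns a D*D vector: the time-averaged outer product of the features
    (second-order statistics), followed by signed square root and L2
    normalization. Average pooling, by contrast, would keep only the
    first-order mean over T.
    """
    t, d = feats.shape
    pooled = feats.T @ feats / t                 # (D, D) second-order stats
    vec = pooled.reshape(-1)
    vec = np.sign(vec) * np.sqrt(np.abs(vec))    # signed sqrt
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

x = np.random.default_rng(0).standard_normal((16, 8))
v = temporal_bilinear_pooling(x)
```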

Paper Nr: 205
Title:

S*ReLU: Learning Piecewise Linear Activation Functions via Particle Swarm Optimization

Authors:

Mina Basirat and Peter M. Roth

Abstract: Recently, it has been shown that properly parametrized Leaky ReLU (LReLU) as an activation function yields significantly better results for a variety of image classification tasks. However, such methods are not feasible in practice. Either the only parameter (i.e., the slope of the negative part) needs to be set manually (L*ReLU), or the approach is vulnerable due to the gradient-based optimization and, thus, highly dependent on a proper initialization (PReLU). In this paper, we would like to exploit the benefits of piecewise linear functions, avoiding these problems. To this end, we propose a fully automatic approach to estimate the slope parameter for LReLU from the data. We realize this via Stochastic Optimization, namely Particle Swarm Optimization (PSO): S*ReLU. In this way, we can show that, compared to widely-used activation functions (including PReLU), better results can be obtained on seven different benchmark datasets. Moreover, the results even match those of L*ReLU, where the optimal parameter is estimated in a brute-force manner. In this way, our fully-automatic approach allows for drastically reducing the computational effort.
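The slope search described above can be sketched with a minimal one-dimensional PSO. This is an illustrative stand-in: the toy objective below is a hypothetical quadratic, whereas the paper would train and validate a network for each candidate slope; the inertia and acceleration constants are standard textbook values, not the paper's settings:

```python
import numpy as np

def pso_slope_search(val_loss, n_particles=8, n_iters=20, bounds=(0.0, 1.0),
                     w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal 1-D Particle Swarm Optimization over the LReLU negative slope.

    val_loss: callable mapping a candidate slope to a validation loss.
    Each particle tracks its personal best; all particles are also pulled
    towards the global best found so far.
    """
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, n_particles)
    vel = np.zeros(n_particles)
    pbest = pos.copy()
    pbest_f = np.array([val_loss(p) for p in pos])
    gbest = pbest[np.argmin(pbest_f)]
    for _ in range(n_iters):
        r1, r2 = rng.random(n_particles), rng.random(n_particles)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        f = np.array([val_loss(p) for p in pos])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = pos[improved], f[improved]
        gbest = pbest[np.argmin(pbest_f)]
    return gbest

# Toy objective with a known optimum at slope = 0.25.
best = pso_slope_search(lambda s: (s - 0.25) ** 2)
```

Because PSO is gradient-free, it avoids the initialization sensitivity the abstract attributes to PReLU's gradient-based slope learning.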

Paper Nr: 229
Title:

Deep Emotion Recognition through Upper Body Movements and Facial Expression

Authors:

Chaudhary A. Ilyas, Rita Nunes, Kamal Nasrollahi, Matthias Rehm and Thomas B. Moeslund

Abstract: Despite recent significant advancements in the field of human emotion recognition, combining upper body movements with facial expressions presents severe challenges in the field of human-robot interaction. This article presents a model that learns emotions through upper body movements and their correspondence with facial expressions. Once this correspondence is mapped, tasks such as emotion and gesture recognition can easily be performed using facial features and movement vectors. Our method uses a deep convolutional neural network trained on benchmark datasets exhibiting various emotions and corresponding body movements. Features obtained from facial movements and body motion are fused for emotion recognition. We have implemented various fusion methodologies to integrate multimodal features for non-verbal emotion identification. Our system achieves 76.8% emotion recognition accuracy through upper body movements alone, surpassing the previous 73.1% on the FABO dataset. In addition, employing multimodal compact bilinear pooling with temporal information surpassed the state-of-the-art method with an accuracy of 94.41% on the FABO dataset. This system can lead to better human-machine interaction by enabling robots to recognize emotions and body actions and react accordingly, thus enriching the user experience.

Paper Nr: 231
Title:

Efficient Multi-task based Facial Landmark and Gesture Detection in Monocular Images

Authors:

Jon Goenetxea, Luis Unzueta, Unai Elordi, Oihana Otaegui and Fadi Dornaika

Abstract: Communication between persons involves several channels for exchanging information between individuals. Non-verbal communication carries valuable information about the context of the conversation and is a key element for understanding the entire interaction. Facial expressions are a representative example of this kind of non-verbal communication and a valuable element for improving human-machine interaction interfaces. Using images captured by a monocular camera, automatic facial analysis systems can extract facial expressions to improve human-machine interactions. However, there are several technical factors to consider, including possible computational limitations (e.g. autonomous robots) or data throughput (e.g. a centralized computation server). Considering these possible limitations, this work presents an efficient method to detect a set of 68 facial feature points and a set of key facial gestures at the same time. The output of this method includes valuable information for understanding the context of the communication and improving the response of automatic human-machine interaction systems.

Area 4 - Applications and Services

Full Papers
Paper Nr: 24
Title:

Evaluation of Knee Implant Alignment using Radon Transformation

Authors:

Guillaume Pascal, Andreas Møgelmose and Andreas Kappel

Abstract: In this paper we present a method for automatically computing the angles between bones and implants after knee replacement surgery (Total Knee Arthroplasty, TKA), along with the world’s first public dataset of TKA radiographs, complete with ground-truth angle annotations. We use the Radon transform to determine the angles of the relevant bones and implants, and obtain 94.9% of measurements within 2°. This beats the current state of the art by 2.9%. The system is thus ready to be used to assist surgeons and replace time-consuming and observer-dependent manual measurements.
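[Editor's note] The abstract names the Radon transform but not the pipeline; the underlying principle, that a straight structure stands out when the image is projected along the right direction, can be illustrated with a minimal NumPy sketch. The point-set simplification and all names below are ours, not the paper's method.

```python
import numpy as np

def line_angle(points, n_angles=180):
    """Estimate a line's orientation from its pixel coordinates.

    Projects the points onto candidate normal directions (the core idea
    behind finding a peak in a Radon sinogram): when the direction is the
    line's normal, all points project to nearly the same offset, so the
    spread of offsets is minimal. Returns the angle in degrees, in [0, 180).
    """
    xs, ys = points[:, 0].astype(float), points[:, 1].astype(float)
    best_theta, best_spread = 0.0, np.inf
    for theta in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        t = xs * np.cos(theta) + ys * np.sin(theta)  # offsets along normal
        spread = t.var()
        if spread < best_spread:
            best_theta, best_spread = theta, spread
    # The line itself is perpendicular to the best normal direction.
    return (np.degrees(best_theta) + 90.0) % 180.0

# Synthetic "implant axis": pixels along a line at 30 degrees.
s = np.linspace(0, 100, 200)
pts = np.stack([s * np.cos(np.radians(30)), s * np.sin(np.radians(30))],
               axis=1)
angle = line_angle(pts)
```

On radiographs the same idea is applied to the full image via the Radon transform, whose sinogram peaks at the orientations of elongated bone and implant edges.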

Paper Nr: 166
Title:

SALT: A Semi-automatic Labeling Tool for RGB-D Video Sequences

Authors:

Dennis Stumpf, Stephan Krauß, Gerd Reis, Oliver Wasenmüller and Didier Stricker

Abstract: Large labeled data sets are one of the essential foundations of modern deep learning techniques. Therefore, there is an increasing need for tools that allow labeling large amounts of data as intuitively as possible. In this paper, we introduce SALT, a tool to semi-automatically annotate RGB-D video sequences to generate 3D bounding boxes for full six-Degrees-of-Freedom (DoF) object poses, as well as pixel-level instance segmentation masks for both RGB and depth. Besides bounding box propagation through various interpolation techniques, as well as algorithmically guided instance segmentation, our pipeline also provides built-in pre-processing functionalities to facilitate the data set creation process. By making full use of SALT, annotation time can be reduced by a factor of up to 33.95 for bounding box creation and 8.55 for RGB segmentation, without compromising the quality of the automatically generated ground truth.

Paper Nr: 197
Title:

Quantitative Analysis of Skin using Diffuse Reflectance for Non-invasive Pigments Detection

Authors:

Shiwei Li, Mohsen Ardabilian and Abdelmalek Zine

Abstract: Skin diagnosis has become a significant research topic in biomedical engineering and informatics, since many conditions or symptoms of diseases, such as melanoma and jaundice, are indicated by skin appearance. In the past, an invasive method (i.e. biopsy) was widely used for pathological diagnosis, removing a small amount of living tissue. Recently, non-invasive methods based on diffuse reflectance have been studied for recovering information about the inner skin. With the development of machine learning techniques, non-invasive methods can be further improved in many aspects, such as speed and accuracy. Our research focuses on analyzing and improving non-invasive skin pigment detection using neural networks. The relation between skin pigment content and skin diffuse reflectance has been studied. Moreover, the computation has been accelerated significantly by using the inverse-mapping neural network instead of the forward-mapping one. The results show that our proposed method obtains favorable results, compared to Monte Carlo simulations, in estimating melanin content, blood content, and oxygen saturation from synthetic skin diffuse reflectance for lightly, moderately, and darkly pigmented skin types alike. Our method also performs well in a second validation using a measured skin reflectance database from the National Institute of Standards and Technology.

Paper Nr: 210
Title:

3D Fetal Face Reconstruction from Ultrasound Imaging

Authors:

Antònia Alomar, Araceli Morales, Kilian Vellvé, Antonio R. Porras, Fatima Crispi, Marius G. Linguraru, Gemma Piella and Federico Sukno

Abstract: The fetal face contains essential information in the evaluation of congenital malformations and the fetal brain function, as its development is driven by genetic factors at early stages of embryogenesis. Three-dimensional ultrasound (3DUS) can provide information about the facial morphology of the fetus, but its use for prenatal diagnosis is challenging due to imaging noise, fetal movements, limited field-of-view, low soft-tissue contrast, and occlusions. In this paper, we propose a fetal face reconstruction algorithm from 3DUS images based on a novel statistical morphable model of newborn faces, the BabyFM. We test the feasibility of using newborn statistics to accurately reconstruct fetal faces by fitting the regularized morphable model to the noisy 3DUS images. The algorithm is capable of reconstructing the whole facial morphology of babies from one or several ultrasound scans to handle adverse conditions (e.g. missing parts, noisy data), and it has the potential to aid in-utero diagnosis for conditions that involve facial dysmorphology.

Short Papers
Paper Nr: 55
Title:

A Lightweight Secure Image Super Resolution using Network Coding

Authors:

Quoc-Tuan Vien, Tuan T. Nguyen and Huan X. Nguyen

Abstract: Images play an important part in our daily life. They convey our personal stories and preserve meaningful objects, events, emotions, etc. People, therefore, mostly use images as visual information in their communication with each other. Data size and privacy are, however, two important concerns when transmitting data over networks such as the Internet: transmission time grows as the amount of data increases, and private data risks exposure when captured and accessed by unauthorized people. In this paper, we introduce a unified framework, namely Deep-NC, to address these problems seamlessly. Our method contains three important components: the first, adopted from Random Linear Network Coding (RLNC), protects the shared private image from eavesdroppers; the second removes noise introduced into the image data by transmission over wireless media; and the third, utilising Image Super-Resolution (ISR) with Deep Learning (DL), recovers high-resolution images from the low-resolution ones produced by image size reduction. This is a general framework in which each component can be enhanced by sophisticated methods. Simulation results show that a gain of up to 32 dB in terms of Peak Signal-to-Noise Ratio (PSNR) can be obtained when the eavesdropper has no knowledge of the parameters and the reference image used in the mixing schemes. Various impacts of the method are evaluated in depth to show its effectiveness in securing transmitted images. Furthermore, the original image is shown to be downscalable to a much lower resolution, saving significant transmission bandwidth with negligible performance loss.

Paper Nr: 107
Title:

Multi-level Quality Assessment of Retinal Fundus Images using Deep Convolution Neural Networks

Authors:

Satya M. Muddamsetty and Thomas B. Moeslund

Abstract: Retinal fundus image quality assessment is one of the major steps in screening for retinal diseases, since poor-quality retinal images do not allow an accurate medical diagnosis. In this paper, we first introduce a large multi-level Retinal Fundus Image Quality Assessment (RFIQA) dataset. It has six levels of quality grades, based on the regions ophthalmologists consider important for diagnosing diabetic retinopathy (DR), Age-related Macular Degeneration (AMD) and glaucoma. Second, we propose a Convolutional Neural Network (CNN) model to assess the quality of retinal images with far fewer parameters than existing deep CNN models. Finally, we propose to combine deep and generic texture features using a Random Forest classifier. Experiments show that combining both deep and generic features outperforms using either feature type in isolation. This is confirmed on our new dataset as well as on other public datasets.

Paper Nr: 114
Title:

Supporting Detection of Near and Far Pedestrians in a Collision Prediction System

Authors:

Lucas S. Cambuim and Edna Barros

Abstract: This paper proposes a multi-window-based detector to locate both near and distant pedestrians. This detector is introduced into a pedestrian collision prediction (PCP) system. We developed an evaluation strategy for the proposed PCP system based on a synthetic collision database, which allowed us to analyze improvements in collision prediction quality. Results demonstrate that the combination of different window subdetectors outperforms both the individual subdetectors and a YOLO-based detector in accuracy. Since our system achieves a processing rate of 30 FPS on HD-resolution images, results demonstrated an increase in the number of scenarios in which the system could entirely avoid a collision compared to a YOLO-based system.

Paper Nr: 122
Title:

Classification of Normal versus Leukemic Cells with Data Augmentation and Convolutional Neural Networks

Authors:

José E. Maurício de Oliveira and Daniel O. Dantas

Abstract: Acute lymphoblastic leukemia is the most common childhood leukemia. It is an aggressive cancer type and causes various health problems. Diagnosis depends on manual microscopic analysis of blood samples by expert hematologists and pathologists. Image processing and pattern recognition techniques can be used to assist these professionals. This work proposes simple modifications to standard neural network architectures to achieve high performance on the malignant leukocyte classification problem. The tested architectures were VGG16, VGG19 and Xception. Data augmentation was employed to balance the training and validation sets, using transformations such as mirroring, rotation, blurring, shearing, and the addition of salt-and-pepper noise. The proposed method achieved an F1-score of 92.60%, the highest among other participants’ published results, and eighth position on the weighted F1-score provided by the competition leaderboard.

Paper Nr: 163
Title:

A User-centred AI-based Assistance System to Encounter Pandemics in Clinical Environments: A Concept Overview

Authors:

Christian Wiede, Roman Seidel, Carolin Wuerich, Damir Haskovic, Gangolf Hirtz and Anton Grabmaier

Abstract: The current coronavirus pandemic has highlighted the need for enhanced digital technologies that provide high-quality care to patients in hospitals while protecting the health and safety of the medical staff. It can also be expected that there will be second and third waves of the coronavirus pandemic, and that preparations for future pandemics must be made. In order to close this emerging gap, we propose a concept aimed at boosting the adoption of AI- and robotics-related technologies to ensure sustainable, patient-centred care in hospitals. The planned assistance system will provide continuous and safe monitoring of patients throughout the hospital environment, from the entrance to the ward, including data security and protection. The benefits consist of fast detection of possibly infected persons, continuous monitoring of patients, support by robots to reduce physical contact during epidemics, and automatic disinfection by robots. In addition to the technical challenges, the medical, social and economic challenges for such an assistance system are discussed.

Paper Nr: 3
Title:

Multi-view Real-time 3D Occupancy Map for Machine-patient Collision Avoidance

Authors:

Timothy Callemein, Kristof Van Beeck and Toon Goedemé

Abstract: Nowadays, due to advancements in technology, cooperative robots (or cobots) find their way outside the more traditional industrial context. They are used, for example, in medical scenarios during operations or the scanning of patients. Evidently, these scenarios require sufficient safety measures. In this work, we focus on the scenario of an X-ray scanner room, equipped with several cobots (mobile scanner, adjustable tabletop and wall stand), where both patients and medical staff members can walk around freely. We propose an approach to calculate a 3D safeguard zone around people that can be used to restrict the movement of the cobots and prevent collisions. For this, we rely on four ceiling-mounted cameras. The goal of this work is to develop an accurate system with minimal latency at limited hardware cost. To calculate the 3D safeguard zone, we propose to use CNN-based people detection or segmentation techniques to provide the silhouette input needed to compute a 3D visual hull. We evaluate several state-of-the-art techniques in search of the optimal trade-off between speed and accuracy. Our research shows that it is possible to achieve acceptable performance processing four cameras with a latency of 125 ms and a precision of 54% at a recall of 75%, using the YOLACT++ model.

Paper Nr: 18
Title:

AR-Bot, a Centralized AR-based System for Relocalization and Home Robot Navigation

Authors:

Matthieu Fradet, Caroline Baillard, Vincent Alleaume, Pierrick Jouet, Anthony Laurent and Tao Luo

Abstract: We describe a system for assigning navigation tasks to a self-moving robot in a domestic environment, using an Augmented Reality application running on a consumer-grade mobile phone. The system is composed of a robot, one or several mobile phones, a robot controller and a central server. The server embeds automatic processing modules for 3D scene modeling and device relocalization. The user points at a target location in the phone camera view and the robot automatically moves to the designated point. The user is assisted by AR-based visual feedback throughout the experience. The novelty of the system lies in the automatic relocalization of both the robot and the phone: they are independently located in 3D space by registration methods running on the server, hence they need neither be explicitly spatially registered to each other nor be in direct line of sight. In the paper we detail the general architecture and the different modules needed for a fully functional prototype. The proposed solution was designed to be easily extended and may be seen as a general architecture supporting intuitive AR interfaces for in-home device interactions.

Paper Nr: 29
Title:

NAND-measure: An Android App for Marker-based Spatial Measurement

Authors:

Maik Benndorf, Maximilian Jugl, Thomas Haenselmann and Martin Gaedke

Abstract: In a disaster scenario, a quick decision must be made as to whether a bridge is stable enough to be used. The natural frequencies of a bridge can provide information about its condition. The actual frequencies (e.g. measured by the acceleration sensor built into a smartphone) must be compared with the desired frequencies, which can be approximated, for example, with the Finite Element Method (FEM). Among other parameters, the FEM requires the dimensions of the bridge. Numerous spatial-measurement applications for different purposes are offered in the mobile app stores. However, most of these apps are limited to short distances of up to five meters and are not suitable for the aforementioned scenario. In this article, we present NAND-Measure, an application for spatial measurements from short distances (below one meter) up to distances of 50 meters. Two methods have been implemented: a stereoscopic approach with a single camera and an approach based on the pinhole camera model. Both methods were evaluated by taking sixty measurements over different distances. Overall, the approach based on the pinhole camera model was more accurate and showed smaller deviations.
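[Editor's note] As background for the second method, the pinhole camera model relates an object's known physical width to its apparent width in pixels; a minimal sketch follows. The function name and the example numbers are illustrative, not taken from the app.

```python
def distance_from_width(focal_px, real_width_m, observed_width_px):
    """Pinhole camera model: an object of known physical width
    real_width_m that appears observed_width_px pixels wide, imaged
    with a focal length of focal_px (expressed in pixels), lies at
    distance = focal * real_width / observed_width (in meters)."""
    return focal_px * real_width_m / observed_width_px

# A 2 m wide object imaged 100 px wide with an 800 px focal length
# lies 16 m away.
d = distance_from_width(800.0, 2.0, 100.0)
```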

Paper Nr: 118
Title:

Faster R-CNN Approach for Diabetic Foot Ulcer Detection

Authors:

Artur C. Oliveira, André Britto de Carvalho and Daniel O. Dantas

Abstract: Diabetic Foot Ulcer (DFU) is one of the major health concerns associated with diabetes. These injuries impair the patient’s quality of life, bring high costs to public health, and can even lead to limb amputation. Automatic detection tools can assist specialists in the prevention and treatment of the disease, and some machine learning methods addressing this problem have recently been presented. This article proposes the use of deep learning techniques to assist the treatment of DFUs, more specifically, the detection of ulcers in photos taken of the patient’s feet. We propose an improvement of the original Faster R-CNN using data augmentation techniques and changes in parameter settings. We used a training dataset of 2000 images of DFUs annotated by specialists. The training was validated using the Monte Carlo cross-validation technique. Our proposal achieved a mean average precision of 91.4%, an F1-score of 94.8%, and an average detection speed of 332 ms, which outperformed traditional detector implementations.

Paper Nr: 213
Title:

Quantification of Uncertainty in Brain Tumor Segmentation using Generative Network and Bayesian Active Learning

Authors:

Rasha Alshehhi and Anood Alshehhi

Abstract: Convolutional neural networks have shown great potential in medical segmentation problems, such as brain-tumor segmentation. However, little consideration has been given to generative adversarial networks and uncertainty quantification over the output images. In this paper, we use the generative adversarial network to handle limited labeled images. We also quantify the modeling uncertainty by utilizing Bayesian active learning to reduce untoward outcomes. Bayesian active learning is dependent on selecting uncertain images using acquisition functions to increase accuracy. We introduce supervised acquisition functions based on distance functions between ground-truth and predicted images to quantify segmentation uncertainty. We evaluate the method by comparing it with the state-of-the-art methods based on Dice score, Hausdorff distance and sensitivity. We demonstrate that the proposed method achieves higher or comparable performance to state-of-the-art methods for brain tumor segmentation (on BraTS 2017, BraTS 2018 and BraTS 2019 datasets).

Paper Nr: 216
Title:

AI-assisted Automated Pipeline for Length Estimation, Visual Assessment of the Digestive Tract and Counting of Shrimp in Aquaculture Production

Authors:

Yousif Hashisho, Tim Dolereit, Alexandra Segelken-Voigt, Ralf Bochert and Matthias Vahl

Abstract: Shrimp farming is a century-old practice in aquaculture production. In recent years, some improvements over traditional farming methods have been made; however, farming still involves mostly intensive manual work, which makes it neither a time- nor cost-efficient production process. Therefore, a continuous monitoring approach is required to increase the efficiency of shrimp farming. This paper proposes a pipeline for automated shrimp monitoring using deep learning and image processing methods. The automated monitoring includes length estimation, assessment of the shrimp’s digestive tract, and counting. Furthermore, a mobile system is designed for monitoring shrimp in various breeding tanks. This study shows promising results and unfolds the potential of artificial intelligence for automating shrimp monitoring.

Paper Nr: 219
Title:

On-demand Serverless Video Surveillance with Optimal Deployment of Deep Neural Networks

Authors:

Unai Elordi, Luis Unzueta, Jon Goenetxea, Estíbaliz Loyo, Ignacio Arganda-Carreras and Oihana Otaegui

Abstract: We present an approach to optimally deploy Deep Neural Networks (DNNs) in serverless cloud architectures. A serverless architecture allows running code in response to events, automatically managing the required computing resources. However, these resources have limitations in terms of execution environment (CPU only), cold starts, space, scalability, etc. These limitations hinder the deployment of DNNs, especially considering that fees are charged according to the employed resources and the computation time. Our deployment approach is comprised of multiple decoupled software layers that allow effectively managing multiple processes, such as business logic, data access, and computer vision algorithms that leverage DNN optimization techniques. Experimental results in AWS Lambda reveal its potential to build cost-effective on-demand serverless video surveillance systems.

Paper Nr: 242
Title:

A Cone Beam Computed Tomography Annotation Tool for Automatic Detection of the Inferior Alveolar Nerve Canal

Authors:

Cristian Mercadante, Marco Cipriano, Federico Bolelli, Federico Pollastri, Mattia Di Bartolomeo, Alexandre Anesi and Costantino Grana

Abstract: In recent years, deep learning has been employed in several medical fields, achieving impressive results. Unfortunately, these algorithms require a huge amount of annotated data to ensure a correct learning process. When dealing with medical imaging, collecting and annotating data can be cumbersome and expensive. This is mainly due to the nature of the data, which is often three-dimensional, and to the need for well-trained expert technicians. In maxillofacial imagery, recent works have focused on the detection of the Inferior Alveolar Nerve (IAN), since its position is of great relevance for avoiding severe injuries during surgical procedures such as third molar extraction or implant installation. In this work, we introduce a novel tool for analyzing and labeling the alveolar nerve in Cone Beam Computed Tomography (CBCT) 3D volumes.

Area 5 - Motion, Tracking and Stereo Vision

Full Papers
Paper Nr: 33
Title:

A Lightweight Real-time Stereo Depth Estimation Network with Dynamic Upsampling Modules

Authors:

Yong Deng, Jimin Xiao and Steven Z. Zhou

Abstract: Deep learning based stereo matching networks achieve great success in depth estimation from stereo image pairs. However, current state-of-the-art methods are usually computationally intensive, which prevents them from being applied in real-time scenarios or on mobile platforms with limited computational resources. To tackle this shortcoming, we propose a lightweight real-time stereo matching network for disparity estimation. Our network adopts the efficient hierarchical Coarse-To-Fine (CTF) matching scheme, which starts matching from low-resolution feature maps and then upsamples and refines the previous disparity stage by stage until full resolution. The result of any stage can be taken as output to trade off accuracy against runtime. We propose an efficient hourglass-shaped feature extractor based on the latest MobileNet V3 to extract multi-resolution feature maps from stereo image pairs. We also propose to replace the traditional upsampling in the CTF matching scheme with learning-based dynamic upsampling modules to avoid the blurring effects caused by conventional upsampling methods. Our model processes 1242 x 375 resolution images at 35-68 FPS on a GeForce GTX 1660 GPU, and outperforms all competitive baselines with comparable runtime on the KITTI 2012/2015 datasets.

Paper Nr: 38
Title:

BirdSLAM: Monocular Multibody SLAM in Bird’s-eye View

Authors:

Swapnil Daga, Gokul B. Nair, Anirudha Ramesh, Rahul Sajnani, Junaid A. Ansari and K. M. Krishna

Abstract: In this paper, we present BirdSLAM, a novel simultaneous localization and mapping (SLAM) system for the challenging scenario of autonomous driving platforms equipped with only a monocular camera. BirdSLAM tackles challenges faced by other monocular SLAM systems (such as scale ambiguity in monocular reconstruction, dynamic object localization, and uncertainty in feature representation) by using an orthographic (bird’s-eye) view as the configuration space in which localization and mapping are performed. By assuming only the height of the ego-camera above the ground, BirdSLAM leverages single-view metrology cues to accurately localize the ego-vehicle and all other traffic participants in bird’s-eye view. We demonstrate that our system outperforms prior work that uses strictly greater information, and highlight the relevance of each design decision via an ablation analysis.

Paper Nr: 113
Title:

Independently Moving Object Trajectories from Sequential Hierarchical Ransac

Authors:

Mikael Persson and Per-Erik Forssén

Abstract: Safe robot navigation in a dynamic environment requires the trajectories of each independently moving object (IMO). We present the novel and effective system Sequential Hierarchical Ransac Estimation (Shire), designed for this purpose. The system uses a stereo camera stream to find the objects and trajectories in real time. Shire detects moving objects using geometric consistency and finds their trajectories using bundle adjustment. Relying on geometric consistency allows the system to handle objects regardless of semantic class, unlike approaches based on semantic segmentation. Most Visual Odometry (VO) systems are inherently limited to a single motion by the choice of tracker. This limitation allows for efficient and robust ego-motion estimation in real time, but precludes tracking the multiple motions sought. Shire instead uses a generic tracker and achieves accurate VO and IMO estimates using track analysis. This removes the restriction to a single motion while retaining the real-time performance required for live navigation. We evaluate the system by bounding-box intersection over union and ID persistence on a public dataset collected from an autonomous test vehicle driving in real traffic. We also show the velocities of estimated IMOs. We investigate variations of the system that provide trade-offs between accuracy, performance and limitations.

Paper Nr: 204
Title:

A Benchmark for 3D Reconstruction from Aerial Imagery in an Urban Environment

Authors:

Susana Ruano and Aljosa Smolic

Abstract: This paper presents a novel benchmark to evaluate 3D reconstruction methods using aerial images in a large-scale urban scenario. In particular, it presents an evaluation of open-source state-of-the-art pipelines for image-based 3D reconstruction including, for the first time, an analysis per urban object category. Therefore, the standard evaluation presented in generalist image-based reconstruction benchmarks is extended and adapted to the city. Furthermore, our benchmark uses the densest annotated LiDAR point cloud available at city scale as ground truth and the imagery captured alongside. Additionally, an online evaluation server will be made available to the community.

Paper Nr: 218
Title:

Normalized Convolution Upsampling for Refined Optical Flow Estimation

Authors:

Abdelrahman Eldesokey and Michael Felsberg

Abstract: Optical flow is a regression task where convolutional neural networks (CNNs) have led to major breakthroughs. However, this comes at major computational cost due to the use of cost volumes and pyramidal representations. This was mitigated by producing flow predictions at a quarter of the resolution, which are upsampled using bilinear interpolation at test time. Consequently, fine details are usually lost and post-processing is needed to restore them. We propose the Normalized Convolution UPsampler (NCUP), an efficient joint upsampling approach that produces the full-resolution flow during the training of optical flow CNNs. Our approach formulates the upsampling task as a sparse problem and employs normalized convolutional neural networks to solve it. We evaluate our upsampler against existing joint upsampling approaches when trained end-to-end with a coarse-to-fine optical flow CNN (PWCNet), and we show that it outperforms all other approaches on the FlyingChairs dataset while having at least one order of magnitude fewer parameters. Moreover, we test our upsampler with a recurrent optical flow CNN (RAFT) and achieve state-of-the-art results on the Sintel benchmark with ∼6% error reduction, and on-par results on the KITTI dataset, while having 7.5% fewer parameters (see Figure 1). Finally, our upsampler shows better generalization capabilities than RAFT when trained and evaluated on different datasets.
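[Editor's note] The abstract builds on the classical normalized convolution idea for sparse signals; a minimal 1-D NumPy sketch of that principle (not the learned NCUP module itself; all names below are ours) follows. The signal is weighted by a confidence mask, filtered, and re-normalized by the filtered confidences, so missing samples do not drag the result toward zero.

```python
import numpy as np

def normalized_convolution(data, conf, kernel):
    """Normalized convolution for sparse data (1-D sketch).

    num = (data * conf) convolved with the kernel,
    den = conf convolved with the kernel;
    their ratio interpolates only from samples that actually exist.
    """
    num = np.convolve(data * conf, kernel, mode="same")
    den = np.convolve(conf, kernel, mode="same")
    return np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)

# A constant signal with missing samples (conf = 0 where unknown):
# plain convolution would dip toward zero at the gaps; normalized
# convolution recovers the constant.
data = np.array([2.0, 0.0, 2.0, 0.0, 2.0])
conf = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
kernel = np.array([1.0, 1.0, 1.0])  # simple box filter
dense = normalized_convolution(data, conf, kernel)
```

NCUP applies this confidence-weighted filtering idea with learned kernels to densify quarter-resolution flow into full-resolution flow.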

Paper Nr: 224
Title:

Investigating 3D Convolutional Layers as Feature Extractors for Anomaly Detection Systems Applied to Surveillance Videos

Authors:

Tiago S. Nazare, Rodrigo F. de Mello and Moacir A. Ponti

Abstract: Over the last few years, several strategies have been leveraged to detect unusual behavior in surveillance videos. Nonetheless, there are still few studies comparing strategies based on 3D Convolutional Neural Networks for this problem. This research gap motivated the present work, in which we investigate features from a pre-trained C3D model and the training of fully 3D-convolutional auto-encoders for automated video anomaly detection systems, comparing them with respect to anomaly detection performance and processing power demands. Additionally, we present an auto-encoder model that detects anomalous behavior based on the pixel reconstruction error. While C3D features from the first layers were shown to be both better descriptors and faster to compute, the auto-encoder achieved results comparable to the C3D while requiring less computational effort. When compared to other studies using two benchmark datasets, the proposed methods are comparable to the state of the art on the Ped2 dataset, while inferior when detecting anomalies on the Ped1 dataset. Additionally, our experimental results support the development of future 3D-CNN-based anomaly detection methods.

Paper Nr: 225
Title:

Real-time Monocular 6DoF Tracking of Textureless Objects using Photometrically-enhanced Edges

Authors:

Lucas Valença, Luca Silva, Thiago Chaves, Arlindo Gomes, Lucas Figueiredo, Lucio Cossio, Sebastien Tandel, João P. Lima, Francisco Simões and Veronica Teichrieb

Abstract: We propose a novel real-time edge-based 6DoF tracking approach for 3D rigid objects requiring just a monocular RGB camera and a CAD model with material information. The technique is aimed at low-texture or textureless, pigmented objects. It works even when under strong illumination, motion, and occlusion challenges. We show how preprocessing the model’s texture can improve tracking and apply region-based ideas like localized segmentation to improve the edge-based pipeline. This way, our technique is able to find model edges even under fast motion and in front of high-gradient backgrounds. Our implementation runs on desktop and mobile. It only requires one CPU thread per object tracked simultaneously and requires no GPU. It showcases a drastically reduced memory footprint when compared to the state of the art. To show how our technique contributes to the state of the art, we perform comparisons using two publicly available benchmarks.

Short Papers
Paper Nr: 21
Title:

Real-time and Online Segmentation Multi-target Tracking with Track Revival Re-identification

Authors:

Martin Ahrnbom, Mikael Nilsson and Håkan Ardö

Abstract: The first online segmentation multi-target tracking algorithm with reported real-time speeds is presented. Based on the popular and fast bounding-box-based tracker SORT, our method, called SORTS, is able to utilize segmentations for tracking while keeping real-time speeds. To handle occlusions, which neither SORT nor SORTS do, we also present SORTS+RReID, an optional extension which uses ReID vectors to revive lost tracks from SORTS. Despite only computing ReID vectors for 6.9% of the detections, ID switches are decreased by 45%. We evaluate on the MOTS dataset and run at 54.5 and 36.4 FPS for SORTS and SORTS+RReID respectively, while keeping 78-79% of the sMOTSA of the current state of the art, which runs at 0.3 FPS. Furthermore, we include an experiment using a faster instance segmentation method to explore the feasibility of a complete real-time detection and tracking system. Code is available: https://github.com/ahrnbom/sorts.

Paper Nr: 48
Title:

Object based Hybrid Video Compression

Authors:

Rhoda Gbadeyan and Chris Joslin

Abstract: Standard video compression techniques have provided pixel-based solutions that have achieved high compression performance. However, with new application areas such as streaming, ultra-high definition TV (UHDTV), etc., expectations of end-user applications are at an all-time high. Nevertheless, the issue of stringent memory and bandwidth optimization remains. Therefore, there is a need to further optimize the performance of standard video codecs to provide more flexibility to content providers on how to encode video. In this paper, we propose replacing pixels with objects as the unit of compression while still harnessing the advantages of standard video codecs, thereby reducing the bits required to represent a video scene while still achieving suitable visual quality in compressed videos. Test results indicate that the proposed algorithm provides a viable hybrid video coding solution for applications where pixel-level precision is not required.

Paper Nr: 147
Title:

Modeling a priori Unknown Environments: Place Recognition with Optical Flow Fingerprints

Authors:

Zachary Mueller and Sotirios Diamantas

Abstract: In this research we present a novel method for place recognition that relies on optical flow fingerprints of features. We make no assumptions about the properties of features or the environment, such as color, shape, and size, as we approach the problem parsimoniously with a single camera mounted on a robot. In the training phase of our algorithm, an accurate camera model is utilized to model and simulate the optical flow vector magnitudes with respect to velocity and distance to features. A lognormal distribution function, the result of this observation, is used as an input during the testing phase, which takes place with real sensors and features extracted using the Lucas-Kanade optical flow algorithm. With this approach we have managed to bridge the gap between simulation and real-world environments by transferring the output of simulated training data sets to real testing environments. In addition, our method is highly adaptable to different types of sensors and environments. Our algorithm is evaluated both in indoor and outdoor environments where a robot revisits places from different poses and velocities, demonstrating that modeling an unknown environment using optical flow properties is both feasible and efficient.

Paper Nr: 150
Title:

Quantifying Wind Turbine Blade Surface Roughness using Sandpaper Grit Sizes: An Initial Exploration

Authors:

Ivan Nikolov and Claus Madsen

Abstract: Surface inspection of wind turbine blades is a necessary step to ensure longevity and sustained high energy output. The detection of accumulated damage and increased surface roughness of in-use blades is one of the main objectives of inspections in the wind energy industry. Creating 3D scans of the leading edges of blade surfaces has been increasingly used for capturing the roughness profile of blades. An important part of analysing these surface 3D scans is the standardization of the captured data across different blade surfaces, types and sizes. In this paper we propose an initial exploration of using sandpaper grit sizes to provide this standardization. Sandpaper has been widely used for approximating different levels of blade surface roughness, and its standardized nature can be used to easily describe and compare blade surfaces. We reconstruct a number of different sandpaper grit sizes, from the coarser P40 to the finer P180. We extract a number of 3D surface features from them and use them to train a random forest classification method. This method is then used to segment the surfaces of wind turbine blades into areas of different surface roughness. We test our proposed solution on a variety of blade surfaces, from smooth to coarse and damaged, and show that it manages to classify them depending on their roughness.

Paper Nr: 193
Title:

A Study on the Influence of Omnidirectional Distortion on CNN-based Stereo Vision

Authors:

Julian B. Seuffert, Ana P. Grassi, Tobias Scheck and Gangolf Hirtz

Abstract: Stereo vision is one of the most prominent strategies to reconstruct a 3D scene with computer vision techniques. With the advent of Convolutional Neural Networks (CNN), stereo vision has undergone a breakthrough. An increasing number of works attempt to recover the depth information from stereo images by using CNNs. However, most of the existing approaches are developed for images captured with perspective cameras. Perspective cameras have a very limited field of view of around 60◦, and only a small portion of a scene can be reconstructed with a standard binocular stereo system. In the last decades, much effort has been devoted to the research field of omnidirectional stereo vision, which allows an almost complete scene reconstruction if the cameras are mounted on the ceiling. However, as omnidirectional images show strong distortion artifacts, most of the approaches perform an image warping to reduce the reconstruction complexity. In this work, we examine the impact of the omnidirectional image distortion on the learning process of a CNN. We compare the results of a network training with perspective and omnidirectional stereo images. For this work, we use AnyNet and a novel dataset of synthetic omnidirectional and perspective stereo images.

Paper Nr: 199
Title:

Procam Calibration from a Single Pose of a Planar Target

Authors:

Ghani O. Lawal and Michael Greenspan

Abstract: A novel user-friendly method is proposed for calibrating a procam system from a single pose of a planar chessboard target. The user simply needs to orient the chessboard in a single appropriate pose. A sequence of Gray code patterns is projected onto the chessboard, which allows correspondences between the camera, projector and chessboard to be automatically extracted. These correspondences are fed as input to a nonlinear optimization method that models the projection of the principal points onto the chessboard, and accurately calculates the intrinsic and extrinsic parameters of both the camera and the projector, as well as the camera’s distortion coefficients. The method is experimentally validated on a real procam system, which is shown to be comparable in accuracy with existing multi-pose approaches. The impact of the orientation of the chessboard with respect to the procam imaging planes is also explored through extensive simulations.

Paper Nr: 217
Title:

Object Hypotheses as Points for Efficient Multi-Object Tracking

Authors:

Shuhei Tarashima

Abstract: In most multi-object tracking (MOT) approaches under the tracking-by-detection framework, object detection and hypothesis association are addressed separately by setting bounding boxes as interfaces between them. This subdivision has yielded great advantages with respect to tracking accuracy, but it often lets researchers overlook the efficiency of whole MOT pipelines, since these interfaces can cause time-consuming data communication between CPU and GPU. Alternatively, in this work we define an object hypothesis as a keypoint representing the object center, and propose simple data association algorithms based on the spatial proximity of keypoints. Different from standard data association methods like the Hungarian algorithm, our approach can easily be run on GPU, which enables direct feed of detection results generated on GPU to our tracking module without the need for CPU-GPU data transfer. In this paper we conduct a series of experiments on the MOT16, MOT17 and MOT-Soccer datasets in order to show that (1) our tracking module is much more efficient than existing methods while achieving competitive MOTA scores, (2) our tracking module run on GPU can improve the whole MOT efficiency by reducing the overhead of CPU-GPU data transfer between detection and tracking, and (3) our tracking module can be combined with a state-of-the-art unsupervised MOT method based on joint detection and embedding and successfully improve its efficiency.