# Abstracts Track 2021

## Area 1 - Image and Video Formation, Preprocessing and Analysis

Nr: | 12 |

Title: | ## A Minimal Model for Rotation Invariant Convolutional Neural Networks with Prediction of the Angle |

Authors: | ## Rosemberg Rodriguez Salas, Eva Dokladalova and Petr Dokladal |

Abstract: | In classification tasks, robustness against various image transformations remains a crucial property of Convolutional Neural Networks (CNNs). It can be acquired using data augmentation, but at the risk of increased training time and network size. Consequently, other ways to endow CNNs with invariance to various transformations, mainly rotations, are an intensive field of study. It is common to find that the filters in the first layers of CNNs contain rotated copies of the same filter (e.g., identical edge detectors in several orientations). We propose a network containing a bank of learnable steerable filters in the first layer. The network learns a unique basis filter and generates an ensemble of rotated copies. We organize the filter ensemble in increasing order of orientation. Each filter is then activated by an edge aligned with its orientation. This methodology allows the network to capture the angular geometric relationships in the input data. The filter bank output can then be seen as a decomposition of the input into oriented features. These features are then aligned to the vertical reference, yielding a translational feature space that is covariant with the input rotation. The result is a roto-translational feature space in which rotations of the input are encoded as translations over the depth of the feature space. We then apply a shared-weights predictor that scans each translation (hence each orientation) and outputs a set of probabilities for each one. This probability distribution contains the class information as the predictor's output and the angle information encoded in the position of the translation. The position of the maximum probability corresponds to the angle of the input example; hence, the angle is obtained without angle labels in the training set.
The prediction model that we propose shares weights across translations, giving a reduced model capable of class and angle inference with rotation-invariant properties. Rotation invariance is best tested when the network is trained with objects in the same orientation (usually upright) and validated on randomly oriented examples. Hence, we train the network with upright examples and validate it with randomly rotated examples. With this methodology, we outperform state-of-the-art results on the MNIST and CIFAR-10 datasets. On the MNIST dataset, we obtain a 0.93% error rate with 42k trainable parameters. This result reaches the state-of-the-art error rate while using at least 50% fewer parameters than other approaches. On the CIFAR-10 dataset with randomly rotated validation, we achieve a 36.41% error rate, outperforming the 55.88% error rate of previous approaches. Furthermore, the network uses 73k trainable parameters, fewer than the 130k parameters of previous methods. In all cases, we can predict the angle of the classified object. In conclusion, we obtain competitive state-of-the-art results on error rate while keeping a low-footprint network in terms of trainable parameters. Also, our network has angular prediction capabilities without angle labels in the training set. Smaller networks allow faster training, support for embedded devices, and a reduction in the energy costs involved with CNNs. |
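
The angle readout described in abstract 12 can be illustrated with a toy sketch. This is not the authors' network: a fixed dot-product scorer stands in for the learned shared-weights predictor, and the names `roll`, `predict`, and `TEMPLATES`, as well as the eight-orientation responses, are hypothetical. It only shows how scanning cyclic shifts of the orientation axis yields both a class and an angle.

```python
def roll(values, k):
    """Cyclically shift a list to the right by k positions."""
    n = len(values)
    k %= n
    return values[n - k:] + values[:n - k]


def predict(oriented_features, class_templates):
    """Scan every cyclic shift with one shared scorer.

    Returns (class_index, angle_in_degrees) for the best-scoring shift.
    """
    n = len(oriented_features)
    best = None
    for shift in range(n):
        # Undo a candidate rotation by shifting the orientation axis back.
        aligned = roll(oriented_features, -shift)
        for cls, template in enumerate(class_templates):
            # Shared weights: the same templates score every shift.
            score = sum(a * t for a, t in zip(aligned, template))
            if best is None or score > best[0]:
                best = (score, cls, shift)
    _, cls, shift = best
    return cls, shift * 360.0 / n


# Hypothetical upright responses over 8 orientations for two classes.
TEMPLATES = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0],
]

# A class-1 input rotated by 3 orientation steps (3 * 45 = 135 degrees).
rotated_input = roll(TEMPLATES[1], 3)
print(predict(rotated_input, TEMPLATES))
```

Here the best score is found at shift 3 for class 1, so the sketch reports class 1 at 135 degrees even though no angle label was used to build the templates.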

Nr: | 13 |

Title: | ## Depth Recovery from Non-uniform Haze Image |

Authors: | ## Tomoki Suzuki, Fumihiko Sakaue and Jun Sato |

Abstract: | In this paper, we propose a learning-based method for estimating scene depth from a hazy image. In particular, we show that by estimating the haze density distribution together with the scene depth, we can recover the scene depth accurately under non-uniform haze. Existing research on hazy images can be divided into two classes. The first is image dehazing, which removes haze from a hazy image. The second is depth recovery from hazy images. However, these existing studies assume that the haze density, i.e., the haze coefficient, is uniform in the scene; that is, the light attenuation depends only on the scene depth. In general, the haze density varies from place to place, and the assumption of constant haze density does not hold in real scenes. Thus, in this research, we propose a new method for estimating scene depth from non-uniform haze images. Our network consists of a U-Net for depth recovery and a VGG16 for haze density estimation. The haze density distribution is represented by a parametric function, and the VGG16 network estimates the limited number of parameters of this function. The two networks are trained simultaneously to recover both the scene depth and the haze density distribution. Since scene depth recovery and image dehazing are closely related, we adopt a cycle consistency loss that measures the difference between the input hazy image and the hazy image generated from the estimated scene depth and haze density distribution. We trained our network using synthetic haze images generated with the Beer-Lambert law, and tested it on synthetic as well as real haze images. The results show that the proposed network can estimate the scene depth and the non-uniform haze distribution accurately from an input hazy image. For the details of our method and results, please see our supplementary materials. |
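
The haze synthesis and cycle consistency check in abstract 13 can be sketched as follows. Several parts are assumptions not stated in the abstract: the airlight term of the common atmospheric scattering model, a haze density that is linear in the column index (`haze_density`), a mean-absolute-difference loss, and the availability of a clear image for re-rendering. All function names are hypothetical.

```python
import math


def haze_density(x, params):
    """Hypothetical parametric haze density: linear in the column index x."""
    b0, b1 = params
    return b0 + b1 * x


def render_haze(clear, depth, params, airlight=1.0):
    """Render a hazy image from a clear image, a depth map, and haze parameters.

    transmission = exp(-beta * depth)   (Beer-Lambert attenuation)
    hazy = clear * transmission + airlight * (1 - transmission)
    """
    hazy = []
    for clear_row, depth_row in zip(clear, depth):
        row = []
        for x, (j, d) in enumerate(zip(clear_row, depth_row)):
            t = math.exp(-haze_density(x, params) * d)
            row.append(j * t + airlight * (1.0 - t))
        hazy.append(row)
    return hazy


def cycle_consistency_loss(observed, rerendered):
    """Mean absolute difference between observed and re-rendered hazy images."""
    diffs = [
        abs(a - b)
        for obs_row, re_row in zip(observed, rerendered)
        for a, b in zip(obs_row, re_row)
    ]
    return sum(diffs) / len(diffs)


clear = [[0.2, 0.4, 0.6], [0.8, 0.5, 0.1]]
depth = [[1.0, 2.0, 3.0], [1.5, 2.5, 0.5]]
observed = render_haze(clear, depth, params=(0.1, 0.05))

# Re-rendering with the true depth and haze parameters reproduces the input...
print(cycle_consistency_loss(observed, render_haze(clear, depth, (0.1, 0.05))))
# ...while a uniform-haze guess leaves a residual.
print(cycle_consistency_loss(observed, render_haze(clear, depth, (0.1, 0.0))))
```

The second loss is nonzero because a uniform density cannot explain the spatially varying attenuation, which is the motivation for estimating the haze density distribution alongside the depth.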

## Area 2 - Mobile and Egocentric Vision for Humans and Robots

Nr: | 14 |

Title: | ## Distance Measurement in Fog using Polarized ToF Camera |

Authors: | ## Yuta Watarai, Fumihiko Sakaue and Jun Sato |

Abstract: | In this study, we propose a method for appropriate distance measurement in a scattering medium such as fog by combining a ToF camera with polarizing filters. In the proposed method, we mount polarizing filters in front of both the light projector and the camera of the ToF camera, so that the ToF camera projects polarized IR light onto the object. Following the Mie scattering model, the light scattered toward the camera by the fog remains polarized. On the other hand, the light reflected from the object's surface becomes unpolarized (natural) light. Therefore, by changing the direction of the polarizing filter mounted on the camera, we can capture images that contain the light scattered by the fog and the light reflected from the object separately. Analyzing these images makes it possible to simultaneously estimate the distance between the ToF camera and the object and the parameters related to fog concentration. Experimental results indicate that the proposed method achieves appropriate distance measurement even in scattering media such as fog. |
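
The separation idea in abstract 14 can be illustrated with an idealized intensity model. This is only a sketch under assumptions the abstract does not state: ideal linear polarizers, fully polarized scattered light, fully unpolarized reflected light (so half of it passes the analyzer at any angle), and plain intensities rather than actual ToF measurements. The function name is hypothetical.

```python
def separate_components(parallel, crossed):
    """Split two analyzer measurements into scattered and reflected light.

    Idealized model (an assumption, not the authors' formulation):
      parallel = scattered + reflected / 2   (analyzer aligned with projector)
      crossed  = reflected / 2               (analyzer rotated by 90 degrees)
    """
    reflected = 2.0 * crossed
    scattered = parallel - crossed
    return scattered, reflected


# Example: scattered = 0.3 and reflected = 0.8 give the two measurements below.
scattered, reflected = separate_components(parallel=0.7, crossed=0.4)
print(round(scattered, 6), round(reflected, 6))
```

In practice, depolarization by the fog and partially polarized surface reflections would break these exact relations, so the sketch is only a starting point for reasoning about the two filter orientations.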

## Area 3 - Image and Video Understanding

Nr: | 17 |

Title: | ## Brush Motion Estimation from Japanese Calligraphy Image Based on Multi-task Learning |

Authors: | ## Yoshiharu Yokoo, Fumihiko Sakaue and Jun Sato |

Abstract: | In this study, we propose a method for estimating the brush motion from a Japanese calligraphy image by analyzing the image. Japanese calligraphy images contain more information than ordinary handwritten characters because of the characteristics of writing with a brush. To estimate the motion of the brush, we define a temporal information image and estimate it. The temporal information image represents the time at which each pixel was drawn, so estimating it lets us recover the motion of the brush. The input calligraphy image contains various kinds of information, such as the kind of character, the stroke order, and the stroke speed transition. In this study, we investigate a method to improve the accuracy of temporal image estimation through multi-task learning, in which the temporal information image and the other kinds of information are estimated simultaneously. |
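
To make the role of the temporal information image in abstract 17 concrete, the sketch below recovers a drawing order from a small hand-made example. The representation is an assumption: a 2D list holding the drawing time of each ink pixel and `None` for background. The name `brush_trajectory` is hypothetical, and the estimation network itself is not shown.

```python
def brush_trajectory(temporal_image):
    """Return ink pixel coordinates (row, col) ordered by drawing time.

    Background pixels are marked with None and are skipped.
    """
    timed = [
        (t, r, c)
        for r, row in enumerate(temporal_image)
        for c, t in enumerate(row)
        if t is not None
    ]
    timed.sort()  # Earliest drawing time first.
    return [(r, c) for _, r, c in timed]


# A tiny stroke that runs right along the top, down the side, then left.
temporal_image = [
    [None, 0.0, 0.1],
    [None, None, 0.2],
    [0.5, 0.4, 0.3],
]
print(brush_trajectory(temporal_image))
```

Sorting the pixels by their estimated time gives the path of the brush, which is why an accurate temporal information image is sufficient to describe the brush motion.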

## Area 4 - Motion, Tracking and Stereo Vision

Nr: | 18 |

Title: | ## A Hybrid Occlusion Handling Method for Multiple Pedestrian Tracking |

Authors: | ## Bo Chen |

Abstract: | Multiple Object Tracking (MOT) aims to locate multiple objects in a video, maintain their identities, and obtain their trajectories. This research focuses on tracking multiple inter-occluded pedestrians, and its main objective is to improve tracking accuracy and speed. In the first stage, we aim to improve tracker efficiency; we therefore investigate the performance of several trackers under different occlusion states and propose a hybrid method that combines different trackers to find a trade-off between tracking speed and accuracy. |

Nr: | 15 |

Title: | ## Bundle Adjustment for Body Mounted Cameras based on Human Motion Constraint |

Authors: | ## Kanji Ito, Fumihiko Sakaue and Jun Sato |

Abstract: | In this study, we propose a motion capture system based on camera pose estimation by bundle adjustment. In this bundle adjustment, the positions of cameras mounted on various parts of the human body and the 3D scene information of the surrounding environment are estimated from the images. The method treats the cameras mounted on the human body as motion capture markers and estimates their positions and poses by bundle adjustment. In such a system, the number of available markers is smaller than in ordinary motion capture systems. To compensate for this lack of markers, we use a neural network to predict a larger number of markers from the small number of markers obtained by bundle adjustment. The neural network accurately predicts the human body posture even from a small number of markers. In addition, the neural network represents prior knowledge of the human pose, and this prior can be used for more accurate pose estimation by bundle adjustment. Experimental results show that we can estimate the locations of human body parts from only a few markers, and that the neural-network-based human pose prior improves the accuracy of pose estimation. |