In this talk I am going to present the work we have been doing at the Computer Vision Lab of the Technical University of Munich which started as an attempt to better deal with videos (and therefore the time domain) within neural network architectures. Oddly enough, we ended up not including time at all in our proposed solutions. In the first work, we tackle the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. I will present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). OSVOS is fast and improves the state of the art by a significant margin (79.8% vs 68.0%). The second work I will present is a new CNN+LSTM architecture for camera pose regression for indoor and outdoor scenes. Contrary to most works, we make use of LSTM units on the CNN output in spatial coordinates in order to capture contextual information. This substantially enlarges the receptive field of each pixel leading to drastic improvements in localization performance. I will also present a new large-scale indoor dataset with accurate ground truth from a laser scanner.
Organizers: Joel Janai
In this talk I will present the portfolio of work we conduct in our lab. Herby, I will present three recent body of work in more detail. This is firstly our work on learning 6D Object Pose estimation and Camera localizing from RGB or RGBD images. I will show that by utilizing the concepts of uncertainty and learning to score hypothesis, we can improve the state of the art. Secondly, I will present a new approach for inferring multiple diverse labeling in a graphical model. Besides guarantees of an exact solution, our method is also faster than existing techniques. Finally, I will present a recent work in which we show that popular Auto-context Decision Forests can be mapped to Deep ConvNets for Semantic Segmentation. We use this to detect the spine of a zebrafish, in case when little training data is available.
Organizers: Aseem Behl
We propose a new computational framework for combinatorial problems arising in machine learning and computer vision. This framework is a special case of Lagrangean (dual) decomposition, but allows for efficient dual ascent (message passing) optimization. In a sense, one can understand both the framework and the optimization technique as a generalization of those for standard undirected graphical models (conditional random fields). We will make an overview of our recent results and plans for the nearest future.
Organizers: Aseem Behl
Matching between two sets arises in various areas in computer vision, such as feature point matching for 3D reconstruction, person re-identification for surveillance or data association for multi-target tracking. Most previous work focused either on designing suitable features and matching cost functions, or on developing faster and more accurate solvers for quadratic or higher-order problems. In the first part of my talk, I will present a strategy for improving state-of-the-art solutions by efficiently computing the marginals of the joint matching probability. The second part of my talk will revolve around our recent work on online multi-target tracking using recurrent neural networks (RNNs). I will mention some fundamental challenges we encountered and present our current solution.