Viewer centered object recognition with mobile computing

THESIS TOPIC PROPOSAL

Institute: University of Pannonia
Discipline: computer sciences
Doctoral school: Doctoral School of Information Science and Technology

Thesis supervisor: László Czúni
co-supervisor: Zoltán Kató
Location of studies: University of Pannonia, Faculty of Information Technology, Department of Electrical Engineering and Information Systems
Abbreviation of location of studies: PE


Description of the research topic:

There is considerable psychophysical support for the two-dimensional view interpolation theory of object recognition. In [Bülthoff] it is suggested that the human visual system recognizes 3D objects by interpolating between stored 2D views. In [Fang], viewpoint aftereffects likewise indicate that object-selective neurons in the human visual system can be tuned to specific viewing angles.
Viewer centered recognition methods can be considered early attempts at the recognition of 3D objects. The idea of storing only a limited number of views of a 3D object and then applying transformations to find correspondence with other views appears, for example, in [Basri], where novel views are generated as linear combinations of stored ones. Rigid objects with smooth surfaces, as well as articulated objects, can be represented this way.
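
To make the linear-combination-of-views idea concrete, the following numerical sketch (an illustration only, assuming orthographic projection, synthetic 3D points and numpy; none of it is part of the cited work) shows that the point coordinates of a novel view of a rigid object can be expressed as a linear combination of the coordinate vectors of two stored views:

import numpy as np

def rot(ax, ay):
    """Rotation by ax around X followed by ay around Y (radians)."""
    cx, sx, cy, sy = np.cos(ax), np.sin(ax), np.cos(ay), np.sin(ay)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    return Ry @ Rx

rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))            # synthetic 3D feature points of the object

def view(ax, ay):
    """Orthographic projection of the rotated object (image x and y coordinates)."""
    Q = P @ rot(ax, ay).T
    return Q[:, 0], Q[:, 1]

x0, y0 = view(0.0, 0.0)                 # stored view 1
x1, y1 = view(0.3, 0.4)                 # stored view 2
xn, yn = view(0.1, 0.9)                 # novel view to be explained

# express the novel x-coordinates as a linear combination of the stored coordinate vectors
B = np.stack([x0, y0, x1, y1, np.ones_like(x0)], axis=1)
coef, *_ = np.linalg.lstsq(B, xn, rcond=None)
print("reconstruction error:", np.linalg.norm(B @ coef - xn))   # practically zero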
Recently popular multilayer deep learning approaches discover intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change the internal parameters that compute the representation in each layer from the representation in the previous layer. While such techniques are very successful for object recognition in large databases [Szegedy], [Krizhevsky], they require tremendous processing power and memory.
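
As a rough illustration of the mechanism described above (a toy two-layer network on synthetic data, written with numpy; it is not one of the cited architectures), backpropagation pushes an error signal backwards through the layers to obtain the parameter updates:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)    # XOR-like toy labels

W1 = rng.normal(scale=0.5, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)
lr = 0.5

for step in range(2000):
    # forward pass: each layer transforms the previous layer's representation
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    # backward pass: propagate the error signal layer by layer (chain rule)
    dlogit = (p - y) / len(X)                 # gradient of the cross-entropy loss
    dW2 = h.T @ dlogit;  db2 = dlogit.sum(0)
    dh = dlogit @ W2.T * (1 - h ** 2)         # gradient through the tanh layer
    dW1 = X.T @ dh;      db1 = dh.sum(0)
    # parameter update
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

print("training accuracy:", ((p > 0.5) == y).mean())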
Handheld 3D object recognition is a difficult task due to changing viewpoints, varying 3D to 2D projections, various kinds of noise (e.g. motion blur, color distortion), and limited computational performance and memory. Local features (such as SIFT, FAST, etc.) are often used for view centred recognition. In [Noor] the underlying topological structure of an image dataset was captured as a neighborhood graph of features, and motion continuity in the query video was exploited to show that results obtained from a video sequence are much more robust than those obtained from a single image.
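
The following OpenCV sketch illustrates the basic local-feature matching step of view centred recognition (ORB is used here instead of SIFT only to keep the example dependency-free; the image file names and the database are placeholders, not part of any cited system): a query frame is matched against a small set of stored views and the object with the most consistent matches is selected.

import cv2

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)

def describe(path):
    # placeholder image files; they must exist and contain enough texture
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return orb.detectAndCompute(img, None)[1]

# stored views of known objects (object label, view index) -> descriptors
database = {("mug", 0): describe("mug_view00.png"),
            ("mug", 1): describe("mug_view01.png"),
            ("box", 0): describe("box_view00.png")}

def recognize(query_path, ratio=0.75):
    q = describe(query_path)
    scores = {}
    for (obj, view_id), d in database.items():
        good = [p[0] for p in matcher.knnMatch(q, d, k=2)
                if len(p) == 2 and p[0].distance < ratio * p[1].distance]   # ratio test
        scores[obj] = max(scores.get(obj, 0), len(good))    # best-matching view per object
    return max(scores, key=scores.get), scores

print(recognize("query_frame.png"))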

PROPOSED APPROACH
It is obvious that video carries much more information about a 3D object than a single 2D projection: not only can different views of the object be recorded, but the 3D structure can also be reconstructed with structure-from-motion techniques. However, these latter approaches require good image quality, camera calibration and relatively large computational power, still beyond most mobile computing platforms and intelligent sensor motes. Fortunately, mobile computing devices often contain inertial measurement units (IMUs), and camera calibration can be combined with IMUs [Hol]. It is still an open question, however, how to exploit IMUs in video-based recognition without going through the full structure-from-motion pipeline.

Our research is focused on a viewer centered recognition model where the relative position of the target object and the camera is utilized. Our preliminary experiments [Czuni] have already shown that IMUs can aid the recognition process at low computational cost; fast object tracking and/or segmentation, however, can still be a problem in this framework and is also a subject of research. Moreover, most object recognition methods are "passive" from the model side. We propose to build model-driven interactive retrieval methods where the search engine gives hints on how to move the camera around the object to obtain the fastest and most reliable recognition result. A simplified sketch of the IMU-aided matching idea follows below.
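
The sketch below (pure Python, with synthetic scores, a placeholder matching function and a made-up view database; it is not the actual proposed algorithm) merely illustrates two ingredients mentioned above: pruning the stored views to be matched using the IMU-estimated viewing direction, and accumulating per-frame evidence over a query video. The interactive "move the camera" hints are not sketched here.

# stored views: object label -> viewing angles (degrees) of the stored 2D views
views = {"mug": [0, 45, 90, 135], "box": [0, 60, 120]}

def match_score(frame, obj, angle):
    """Placeholder for a real feature-matching score between the current
    frame and the stored view (obj, angle); here it is simulated."""
    true_obj, true_angle = frame
    penalty = abs(true_angle - angle)
    return max(0.0, 1.0 - penalty / 90.0) if obj == true_obj else 0.1

def recognize(video, imu_yaw, tol=30.0):
    evidence = {obj: 0.0 for obj in views}
    for frame, yaw in zip(video, imu_yaw):
        for obj, angles in views.items():
            for a in angles:
                if abs(a - yaw) <= tol:          # IMU pruning: only nearby stored views
                    evidence[obj] += match_score(frame, obj, a)
    return max(evidence, key=evidence.get), evidence

# synthetic query video: a mug seen while the camera moves from 10 to 50 degrees
video = [("mug", a) for a in (10, 20, 30, 40, 50)]
imu_yaw = [12, 21, 28, 43, 49]                   # noisy IMU orientation estimates
print(recognize(video, imu_yaw))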

H. H. Bülthoff and S. Edelman, “Psychophysical support for a two-dimensional view interpolation theory of object recognition,” Proceedings of the National Academy of Sciences, vol. 89, no. 1, pp. 60–64, 1992.
F. Fang and S. He, “Viewer-centered object representation in the human visual system revealed by viewpoint aftereffects,” Neuron, vol. 45, no. 5, pp. 793–800, 2005.
R. Basri, “Viewer-centered representations in object recognition: A computational approach,” in Handbook of pattern recognition & computer vision. World Scientific Publishing Co., Inc., 1993, pp. 863–882.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR 2015, 2015. [Online]. Available: http://arxiv.org/abs/1409.4842
A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., 2012, pp. 1106–1114. [Online]. Available: http://books.nips.cc/papers/files/nips25/NIPS2012_0534.pdf
H. Noor, S. H. Mirza, Y. Sheikh, A. Jain, and M. Shah, “Model generation for video-based object recognition,” in Proceedings of the 14th ACM International Conference on Multimedia, ser. MM ’06. New York, NY, USA: ACM, 2006, pp. 715–718. [Online]. Available: http://doi.acm.org/10.1145/1180639.1180791
J. D. Hol, T. B. Schön, and F. Gustafsson, "A new algorithm for calibrating a combined camera and IMU sensor unit," in Control, Automation, Robotics and Vision, 2008. ICARCV 2008. 10th International Conference on. IEEE, 2008, pp. 1857–1862.
L. Czuni and M. Rashad, "Lightweight video object recognition based on sensor fusion," in Computational Intelligence for Multimedia Understanding (IWCIM), 2015 International Workshop on, Oct 2015, pp. 1–5.

Number of students who can be accepted: 1

Deadline for application: 2016-11-15

