SemanticPaint, a new interactive and online approach to 3D scene understanding*.
A few weeks ago Microsoft published a paper focused on scene understanding. In practice, it demonstrates how their system recognizes objects on the fly, online, and how labelling works through voice commands. We're speaking about SemanticPaint: it allows users to scan their environment while interactively segmenting the scene, simply by reaching out and touching any desired object or surface. A new volumetric inference technique then divides or combines the scanned bits and pieces into valid objects.
Interactive user input helps the system learn a new object class (e.g. swiping over a chair).
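To make that touch interaction concrete, here is a minimal sketch in Python (with hypothetical names such as `Volume` and `mark_touched_voxels`, and numpy as the only dependency) of how a touch or swipe gesture could stamp a class label onto the voxels around the contact points. It is only an illustration of the idea; the paper's actual GPU volumetric pipeline is not reproduced here.

```python
import numpy as np

# Hypothetical stand-in for the fused volumetric reconstruction:
# each voxel stores a signed distance and an integer class label.
class Volume:
    def __init__(self, shape, voxel_size):
        self.voxel_size = voxel_size
        self.sdf = np.full(shape, np.inf, dtype=np.float32)  # signed distances
        self.labels = np.zeros(shape, dtype=np.int32)        # 0 = unlabeled

    def world_to_voxel(self, points):
        # Map metric 3D points to clamped integer voxel indices.
        idx = (np.asarray(points) / self.voxel_size).astype(int)
        return np.clip(idx, 0, np.array(self.sdf.shape) - 1)

def mark_touched_voxels(volume, touch_points, class_id, radius=2):
    """Stamp `class_id` onto a small neighbourhood around each contact point."""
    for x, y, z in volume.world_to_voxel(touch_points):
        volume.labels[max(x - radius, 0):x + radius + 1,
                      max(y - radius, 0):y + radius + 1,
                      max(z - radius, 0):z + radius + 1] = class_id

# Example: a swipe over a chair yields a short trail of contact points.
vol = Volume(shape=(128, 128, 128), voxel_size=0.01)  # 1 cm voxels
swipe = [[0.50, 0.40, 0.30], [0.52, 0.40, 0.30], [0.54, 0.41, 0.30]]
mark_touched_voxels(vol, swipe, class_id=3)  # 3 = "chair" in a hypothetical map
```

These user-touched voxels then become the training examples from which the rest of the scene is labeled.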
The system continuously learns from these segmentations and labels new, unseen parts of the environment. The approach is fully online: users get continuous live feedback of the recognition during capture, allowing them to immediately correct errors in the segmentation and/or learning, a feature that has so far been unavailable to batch and offline methods. This leads to models that are tailored or personalized specifically to the user's environments and object classes of interest, opening up the potential for new applications in augmented reality, interior design, and human/robot navigation. It also provides the ability to capture substantial labeled 3D datasets for training large-scale visual recognition systems.
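As a rough illustration of such an online learning loop (not the paper's actual model), the sketch below uses scikit-learn's incremental `SGDClassifier.partial_fit` on hypothetical per-voxel features: every batch of user-labeled voxels updates the classifier immediately, and the still-unlabeled geometry is re-predicted to give the live feedback described above.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical feature extractor: a real system would compute geometric
# and appearance features per voxel from the live scan; random stand-ins here.
def voxel_features(n_voxels, dim=16):
    return np.random.rand(n_voxels, dim).astype(np.float32)

classes = np.array([1, 2, 3])        # hypothetical ids: wall, table, chair
model = SGDClassifier()

# Online loop: each time the user labels some voxels, update the model
# immediately and re-predict the still-unlabeled geometry for live feedback.
for frame in range(5):
    X_new = voxel_features(200)                   # freshly user-labeled voxels
    y_new = np.random.choice(classes, size=200)   # labels from touch gestures
    model.partial_fit(X_new, y_new, classes=classes)

    X_unseen = voxel_features(1000)               # rest of the scanned scene
    predicted = model.predict(X_unseen)           # shown back to the user
```

Because the model is updated per batch rather than retrained from scratch, corrections take effect right away, which is what makes the immediate error correction possible.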
Imagine walking into a room with a wearable consumer depth camera. As you move within the space, the dense 3D geometry of the room is automatically scanned in real time. As you begin to physically interact with the room, touching surfaces and objects, the 3D model is automatically segmented into semantically meaningful regions, each belonging to a specific category such as wall, door, table, book or cup. As you continue interacting with the world, a learned model of each category is updated and refined with new examples, allowing it to handle more variation in shape and appearance. When you start observing new instances of these learned object classes, perhaps in another part of the room, the 3D model is automatically and densely labeled and segmented, almost as quickly as the geometry is acquired. If the inferred labels are imperfect, you can quickly and interactively correct them and improve the learned model. In the end, you rapidly obtain accurate, labeled 3D models of large environments.

Very soon Microsoft (or others) could create huge databases of everyday objects, helping machines to see and truly understand the objects around us semantically! I'm not going in the scary direction now. The cool applications? Get audio descriptions of objects and places, use it for autonomous robot navigation, or for new gaming experiences. Your smartphone could fully understand your environment (not only a depth map) and integrate it into the gaming virtual space. Be it physics behaviour (to make the virtual pieces feel more real) or completely new game concepts where you must use the correct real-world props to interact with the virtual world! Exciting!

*Video: