Sequence Agnostic Multi-ON: a task in which a robot must find a set of objects in an environment in a sequence-agnostic manner (no ordering constraint).
For example: assume you are in a supermarket with a 15-item grocery list. Let's also assume you don't know where the various items are located. How would you go about putting items in your cart?
- One way could be to go according to the numbering on the list, giving higher priority to items placed higher on it. For example: if "Eggs" is written above "Bread", then even if you come across bread while searching for eggs, you ignore it until you begin your search for bread. This won't be optimal, but you can be sure you'll get all the items.
- Another possible approach: you start with the intention of "Let me start exploring from this area, and whatever item I see, I'll check it against the list and put it in my cart if it's there."
We tried to extend this simple concept into a more intuitive approach. Suppose your list consists only of frozen items; then you needn't start exploring in a random direction. You can go directly in search of a fridge and quickly put them all in your cart. As simple and intuitive as this sounds, implementing this approach raised three major questions:
- How do we tell an agent that all these objects belong to the frozen section? [i.e., how to learn object-object and object-region relationships]
- We previously assumed that it was an unexplored supermarket, so how would we know where to find a fridge? [i.e., appropriate exploration techniques]
- What if your list comprises several such clusters? Frozen stuff, vegetables, stationery items? Now what? [i.e., how to handle multiple object clusters]
This work tries to answer the first two questions, while I provide a solution and a framework for integrating the third question's solution in a separate white paper: White Paper's Link 📜
In this blog I’ll try to limit myself to the work I’ve contributed to.
The thinking process
- We tried to integrate RL with image segmentation models. Why RL? That's explained at the end.
- We trained our agents in various supermarket scenes that didn't share the same structure or object placements. The only thing common across the training scenes was the relative positions of sections: the Fruits and Vegetables sections occurred together, the Frozen and Bakery sections together, and so on. This helped our agent decide, "Okay! A fridge, cheese, butter, and jam are in my view (from the image segmentation model), so I should look around nearby for bread."
- We used RL to learn a long-term policy that outputs a long-term goal (2-D ground coordinates) representing where the agent should go to find these objects efficiently; a sketch of this interface follows below. Inputs to this policy were the current semantic map (built on the fly using the image segmentation model) and embeddings of the target object names. The embeddings were added as input because the policy needs to learn which object name corresponds to which section, i.e., which long-term goal works well for which objects.
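To make the policy's interface concrete, here is a minimal PyTorch-style sketch of such a long-term goal predictor. Everything about it (layer sizes, 300-dim name embeddings, mean-pooling over targets) is an illustrative assumption, not the exact network from the work.

```python
# A minimal sketch of a long-term goal policy network. The layer sizes, the
# pooling over target embeddings, and the module structure are illustrative
# assumptions, not the architecture actually used in the work.
import torch
import torch.nn as nn

class LongTermGoalPolicy(nn.Module):
    def __init__(self, map_channels=16, embed_dim=300):
        super().__init__()
        # Encode the partially-built semantic map (one channel per object category).
        self.map_encoder = nn.Sequential(
            nn.Conv2d(map_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Encode the word embeddings of the remaining target object names.
        self.target_encoder = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU())
        # Predict a 2-D long-term goal in normalized map coordinates.
        self.goal_head = nn.Sequential(
            nn.Linear(64 + 64, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Sigmoid(),  # (x, y) in [0, 1] of the map frame
        )

    def forward(self, semantic_map, target_embeddings):
        # semantic_map: (B, C, H, W); target_embeddings: (B, N, embed_dim)
        map_feat = self.map_encoder(semantic_map)
        # Pool over the N remaining targets so the list length can vary.
        target_feat = self.target_encoder(target_embeddings.mean(dim=1))
        return self.goal_head(torch.cat([map_feat, target_feat], dim=-1))

# The predicted goal would be handed to a local planner that navigates toward it,
# while the semantic map keeps getting updated from the segmentation output.
policy = LongTermGoalPolicy()
goal_xy = policy(torch.zeros(1, 16, 240, 240), torch.zeros(1, 3, 300))
```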
While everything here has been explained in the context of a supermarket, in our work we trained and evaluated on household scenes in the Habitat Simulator.
In the household scenario, the first question translates to: what are the object-object relationships between various static objects, and how can we leverage them to enhance the search for a list of such objects?
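As a toy illustration of the kind of object-object statistics this question is asking about, the sketch below counts how often two categories appear near each other across annotated scenes. The (category, position) data format, the distance threshold, and the counting scheme are assumptions for illustration; in the work itself, such relationships end up being learned implicitly by the RL policy from the name embeddings and semantic maps.

```python
# A toy sketch of estimating object-object co-occurrence from annotated scenes:
# count how often two categories appear within some distance of each other.
# The data format and the 2-meter radius are illustrative assumptions,
# not the actual training pipeline.
from collections import defaultdict
from itertools import combinations
import math

def cooccurrence_counts(scenes, radius=2.0):
    """scenes: list of scenes, each a list of (category, (x, y)) static objects."""
    counts = defaultdict(int)
    for objects in scenes:
        for (cat_a, pos_a), (cat_b, pos_b) in combinations(objects, 2):
            if math.dist(pos_a, pos_b) <= radius:
                counts[tuple(sorted((cat_a, cat_b)))] += 1
    return counts

# e.g. "bed" and "lamp" co-occur often, so spotting a lamp hints a bed is nearby.
scenes = [
    [("bed", (1.0, 1.0)), ("lamp", (1.5, 1.2)), ("sink", (8.0, 3.0))],
    [("bed", (4.0, 2.0)), ("lamp", (4.3, 2.4)), ("sofa", (0.5, 6.0))],
]
print(cooccurrence_counts(scenes))  # {('bed', 'lamp'): 2}
```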