FlowRetrieval: Flow-Guided Data Retrieval for Few-Shot Imitation Learning

Stanford University

FlowRetrieval leverages optical flow representations for extracting relevant prior data and guiding policy learning to maximally benefit from the retrieved data.

Our method uses only 10 target task demonstrations to learn the policy.

Abstract

Imitation learning in robotics requires an extensive amount of demonstrations, while the zero-shot performance of such pretrained policies often falls short. It is therefore critical to develop few-shot adaptation strategies that rely on only a small number of demonstrations for the target task. Recent research has shown that augmenting training data with past experiences can provide abundant signal when learning from small data. However, existing data retrieval methods fall into two extremes: they either rely on the exact same behaviors with visually similar objects existing in the prior data, which is impractical to assume; or they retrieve based on the semantic similarity of high-level language descriptions of the task, which may not be informative about the shared behaviors or motions across tasks. In this work, we investigate how to leverage the vast amount of cross-task data that shares similar motion with the target task to improve few-shot imitation learning. Our key insight is that motion-similar data carry rich information about the effects of actions and object interactions, which can be leveraged during few-shot adaptation. We propose FlowRetrieval, an approach that leverages optical flow representations both for extracting motions similar to the target task from prior data, and for guiding learning of a policy that can maximally benefit from such data. Our results show that FlowRetrieval significantly outperforms prior methods across simulated and real-world domains, achieving over 4X the performance of vanilla imitation learning.

Method Overview

We learn a motion-centric latent space for retrieving motions similar to the target data from prior data, and further guide policy learning with optical flow.

  • Motion-Centric Pretraining: FlowRetrieval acquires a motion-centric latent space by computing optical flow between the current frame and a future frame of the robot's RGB visual observations, and training a variational autoencoder (VAE) to embed the optical flow data.
  • Data Retrieval with the Learned Motion-Centric Latent Space: We select from the previously collected data the nearest neighbors of the target-task data in this latent space (a sketch of these two steps follows this list).
  • Flow-Guided Learning: During policy learning, FlowRetrieval leverages an auxiliary loss of predicting the optical flow as additional guidance for representation learning, encouraging the model to encode the image with enough detail to reconstruct the optical flow alongside predicting the action (a sketch of this loss appears after the retrieval sketch below).
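
Below is a minimal sketch of the first two steps (motion-centric pretraining and retrieval), assuming precomputed 64x64 optical-flow images and a PyTorch setup. The network sizes, the latent dimension, the KL weight, and the rule of selecting the top-k prior datapoints by distance to their nearest target embedding are illustrative assumptions, not the exact design used in the paper.

```python
# Sketch only: a small convolutional VAE over 2-channel optical-flow images
# and a nearest-neighbor retrieval rule in its latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowVAE(nn.Module):
    """Embeds 64x64, 2-channel optical-flow images into a latent space."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),
        )

    def encode(self, flow):
        h = self.encoder(flow)
        return self.fc_mu(h), self.fc_logvar(h)

    def forward(self, flow):
        mu, logvar = self.encode(flow)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(recon, flow, mu, logvar, beta=1e-3):
    # Flow reconstruction plus a (weighted) KL regularizer; beta is an assumption.
    recon_loss = F.mse_loss(recon, flow)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl

@torch.no_grad()
def retrieve(vae, target_flows, prior_flows, budget):
    """Return indices of the `budget` prior datapoints whose latent embeddings
    are closest to any target-task embedding (selection rule is an assumption)."""
    z_target, _ = vae.encode(target_flows)   # (N_target, d)
    z_prior, _ = vae.encode(prior_flows)     # (N_prior, d)
    dists = torch.cdist(z_prior, z_target)   # pairwise distances (N_prior, N_target)
    min_dist = dists.min(dim=1).values       # distance to nearest target point
    return torch.topk(min_dist, k=budget, largest=False).indices
```

The retrieved indices would then be used to augment the target demonstrations, and the policy is co-trained on the union of the target data and the selected prior data.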

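The flow-guided learning step can be sketched in the same spirit: the policy's image encoder feeds both an action head and an auxiliary head that reconstructs the optical flow toward a future frame, so the representation retains motion information. The ResNet-18 backbone, the head shapes, and the loss weight lambda_flow below are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch only: behavior cloning with an auxiliary flow-prediction loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class FlowGuidedPolicy(nn.Module):
    def __init__(self, action_dim=7, feat_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()          # reuse ResNet-18 as an RGB image encoder
        self.encoder = backbone
        self.action_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )
        # Auxiliary head: decode the image feature back into a 2-channel flow image.
        self.flow_head = nn.Sequential(
            nn.Linear(feat_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(32, 2, 4, 2, 1),               # 32 -> 64
        )

    def forward(self, image):
        feat = self.encoder(image)
        return self.action_head(feat), self.flow_head(feat)

def training_step(policy, image, action, flow, lambda_flow=0.1):
    """One loss computation on a batch drawn from the target plus retrieved data:
    action prediction plus the auxiliary flow-reconstruction term."""
    pred_action, pred_flow = policy(image)
    bc_loss = F.mse_loss(pred_action, action)
    flow_loss = F.mse_loss(pred_flow, flow)
    return bc_loss + lambda_flow * flow_loss
```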
Experiments

Tasks

Square Assembly (Target | Bad Prior | Good Prior)

LIBERO-Can (Target | Prior 1 | Prior 2)

Bridge-Pot (Target | Prior 1 | Prior 2)

Bridge-Microwave (Target | Prior 1 | Prior 2)

Franka-Pen-in-Cup (Target | PnP | Wild)


Quantitative Results

  • FlowRetrieval outperforms the best baseline method across tasks, achieving on average a 14% higher success rate in each domain (+10% in simulation, +19% in real).
  • FlowRetrieval also achieves on average 27% higher success rate than the best retrieval-based prior method.

Qualitative Analysis

FlowRetrieval retrieves similar motions from prior data, while BR and SR retrieve based on visual similarity of the state and consequently can retrieve adversarial data.

  • In the Square Assembly example, both baselines end up retrieving data points where the robot moves towards the round peg. FlowRetrieval, on the other hand, selects a very similar motion toward the square peg, even when the background color is completely different from that in the target data.
  • In LIBERO-Can, BR and SR again focus mainly on visual similarity instead of the picking-up motion and retrieve data points with very different motion (moving downwards). FlowRetrieval, however, retrieves the correct picking-up motion, even if it involves picking up a different object than in the target task.


  • When retrieving from the PnP dataset, FlowRetrieval focuses on the pick-up and transfer stages of the target task and does not retrieve the placing motions from the prior dataset, effectively filtering out adversarial prior data.
  • When retrieving from the Wild dataset, FlowRetrieval retrieves viewpoints better aligned with those in the target task, while ProprioRetrieval retrieves very different viewpoints (sometimes the robot is not even in view; see the example in the rightmost column, second from bottom).

In the Square Assembly task, the data right after the robot picks up the nut is crucial: actions that move towards the wrong goal are adversarial data. Data points after the forking point, while possibly belonging to a different task, are considered non-harmful. We can therefore analyze the quality of the retrieved data under different retrieval methods by plotting the amount of each type of data retrieved from each stage of the task (split into 10 bins). FlowRetrieval retrieves useful data uniformly from the prior dataset and retrieves little adversarial data, while the baseline models either cannot effectively filter out the adversarial data or do not retrieve enough useful data at the bottleneck stage (between pick-up and transfer) of the task. A small sketch of this binning analysis appears below.
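
The following sketch illustrates the binning analysis described above, assuming each retrieved datapoint has already been tagged (by hand or a task-specific heuristic) as useful, non-harmful, or adversarial and annotated with its normalized position within its source trajectory; the label names and data layout are illustrative assumptions.

```python
# Sketch only: histogram of retrieved datapoints by task stage and label.
import numpy as np

def bin_retrieved_data(retrieved, num_bins=10):
    """`retrieved` is a list of (progress, label) pairs, where progress in [0, 1]
    is the datapoint's position within its trajectory and label is one of
    'useful', 'non_harmful', 'adversarial'. Returns a label -> histogram dict."""
    counts = {lab: np.zeros(num_bins, dtype=int)
              for lab in ("useful", "non_harmful", "adversarial")}
    for progress, label in retrieved:
        b = min(int(progress * num_bins), num_bins - 1)  # clamp progress == 1.0
        counts[label][b] += 1
    return counts

# Example: one datapoint halfway through a trajectory heading toward the wrong peg,
# and one early datapoint that matches the target motion.
hist = bin_retrieved_data([(0.5, "adversarial"), (0.2, "useful")])
```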