DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning

🔥 DecisionNCE has been accepted at ICML 2024 and selected as an outstanding paper at the MFM-EAI workshop @ ICML 2024.

Introduction

DecisionNCE mirrors an InfoNCE-style objective but is tailored to decision-making tasks. It provides an embodied representation learning framework that extracts both local and global task progression features, enforces temporal consistency through implicit time contrastive learning, and ensures trajectory-level instruction grounding via multimodal joint encoding. Evaluation on both simulated and real robots demonstrates that DecisionNCE effectively facilitates diverse downstream policy learning tasks, offering a versatile solution for unified representation and reward learning.
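
For readers unfamiliar with the InfoNCE family that the objective mirrors, here is a minimal sketch of a generic InfoNCE loss over a similarity matrix, written with only the standard library so it runs anywhere. It is not the DecisionNCE objective itself, which contrasts trajectory segments against instructions as described in the paper; the function name `info_nce` and the temperature value are illustrative assumptions.

```python
import math


def info_nce(similarity, temperature=0.1):
    """Generic InfoNCE loss for a square similarity matrix.

    similarity[i][j] scores sample i against candidate j; the matching
    (positive) candidate for row i is assumed to sit on the diagonal.
    """
    losses = []
    for i, row in enumerate(similarity):
        logits = [s / temperature for s in row]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of selecting the positive candidate.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy check: well-aligned pairs should give a lower loss than uninformative ones.
aligned = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(info_nce(aligned) < info_nce(uniform))  # prints True
```

With uninformative similarities the loss equals log(2) for two candidates, which is the chance-level value; better-aligned positives push it toward zero.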

Quick Start

Install

Clone this repository and navigate to the DecisionNCE folder

git clone https://github.com/2toinf/DecisionNCE.git
cd DecisionNCE

Install Package

conda create -n decisionnce python=3.8 -y
conda activate decisionnce
pip install -e .
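
A quick way to confirm that the editable install succeeded is to check that the package can be found from the active environment. The sketch below uses only the standard library; the helper name `is_importable` is hypothetical, and the import name `DecisionNCE` is taken from the Usage section.

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be located on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# "DecisionNCE" is the import name used in the Usage section of this README.
print("DecisionNCE importable:", is_importable("DecisionNCE"))
```

If this prints `False`, the `decisionnce` conda environment is probably not the active one.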

Usage

import DecisionNCE
import torch
from PIL import Image
# Load your DecisionNCE model

device = "cuda" if torch.cuda.is_available() else "cpu"
model = DecisionNCE.load("DecisionNCE-P", device=device)

image = Image.open("Your Image Path Here")
text = "Your Instruction Here"

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    reward = model.get_reward(image, text)  # the number of images and texts must be the same
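
Because the number of images and instructions must match, it can help to validate the pairing before calling the model. The following is a minimal sketch of such a check; `pair_inputs` is a hypothetical helper, not part of the DecisionNCE package, and whether `get_reward` accepts lists in exactly this form is an assumption to verify against the package source.

```python
def pair_inputs(images, texts):
    """Return equal-length (images, texts) lists or raise ValueError.

    A single instruction is repeated for every image; any other length
    mismatch is rejected. Hypothetical helper, not a DecisionNCE API.
    """
    images = list(images)
    texts = [texts] if isinstance(texts, str) else list(texts)
    if len(texts) == 1:
        # Broadcast one instruction across all frames.
        texts = texts * len(images)
    if len(images) != len(texts):
        raise ValueError(
            f"got {len(images)} images but {len(texts)} instructions"
        )
    return images, texts


# Placeholder strings stand in for PIL images in this toy example.
frames, instructions = pair_inputs(["frame0", "frame1", "frame2"], "open the drawer")
print(len(frames), len(instructions))  # prints 3 3
```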

API

`decisionnce.load(name, device)`

Returns the DecisionNCE model specified by the model name returned by decisionnce.available_models(). It will download the model if necessary. The name argument should be DecisionNCE-P or DecisionNCE-T.

The device on which to run the model can optionally be specified; the default is the first CUDA device if one is available, otherwise the CPU.
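
The argument handling described above can be sketched as a small pre-check that runs before the model is loaded. `resolve_load_args` and `VALID_NAMES` are hypothetical names introduced for illustration and are not part of the package; the fallback to `"cpu"` when PyTorch is missing is also an assumption made so the sketch runs anywhere.

```python
VALID_NAMES = ("DecisionNCE-P", "DecisionNCE-T")  # the two names documented above


def resolve_load_args(name, device=None):
    """Validate the model name and fill in a default device.

    Hypothetical helper: mirrors the documented defaults, but is not
    how the package itself necessarily implements them.
    """
    if name not in VALID_NAMES:
        raise ValueError(f"unknown model {name!r}; expected one of {VALID_NAMES}")
    if device is None:
        try:
            import torch  # optional here; only used to probe for CUDA
            device = "cuda" if torch.cuda.is_available() else "cpu"
        except ImportError:
            device = "cpu"
    return name, device


print(resolve_load_args("DecisionNCE-P", device="cpu"))  # prints ('DecisionNCE-P', 'cpu')
```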

The model returned by decisionnce.load() supports the following methods:

`model.encode_image(image: Tensor)`

Given a batch of images, returns the image features encoded by the vision portion of the DecisionNCE model.

`model.encode_text(text: Tensor)`

Given a batch of text tokens, returns the text features encoded by the language portion of the DecisionNCE model.

Train

Pretrain

We pretrain the vision and language encoders jointly with DecisionNCE-P/T on the EpicKitchen-100 dataset. We provide the training code and scripts in this repo. Please follow the instructions below to start training.

Data preparation

Please follow the official instructions and download the EpicKitchen-100 RGB images here. We also provide our training annotations, reorganized from the official version.

Start training

We use Slurm for multi-node distributed training.

sh ./script/slurm_train.sh

Please fill in your image and annotation paths at the specified locations in the script.

Model Zoo

Models	Pretraining Methods	Params (M)	Iters	Pretrain ckpt
RN50-CLIP	DecisionNCE-P	386	20K	link
RN50-CLIP	DecisionNCE-T	386	20K	link

Evaluation

Result

Simulation

Real robot

Visualization

We provide a Jupyter notebook for visualizing the reward curves. Please install Jupyter Notebook first.

conda install jupyter notebook
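
To make the idea of a reward curve concrete without the notebook, the sketch below computes one scalar per frame and lists the values in time order. It scores each frame by the cosine similarity between a frame feature and an instruction feature, which is an illustrative assumption and not necessarily what `model.get_reward` computes internally; in practice the features would come from `model.encode_image` and `model.encode_text`, while the toy vectors here are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def reward_curve(frame_features, text_feature):
    """One similarity score per frame, in time order (illustrative scoring rule)."""
    return [cosine(feat, text_feature) for feat in frame_features]


# Toy 2-D features: the frames rotate toward the instruction direction.
frames = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
instruction = [0.0, 1.0]
curve = reward_curve(frames, instruction)
print([round(r, 3) for r in curve])  # prints [0.0, 0.707, 1.0]
```

A curve that rises over the course of a successful trajectory is the pattern such a plot is meant to reveal.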

To be updated.

Citation

If you find our code and paper helpful, please cite our paper as:

@inproceedings{lidecisionnce,
  title={DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning},
  author={Li, Jianxiong and Zheng, Jinliang and Zheng, Yinan and Mao, Liyuan and Hu, Xiao and Cheng, Sijie and Niu, Haoyi and Liu, Jihao and Liu, Yu and Liu, Jingjing and others},
  booktitle={Forty-first International Conference on Machine Learning}
}
