How can we use language supervision to learn better visual representations for robotics?
Introducing Voltron: Language-Driven Representation Learning for Robotics!
Paper: https://t.co/gIsRPtSjKz
Models: https://t.co/NOB3cpATYG
Evaluation: https://t.co/aOzQu95J8z
🧵👇 (1/12)
![](https://pbs.twimg.com/media/Fp_Pp79agAA36b8.jpg)
Yet, an underused part of these datasets is the rich, natural language annotations accompanying each video. (2/12)
The secret is *balance* between two objectives (sketched below): (3/12)
1) Condition on language and improve our ability to reconstruct the scene.
2) Generate language given the visual representation and improve our ability to describe what's happening. (4/12)
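To make the trade-off concrete, here's a minimal sketch of what such a dual objective could look like in PyTorch (the `reconstruct`/`generate` heads and the weighting are illustrative assumptions, not Voltron's exact code):

```python
import torch.nn.functional as F

def dual_objective_loss(model, frames, captions, alpha=0.5):
    """Sketch of a balanced dual objective; `model`'s two heads
    below are hypothetical stand-ins for the real implementation."""
    # 1) Language-conditioned reconstruction of the scene
    recon, targets = model.reconstruct(frames, captions)  # assumed API
    recon_loss = F.mse_loss(recon, targets)

    # 2) Language generation from the visual representation
    logits, token_ids = model.generate(frames, captions)  # assumed API
    lm_loss = F.cross_entropy(logits.flatten(0, 1), token_ids.flatten())

    # `alpha` shapes the balance between the two objectives
    return alpha * recon_loss + (1 - alpha) * lm_loss
```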
Why is the ability to shape this balance important? (5/12)
How do we know?
Because we built an evaluation suite spanning 5 diverse robotics problem domains! (6/12)
![](https://pbs.twimg.com/media/Fp_QR5gaUAE8axM.jpg)
Evaluation: the ARC Grasping dataset (https://t.co/rRI4ya84DL) – CC @andyzengtweets @SongShuran. (7/12)
![](https://pbs.twimg.com/media/Fp_QjJWacAA5WSf.jpg)
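A common protocol for this kind of evaluation (a sketch under assumed shapes, not necessarily the suite's exact setup) is to freeze the pretrained encoder and train only a small head for grasp-affordance prediction:

```python
import torch
import torch.nn as nn

class AffordanceProbe(nn.Module):
    """Sketch: a small trainable head on a frozen visual backbone.
    The encoder's (B, N, feat_dim) patch output is an assumption."""
    def __init__(self, encoder, feat_dim=384):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad = False          # freeze the backbone
        self.head = nn.Linear(feat_dim, 1)   # per-patch grasp score

    def forward(self, images):
        with torch.no_grad():
            feats = self.encoder(images)     # (B, N, feat_dim)
        return self.head(feats).squeeze(-1)  # (B, N) affordance logits
```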
Modeling *multi-frame* contexts (easy with Voltron) is also high-impact!
Evaluation: Franka Kitchen & Adroit Manipulation domains from R3M – CC @aravindr93 @Vikashplus. (8/12)
![](https://pbs.twimg.com/tweet_video_thumb/Fp_QwZnaYAMAnrT.jpg)
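For intuition, here's a hedged sketch of multi-frame conditioning (shapes and the encoder interface are assumptions, not Voltron's exact API):

```python
import torch

# Hypothetical context window: two frames (e.g., start + current state)
frames = torch.randn(1, 2, 3, 224, 224)  # (batch, context=2, C, H, W)

def encode_context(encoder, frames):
    # Encode each frame, then let downstream layers (or the encoder
    # itself, if multi-frame-aware) attend across the whole window.
    b, t = frames.shape[:2]
    feats = [encoder(frames[:, i]) for i in range(t)]  # t x (b, N, D)
    return torch.cat(feats, dim=1)                     # (b, t*N, D)
```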
Given a video & language intent, we can score – in real time – how well the behavior in the video captures the intent.
Transfers to *robot data* – no robots during pretraining! (9/12)
![](https://pbs.twimg.com/media/Fp_Q5o1acAAl0Vt.jpg)
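One simple way to compute such a score (a sketch; the encoders and pooling here are placeholders, not Voltron's exact method) is cosine similarity between visual and language embeddings:

```python
import torch.nn.functional as F

def intent_score(visual_encoder, text_encoder, frames, instruction):
    """Sketch: score how well a video matches a language intent."""
    v = visual_encoder(frames).mean(dim=1)  # pool (B, N, D) patch features
    l = text_encoder([instruction])         # (1, D) language embedding
    # Cosine similarity in [-1, 1]; higher = behavior better matches intent
    return F.cosine_similarity(v, l, dim=-1)
```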
Models & Pretraining: https://t.co/NOB3cpATYG
Evaluation Suite: https://t.co/aOzQu95J8z
Use our models: `pip install voltron-robotics` (10/12)
![](https://pbs.twimg.com/media/Fp_RMr1aEAEbnPD.jpg)
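Getting started then looks roughly like this (based on the package README; treat the exact names and the `mode` flag as assumptions and defer to the repo):

```python
import torch
from voltron import instantiate_extractor, load  # package's public API

# Load a language-conditioned Voltron model + its image preprocessor
vcond, preprocess = load("v-cond", device="cpu", freeze=True)

# Encode a dummy frame conditioned on a language instruction
img = preprocess(torch.zeros(1, 3, 224, 224))
with torch.no_grad():
    patches = vcond(img, ["pick up the red mug"], mode="multimodal")

# Collapse patch features into a single vector for a downstream policy
extractor = instantiate_extractor(vcond)()
rep = extractor(patches)
```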
Further thanks to @ToyotaResearch, @stanfordnlp, and the @StanfordAILab! (11/12)