More from Angela Jamison

More from All

These 10 threads will teach you more than reading 100 books

Five billionaires share their top lessons on startups, life and entrepreneurship (1/10)


10 competitive advantages that will trump talent (2/10)


Some harsh truths you probably don’t want to hear (3/10)


10 significant lies you’re told about the world (4/10)
How can we use language supervision to learn better visual representations for robotics?

Introducing Voltron: Language-Driven Representation Learning for Robotics!

Paper: https://t.co/gIsRPtSjKz
Models: https://t.co/NOB3cpATYG
Evaluation: https://t.co/aOzQu95J8z

🧵👇(1 / 12)


Videos of humans performing everyday tasks (Something-Something-v2, Ego4D) offer a rich and diverse resource for learning representations for robotic manipulation.

Yet, an underused part of these datasets are the rich, natural language annotations accompanying each video. (2/12)

The Voltron framework offers a simple way to use language supervision to shape representation learning, building off of prior work in representations for robotics like MVP (
https://t.co/Pb0mk9hb4i) and R3M (https://t.co/o2Fkc3fP0e).

The secret is *balance* (3/12)

Starting with a masked autoencoder over frames from these video clips, make a choice:

1) Condition on language and improve our ability to reconstruct the scene.

2) Generate language given the visual representation and improve our ability to describe what's happening. (4/12)

By trading off *conditioning* and *generation* we show that we can learn 1) better representations than prior methods, and 2) explicitly shape the balance of low and high-level features captured.

Why is the ability to shape this balance important? (5/12)

You May Also Like

The YouTube algorithm that I helped build in 2011 still recommends the flat earth theory by the *hundreds of millions*. This investigation by @RawStory shows some of the real-life consequences of this badly designed AI.


This spring at SxSW, @SusanWojcicki promised "Wikipedia snippets" on debated videos. But they didn't put them on flat earth videos, and instead @YouTube is promoting merchandising such as "NASA lies - Never Trust a Snake". 2/


A few example of flat earth videos that were promoted by YouTube #today:
https://t.co/TumQiX2tlj 3/

https://t.co/uAORIJ5BYX 4/

https://t.co/yOGZ0pLfHG 5/
The UN just voted to condemn Israel 9 times, and the rest of the world 0.

View the resolutions and voting results here:

The resolution titled "The occupied Syrian Golan," which condemns Israel for "repressive measures" against Syrian citizens in the Golan Heights, was adopted by a vote of 151 - 2 - 14.

Israel and the U.S. voted 'No'
https://t.co/HoO7oz0dwr


The resolution titled "Israeli practices affecting the human rights of the Palestinian people..." was adopted by a vote of 153 - 6 - 9.

Australia, Canada, Israel, Marshall Islands, Micronesia, and the U.S. voted 'No' https://t.co/1Ntpi7Vqab


The resolution titled "Israeli settlements in the Occupied Palestinian Territory, including East Jerusalem, and the occupied Syrian Golan" was adopted by a vote of 153 – 5 – 10.

Canada, Israel, Marshall Islands, Micronesia, and the U.S. voted 'No'
https://t.co/REumYgyRuF


The resolution titled "Applicability of the Geneva Convention... to the
Occupied Palestinian Territory..." was adopted by a vote of 154 - 5 - 8.

Canada, Israel, Marshall Islands, Micronesia, and the U.S. voted 'No'
https://t.co/xDAeS9K1kW