We introduce a new paradigm for dataset creation based on human 🧑‍💻 and machine 🤖 collaboration, which brings together the generative strength of LMs and the evaluative strength of humans. And we collect 🎉 WaNLI, a dataset of 108K NLI examples! 🧵

Paper: https://t.co/IUXcm9wIh2

Our pipeline starts with an existing dataset (MNLI), and uses data maps 📜 to automatically identify pockets of examples that demonstrate challenging 🧐 reasoning patterns relative to a trained model. Then we use GPT-3 to generate new examples likely to have the same pattern. 2/
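A minimal sketch of the data-map step, under stated assumptions: data maps summarize each training example by the mean (confidence) and standard deviation (variability) of the probability the model assigns to its gold label across training epochs. The thread does not say which region of the map is selected, so picking the highest-variability examples is an assumption here, and `most_ambiguous` and `fraction` are hypothetical names, not the authors' code.

```python
from statistics import mean, pstdev


def data_map_stats(gold_probs_per_epoch):
    """Return (confidence, variability) for one training example.

    gold_probs_per_epoch: probabilities the model assigned to the gold
    label at the end of each training epoch.
    """
    confidence = mean(gold_probs_per_epoch)      # mean gold-label probability
    variability = pstdev(gold_probs_per_epoch)   # spread across epochs
    return confidence, variability


def most_ambiguous(examples, fraction=0.25):
    """Return ids of the highest-variability examples (assumed criterion).

    examples: dict mapping example id -> list of per-epoch gold probabilities.
    fraction: hypothetical cutoff for how much of the data to keep.
    """
    ranked = sorted(
        examples,
        key=lambda ex_id: data_map_stats(examples[ex_id])[1],
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

The selected examples would then serve as in-context seeds when prompting the generator for new examples with a similar pattern.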
Next we propose a new metric, also inspired by data maps, to automatically filter generations for those most likely to aid model learning. Finally, we validate ✅ the generated examples through crowdworkers, who assign a gold label 🟡 and (optionally) revise for quality ✍️. 3/
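The thread does not define the filtering metric, so the following is only one plausible reading, sketched under assumptions: a generated example has no gold label yet, so variability is computed for every candidate label across model checkpoints and the largest value is used as the score. `max_label_variability`, `filter_generations`, and `keep` are hypothetical names.

```python
from statistics import pstdev

LABELS = ("entailment", "neutral", "contradiction")


def max_label_variability(probs_per_checkpoint):
    """Score an unlabeled example by its largest per-label variability.

    probs_per_checkpoint: list of dicts (one per checkpoint) mapping each
    NLI label to the probability the model assigned to it.
    """
    return max(
        pstdev([probs[label] for probs in probs_per_checkpoint])
        for label in LABELS
    )


def filter_generations(generations, keep):
    """Return the ids of the `keep` highest-scoring generations.

    generations: dict mapping generation id -> list of per-checkpoint
    label-probability dicts.
    """
    ranked = sorted(
        generations,
        key=lambda gen_id: max_label_variability(generations[gen_id]),
        reverse=True,
    )
    return ranked[:keep]
```

Only the generations that survive this filter would be sent on to crowdworkers for labeling and revision.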
Remarkably, replacing MNLI with WaNLI (which is 4x smaller) for training improves performance 📈 on seven OOD test sets 🧪, including by 11% on HANS and 9% on ANLI. Under a data augmentation setting, combining MNLI with WaNLI is more effective than using other augmentation sets. 4/
Our method addresses limitations of crowdsourcing, where workers may resort to repetitive writing strategies 🤷, and leverages the great progress in text generation 📃. We get the best of both worlds: 🤖’s ability to produce diverse examples, and 🧑‍💻’s ability to evaluate them. 5/
We hope our work demonstrates the promise of leveraging LMs in a controlled way to aid the dataset creation process, and encourage the community to think of dataset curation as an AI challenge itself 💡. Co-authored with @swabhz @nlpnoah @YejinChoinka 💟 6/6
