Lessons learned about benchmarking, adversarial testing, the dangers of over- and under-claiming, and AI alignment.
Transcript: https://web.stanford.edu/class/cs224u/podcast/bowman/
Sam's website
Sam on Twitter
NYU Linguistics
NYU Data Science
NYU Computer Science
Anthropic
SNLI paper: A large annotated corpus for learning natural language inference
SNLI leaderboard
FraCaS
SICK
A SICK cure for the evaluation of compositional distributional semantic models
SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment
RTE Knowledge Resources
Richard Socher
Chris Manning
Andrew Ng
Ray Kurzweil
SQuAD
Gabor Angeli
Adina Williams
Adina Williams podcast episode
MultiNLI paper: A broad-coverage challenge corpus for sentence understanding through inference
MultiNLI leaderboards
Twitter discussion of LLMs and negation
GLUE
SuperGLUE
DecaNLP
GPT-3 paper: Language Models are Few-Shot Learners
FLAN
Winograd schema challenges
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
JSALT: General-Purpose Sentence Representation Learning
Ellie Pavlick
Ellie Pavlick podcast episode
Tal Linzen
Ian Tenney
Dipanjan Das
Yoav Goldberg
Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
Big Bench
Upwork
Surge AI
Dynabench
Douwe Kiela
Douwe Kiela podcast episode
Ethan Perez
NYU Alignment Research Group
Eliezer Shlomo Yudkowsky
Alignment Research Center
Redwood Research
Percy Liang podcast episode
Richard Socher podcast episode