New Horizons in Language Science:

Large Language Models, Language Structure, and the Cognitive and Neural Basis of Language


May 13-14, 2024 @ The U.S. National Science Foundation (NSF). Alexandria, Virginia.


Large language models are remarkably successful as technological tools, and pose challenges and opportunities for the scientific study of natural language in the human mind and brain. This workshop presents talks, commentary, and discussion dedicated to the following three themes:

  • What insights do large language models provide for the study of human language?
  • What insights does the study of human language provide for large language model development?
  • What key future scientific opportunities lie at the interface between the study of human language and large language model development?

Recordings of the workshop are available to watch.

Specific talks can also be accessed via the 📹 icons next to the scheduled speakers.


Schedule (Eastern Time; precise times subject to revision)


Monday, May 13:

9:00–9:10am Opening remarks by NSF leadership
9:10–9:30am Introductory remarks and orientation by workshop organizers
9:30am–12:30pm Theme 1: What insights do large language models provide for the study of human language?
9:30–9:35am Kara Federmeier: Introductory remarks for Theme 1
9:35–9:55am Benjamin Bergen: Large Language Models as distributional baselines for human language processing research 📹
9:55–10:15am Leila Wehbe: Learning representations of complex meaning in the human brain 📹
10:15–10:35am Ariel Goldstein: Deep modeling as (more) than (just) cognitive framework 📹
10:35–10:55am David Bau: Locating neural functions and facts 📹
10:55–11:10am Break
11:10–11:50am Moderated commentary session with Raquel Fernández, Tal Linzen, and Jon Willits 📹
11:50am–12:30pm Moderated panel discussion including all speakers and commentators from the theme 📹
12:30–2:00pm Lunch Break
2:00–5:00pm Theme 2: What insights does the study of human language provide for large language model development?
2:00–2:05pm Christopher Manning: Introductory remarks for Theme 2
2:05–2:25pm Tom McCoy: Using insights from linguistics to understand and guide Large Language Models 📹
2:25–2:45pm Najoung Kim: Linguistic tests as unit tests for AI systems 📹
2:45–3:05pm Adina Williams: The shifting landscape of LM evaluation 📹
3:05–3:25pm Ryan Cotterell: A formal perspective on language modeling 📹
3:25–3:40pm Break
3:40–4:20pm Moderated commentary session with commentators Kyle Mahowald, Anna Rogers, and Timothy Rogers 📹
4:20–5:00pm Moderated panel discussion including all speakers and commentators from the theme 📹

Tuesday, May 14:
9:00am–12:00pm Theme 3: What key future scientific opportunities lie at the interface between the study of human language and large language model development?
9:00–9:05am Roger Levy: Introductory remarks for Theme 3
9:05–9:25am Adele Goldberg: Compositionality in natural language and LLMs 📹
9:25–9:45am Anna Ivanova: Dissociating language and thought in Large Language Models 📹
9:45–10:05am Gašper Beguš: Interpretability techniques for scientific discovery 📹
10:05–10:25am Alex Huth: Mapping and decoding language representations in human cortex 📹
10:25–10:40am Break
10:40–11:20am Moderated commentary session with commentators Antonios Anastasopoulos, Laura Gwilliams, and Ishita Dasgupta 📹
11:20am–12:00pm Moderated panel discussion including all speakers and commentators from the theme 📹
12:00–1:30pm Lunch Break
1:30–2:30pm Summative talks by workshop organizers Kara Federmeier, Roger Levy, and Christopher Manning 📹
2:30–2:45pm Closing remarks by NSF leadership


Speakers


David Bau
Northeastern University
Gašper Beguš
UC Berkeley
Benjamin Bergen
UC San Diego
Ryan Cotterell
ETH Zürich
Kara Federmeier
University of Illinois
Adele Goldberg
Princeton University
Ariel Goldstein
Hebrew University
Alex Huth
UT Austin
Anna Ivanova
Georgia Tech
Najoung Kim
Boston University
Christopher Manning
Stanford University
Tom McCoy
Yale University
Leila Wehbe
Carnegie Mellon

Commentators


Raquel Fernández
University of Amsterdam
Laura Gwilliams
Stanford University
Kyle Mahowald
UT Austin
Anna Rogers
IT University of Copenhagen
Timothy Rogers
University of Wisconsin
Jon Willits
University of Illinois

Organizing Team


Kara Federmeier
University of Illinois
Christopher Manning
Stanford University
Benjamin Lipkin
MIT (Student Liaison)

Motivation


This is a workshop dedicated to interdisciplinary connections between today’s large language models, language structure, and language processing in the human mind and brain. The structure of language, and the mental and neural basis of how it is learned, understood, and produced, have been perennial central questions in linguistics, computer science, cognitive science, and neuroscience. Historically, it has been a major challenge to develop implemented computational models that can generate and process language in anything approaching a human-like manner.

In recent years, however, this situation has been transformed by the impressive success of modern deep-learning technology: relatively simple artificial neural network architectures, when coupled with large-scale natural language corpora and computational software and hardware for training massive models with billions to hundreds of billions of parameters, learn to generate complex text of remarkable fluency and even seem to exhibit numerous "emergent" behaviors such as rhyming, metaphorical language use, and certain types of common-sense reasoning. Contemporary large language models (LLMs) achieve these successes even though, or perhaps because, their internal representations are high-dimensional numeric embedding vectors that superficially seem to be very unlike the symbolic, hierarchical grammatical representations traditionally used to describe linguistic structure. Despite this apparent difference, LLMs' context-based word predictions reflect complex aspects of linguistic structure and correlate with human behavioral responses; tree-structured grammatical representations of sentences can be decoded with surprising accuracy from LLMs' embeddings; and those embeddings can even be used to predict high-dimensional brain responses during real-time language comprehension.

But language in LLMs is also very different from language in humans. LLMs' training data is not grounded in extra-linguistic sensory or social context; their inductive biases do not always reflect common features found across languages of the world; their interpretive strategies can be fooled by superficial features of linguistic inputs; their patterns in ambiguity management differ from humans'; and their common-sense reasoning patterns are often unreliable and inconsistent. In some cases, symbolic approaches can still yield superior performance on their own or in tandem with LLMs. Overall, while LLMs constitute remarkable technological advances, there are strong reasons to believe that they offer far from a complete picture of language development and processing in the human mind and brain.

Inspired by this state of affairs, this workshop offers interdisciplinary talks and discussion spanning the fields of machine learning & natural language processing, linguistics, neuroscience, and cognitive science.


Abstracts


Benjamin Bergen: Large Language Models as distributional baselines for human language processing research

Cognitive and linguistic capacities are often explained as resulting from sources outside of language, such as innate predispositions or grounded, embodied, or situated learning. But Large Language Models now rival human performance in a variety of tasks that have been argued to derive from these external causes. This raises the question: what human experimental results require language-external explanations, and which are consistent with a distributional, statistical learning account? We consider several linguistic inference phenomena—relating to affordances, pronoun resolution, and false belief inference. In each case, we run both human participants and LLMs on the same task and ask how much of the variance in human behavior is explained by the LLMs. As it turns out, in all cases, human behavior is not fully explained by the LLMs. This entails that, at least for now, we need something that goes beyond statistical language learning to explain these aspects of human language processing. At the same time, the LLM predictions do explain some of the human variance in these tasks, which simultaneously suggests a need for tighter experimental controls and more restricted inference when attributing human behavior to potential causes. The talk will conclude by asking—but not answering—a number of questions, like what the right criteria are for an LLM that serves as a proxy for human statistical language learning.
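One simple way to read "how much of the variance in human behavior is explained by the LLMs" is as the R² of a regression from model predictions to human responses. The sketch below assumes an ordinary least-squares fit with a single predictor and uses made-up numbers; the function name `variance_explained` and the data are illustrative assumptions, not taken from the talk, whose actual analyses may differ.

```python
from statistics import mean


def variance_explained(llm_scores, human_scores):
    """Return R^2 of a one-predictor least-squares fit of human_scores on llm_scores."""
    if len(llm_scores) != len(human_scores) or len(llm_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(llm_scores), mean(human_scores)
    sxx = sum((x - mx) ** 2 for x in llm_scores)
    syy = sum((y - my) ** 2 for y in human_scores)
    sxy = sum((x - mx) * (y - my) for x, y in zip(llm_scores, human_scores))
    if sxx == 0 or syy == 0:
        raise ValueError("both sequences must vary")
    # For simple linear regression, R^2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-item data: LLM-derived scores and mean human ratings.
llm = [0.1, 0.4, 0.5, 0.7, 0.9]
human = [1.0, 2.5, 2.0, 4.0, 3.5]
r2 = variance_explained(llm, human)
print(f"R^2 = {r2:.2f}; unexplained share = {1 - r2:.2f}")
```

An R² below 1 leaves residual variance; in the abstract's framing, that residue is what either reflects noise or calls for an explanation beyond distributional learning.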


Leila Wehbe: Learning representations of complex meaning in the human brain

It has become increasingly common to use representations extracted from modern language models to study language comprehension in the human brain. This approach often predicts brain activity accurately and frequently accounts for almost all the variance in the recordings that is not attributable to noise. However, better prediction performance doesn't always lead to better scientific interpretability. This talk presents some approaches for the difficult problem of making scientific inferences about how the brain represents high-level meaning. While these inferences are based on the powerful ability of today's language models to predict brain recordings, this talk also explores the limitations of these models and their divergence from brain activity recordings, suggesting some language phenomena that they process differently from how humans do.


Ariel Goldstein: Deep Modeling as (more) than (just) cognitive framework

In my presentation, I will explore the assertion that deep learning-based models are not mere black boxes but rather valid frameworks for articulating computational theories of cognition and its neural infrastructure. My focus will primarily be on the comprehension and production of natural speech, using naturalistic stimuli and unrestrained conversations. I will illustrate common principles shared between deep language models (text-based), deep multimodal models (text and audio-based), and the brain using the neural activity associated with speech. I will discuss the new light that adopting deep modeling sheds on fundamental, long-standing issues in cognitive neuroscience.


David Bau: Locating neural functions and facts

Can we locate knowledge within a neural language model? Within an artificial neural network we can see every step of a neural computation. We discuss how simple counterfactual interventions can trace the neural mechanisms underlying a transformer language model's factual predictions such as “Miles Davis plays the trumpet.” And then we use the same technique to trace a language model’s remarkable ability to generalize a function after seeing examples, revealing a concrete mechanism for composing functional tasks.


Tom McCoy: Using insights from linguistics to understand and guide Large Language Models

A central goal in linguistics is characterizing how human language works. In this talk, I will discuss how such characterizations can contribute to the development of large language models (LLMs). First, analyses from linguistics can be used as standards against which we can evaluate LLMs, helping us to understand these notoriously hard-to-understand systems. Second, analyses from linguistics can also be used as targets toward which we can guide LLMs through specialized training procedures, enabling the creation of models that learn faster and generalize more robustly. Thus, linguistics can help make language models both more interpretable and more controllable.


Najoung Kim: Linguistic tests as unit tests for AI systems

The models underlying current breakthroughs in AI use language as a core medium of problem solving as well as interfacing with humans. In this regard, (behavioral) tests that gauge the linguistic capacity of AI systems serve as unit tests (small, focused tests to verify expected system behavior) for the stability of the models' core. Then, the role of the study of human language is clear: it is critical in defining both the unit itself and the test target. It motivates a fundamentally different carving of the problem space compared to benchmarks targeting downstream tasks such as information-seeking QA or machine translation. For example, if models systematically struggle with negation, this will affect many downstream tasks in practice, but may not always be captured by the task-specific benchmarks. I will discuss two lines of work in this direction on compositional generalization and entity tracking, and show how the test outcomes can inform both synchronic and diachronic solutions in model development. In this discussion, I will additionally argue that the linguistic unit tests need not and should not be fully analogous to tests targeting humans because the goal of AI is not necessarily to build a model _of_ humans but is to build something useful.
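The unit-test analogy can be made concrete with minimal pairs: a model passes a test if it prefers the acceptable sentence of each pair. The sketch below only illustrates that idea and is not one of the benchmarks from the talk; `toy_score` is a hypothetical stand-in for a real model's sentence score, and the sentence pairs are invented.

```python
def toy_score(sentence):
    """Hypothetical stand-in for a model's sentence score (higher = preferred).

    A real test would query a language model here. This toy version only
    penalizes the plural subject "keys" combined with singular "is", so that
    the example runs on its own.
    """
    words = sentence.lower().rstrip(".").split()
    return -1.0 if "keys" in words and "is" in words else 0.0


def failed_pairs(score, pairs):
    """Return the (acceptable, unacceptable) pairs the scorer gets wrong."""
    return [(good, bad) for good, bad in pairs if not score(good) > score(bad)]


# Invented minimal pairs: each differs only in subject-verb agreement.
PAIRS = [
    ("The keys to the cabinet are here.", "The keys to the cabinet is here."),
    ("The keys on the table are new.", "The keys on the table is new."),
]

failures = failed_pairs(toy_score, PAIRS)
print("all passed" if not failures else f"failed: {failures}")
```

Because each pair isolates a single linguistic property, a failure points at that property directly, which is the sense in which such tests differ from task-level benchmarks.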


Adina Williams: The shifting landscape of LM evaluation

Cognitive scientists have contributed extensively to the development of large language models in the past and the present. Most notably, they have performed essential data work, have devised methods for interpreting and explaining model behavior, and have created important model performance evaluations. Despite this, the growing productionization potential of language models is spurring a shift in the types of contributions cognitive scientists are positioned to make. This talk will describe what this state of affairs means for cognitive scientists focusing on LM evaluation, and point to opportunities for new kinds of collaborations between cognitive scientists and the developers of machine learning models.


Ryan Cotterell: A formal perspective on language modeling

Language models—especially the large ones—are all the rage. And, for what will surely be one of only a few times in history, my field, natural language processing, is the center of world attention. Indeed, there is nearly a daily stream of articles in the popular press on the most recent advances in language modeling technology. In contrast to most of these articles (and most other talks on the topic), this tutorial-style presentation is not about forward progress in the area. Instead, I am going to take a step back and ask simple questions about the nature of language modeling itself. We will start with the most basic of questions: From a mathematical perspective, what is a language model? Next, the talk will turn philosophical. With all the talk of artificial general intelligence, what can theory of computation bring to bear on the computational power of language models? The talk will conclude with a statement of several recent theorems proven by my research group, the highlight of which is that no Transformer-based language model is Turing complete and, thus, we should be careful about labeling such language models, e.g., GPT-4, as general-purpose reasoners.
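As background for the question "what is a language model?", one standard textbook formulation (offered here for orientation, not as the talk's own definition) treats a language model as a probability distribution over all finite strings on an alphabet, usually factored autoregressively with an end-of-string symbol. The symbols $\Sigma$, $\mathrm{EOS}$, and $p$ are notation chosen for this sketch.

```latex
p : \Sigma^* \to [0, 1], \qquad
\sum_{\boldsymbol{y} \in \Sigma^*} p(\boldsymbol{y}) = 1, \qquad
p(\boldsymbol{y}) = p(\mathrm{EOS} \mid \boldsymbol{y}) \prod_{t=1}^{|\boldsymbol{y}|} p(y_t \mid \boldsymbol{y}_{<t}).
```

The normalization condition is a requirement rather than a consequence of the factorization: a locally normalized model can, in principle, place some probability mass on infinite sequences, in which case the sum over finite strings falls below 1.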


Adele Goldberg: Compositionality in natural language and LLMs

Today’s LLMs interpret and produce language without using abstract rules, and close attention to the complexity of natural languages suggests this may be more of a feature than a bug. Behavioral parallels between LLMs and human language highlight the statistical aspects of both systems, raise questions about representations and mechanisms, and beckon us toward a deeper understanding of creativity and compositionality.


Anna Ivanova: Dissociating language and thought in large language models

Today’s large language models (LLMs) routinely generate coherent, grammatical and seemingly meaningful paragraphs of text. This achievement has led to speculation that LLMs have become “thinking machines”, capable of performing tasks that require reasoning and/or world knowledge. In this talk, I will introduce a distinction between formal competence (knowledge of linguistic rules and patterns) and functional competence (understanding and using language in the world). This distinction is grounded in human neuroscience, which shows that formal and functional competence recruit different brain mechanisms. I will show that the word-in-context prediction objective has allowed LLMs to essentially master formal linguistic competence; however, pretrained LLMs still lag behind at many aspects of functional linguistic competence, prompting engineers to adopt specialized fine-tuning techniques and/or couple LLMs with external modules. I will then turn to world knowledge, a capability where the formal/functional distinction is less clear-cut, and discuss our efforts to leverage both cognitive science and NLP to develop systematic ways to probe world knowledge in text-based LLMs. Overall, the formal/functional competence framework clarifies the discourse around LLMs, helps develop targeted evaluations of their capabilities, and suggests ways for developing better models of real-life language use.


Gašper Beguš: Interpretability techniques for scientific discovery

Interpretability is the new frontier in AI research. Understanding how generative models learn and how they resemble or differ from humans can not only provide insights for the study of human language, but can also facilitate discovery of novel patterns in diverse fields. For this purpose, it is essential both to introspect LLMs that test the limits of neural computation and to develop deep neural models that learn more like human infants acquiring language. In this talk, I outline a more realistic model of human language acquisition and introduce two AI interpretability techniques for the two approaches. The first technique uses our custom-built models and finds a causal relationship between individual neurons and linguistically meaningful properties. The second technique uses metalinguistics as a window into LLMs' capabilities. Using the proposed techniques, we can compare and evaluate artificial and biological neural processing of language. Additionally, I show that AI interpretability techniques can facilitate scientific discovery by uncovering previously unrecognized patterns in complex data types.


Alex Huth: Mapping and decoding language representations in human cortex

Is it possible to read out the content of human thought using recordings of brain activity? We use non-invasive functional MRI and machine learning methods based on large language models to investigate the relationship between brain activity and the content of thought. The models and modeling techniques we have developed reveal complex spatial and temporal patterns of brain activity that relate to specific categories of linguistic information as well as representational timescales. We show that this information can be read out as language, even when the stimulus evoking it is from another modality. These results point to a future of neuroscience that strongly integrates modern neural network models.



With Support from the Linguistics, Robust Intelligence, and Science of Learning and Augmented Intelligence Programs at NSF.