Notes on Large Language Models and Linguistic Theory

After a recent conversation with Steven Piantadosi about large language models, I wanted to comment briefly here on some of the themes in our discussion, before turning to additional critiques that we did not have time to cover.

When we discussed impossible vs. possible languages, Steven seemed to conflate individual languages with the language faculty. We can make claims about which languages we think are impossible on the basis of theoretical architecture, and the fact that we have not documented the profile of every language in human history does not mean we cannot draw sensible inferences about what the human language faculty is. At the same time, generative grammar has in fact drawn attention to smaller language families and to endangered languages. Andrew Nevins at UCL has a new book, When Minoritized Languages Change Linguistic Theory, which showcases examples from across syntax, morphology, semantics and phonology, throughout the history of generative grammar, in which minoritized languages have disrupted assumptions and forced theory modification.

With respect to what we were discussing about language and thought, I would just add briefly that within generative grammar many have maintained (as I did) that language provides a new format for thought. This is not to say that language exhausts what it means to think; it means only that language modifies the pre-existing primate conceptual apparatus in very specific ways.

AI Hype

Here is a very abridged list of what large language models have been claimed to be capable of over the last few months: they have theory of mind, they are masters of all trades, they can do domain-agnostic reasoning, they have egocentric memory and control, they have inner monologues, they can do basic arithmetic, solve computer tasks, spontaneously develop word class representations, perform statutory reasoning, act as versatile decomposers, do causal reasoning over entities and events, think like a lawyer, self-improve, execute moral judgments, self-correct, rate the meaning of various sounds, detect sarcasm, do logical reasoning, be conscious, spontaneously develop autonomous scientific research capabilities, do analogical reasoning, and generate ancient Chinese poetry. Perhaps it’s fair to say that the AI hype has gone a bit too far? Some researchers, at least, seem prone to it. Walid Saba was on the Machine Learning Street Talk podcast recently to explain why he changed his mind on LLMs and now believes they have mastered natural language syntax. He said: “I’m a scientist; I see a big result, I say wow”. But that’s not what being a scientist is; being a scientist is seeing a big result and saying ‘hang on a second’. Too much hype, not enough reflection.

Divergences

The goals of science are organized around concerns of parsimony, but the goals of the machine learning enterprise are organized around what we might call megamony: scale for its own sake.

Language model states typically carry information both about the ‘world’ and about language; information of either kind is typically useful for various tasks, so at any given moment we don’t know which information a language model is using, or what the content of its representations is, even when we know which task it is performing. This is a kind of indeterminacy problem. How can we construct a theory of language from this?

Implementation

In his paper, Steven cites Edelman as saying that neuroscientific evidence for things like traces/copies and movement remains elusive. But theoretical syntacticians are not trying to model traces and indices in order to actually find them in the brain somewhere; that is the business of psycholinguistics, which builds process models at the algorithmic level, informed by these abstract computational-level models in various intricate ways. Likewise, a FocusPhrase or a ForcePhrase is not expected to yield a meaningfully unique BOLD or ECoG response.

Theories of Language

The goal of science is theory-building. Empirical evidence is used to support, test and improve theories. Olivia Guest’s new work on theory formation proposes a metatheoretical calculus, a way of choosing between competing theories, and one of the criteria is simply metaphysical commitments, i.e. which aspects of the theory are just assumed and are not under active interrogation and investigation. A theory of visual attention will probably never entertain the idea that visual attention does not exist, but it will take certain mechanisms and phenomena for granted. In generative grammar, we have specific metaphysical commitments about the architecture and format of language, but it’s much less clear what a theory of language derived from modern language models can offer here: what are its metaphysical commitments?

Claiming that ‘Modern language models refute Chomsky’s approach to language’ is a category error. Theories are different from frameworks and programs, because many theories can live within the same program. Do MLMs refute Chomsky (2013), or Chomsky (2022), etc.? Conversely, we would never say that ‘Chomsky’s approach to language refutes modern language models’. One is a research program, the other is an engineering tool. LLMs do not prove anything about what humans do, so it’s odd to claim that they refute a whole enterprise in cognitive science.

Until the basic properties of syntax are captured by MLMs (or even the semantic properties of basic adjectives, which recent work from Liu et al. (2023) suggests are also currently out of reach), it’s premature to say that they refute Chomsky’s approach to language. Infants learn syntax together with semantics, and this semantics updates generative models of the world. If we want to open up the black box of these models, some have argued for probing classifiers, or for looking at tensor products or tree projections, and so on; these methods might get you somewhere, but I don’t see how they can replace linguistic theory.
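To make concrete what a probing classifier is, here is a minimal sketch in Python; the hidden states and labels below are randomly generated placeholders rather than outputs of any actual model, and the word-class labels are purely illustrative.

```python
# Minimal sketch of a linear probe: train a simple classifier to predict a
# linguistic label (here, a hypothetical word-class tag) from a model's
# hidden states. All data below is randomly generated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend hidden states: 1000 tokens x 768 dimensions, with made-up labels.
hidden_states = rng.normal(size=(1000, 768))
word_class = rng.integers(0, 5, size=1000)  # 5 hypothetical word classes

# Fit the probe on one half of the data, evaluate on the other half.
probe = LogisticRegression(max_iter=1000)
probe.fit(hidden_states[:500], word_class[:500])
accuracy = probe.score(hidden_states[500:], word_class[500:])
print(f"probe accuracy: {accuracy:.2f}")  # ~chance here, since the data is random
```

Even when a probe like this succeeds on real model states, it shows only that some information is linearly decodable, not what role that information plays in the system, which is the gap the paragraph above is pointing at.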

I also don’t see how higher attention scores in transformer models at the point of multi-head attention replace or inform conceptual role theory. Likewise, can neural networks perform symbolic manipulation? If so, can locality-sensitive hashing tables perform symbolic manipulation (and if not, why not)? All of the enactive work for ChatGPT is done by humans. The generative AI is not acting, and it is generating content, not beliefs. It works in data space; its purpose is not to ‘understand’.

Principles of Language

MLMs give us no concrete models of language and no clear principles. Linguistic theory is really unique in this respect; theories of vision come closest with respect to principles of computation. For example, phase theory in generative syntax, or something like the classical Freezing Principle, assumes, roughly, that material inside a constructed phrase becomes inaccessible to further manipulation once some kind of raising has applied to it, such that no more material can be extracted out of it.

               a. I think that John never reads [reviews of his books]

               b. Whose books_i do you think that John never reads [reviews of t_i]?

               c. I think that [reviews of his books]_i John never reads t_i

               d. *Whose books_i do you think that [reviews of t_i]_j John never reads t_j?

In some generative circles, this has recently been given a more ‘cognitive’ rather than formal treatment, for example a processing-based explanation; another account attributes the Freezing Principle to prosodic factors. But the point to focus on here is that the tools of linguistic theory allow us to negotiate the locus of these kinds of effects. How do MLMs improve on this, or provide novel insight here?
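As a toy formalization (mine, not any particular theory’s definitions), the intuition behind freezing can be stated as a simple constraint over trees: once a constituent has itself been displaced, its contents are marked frozen and nothing further can be extracted out of it.

```python
# A toy formalization of the freezing idea (not any specific theory's
# definitions): once a constituent has itself been moved, its contents
# count as frozen and further extraction out of it is blocked.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    moved: bool = False   # has this constituent been displaced?

def can_extract(target: Node, root: Node, inside_moved: bool = False) -> bool:
    """Return True if `target` can be extracted out of `root`."""
    if root is target:
        return not inside_moved        # blocked if any dominating node has moved
    return any(can_extract(target, child, inside_moved or root.moved)
               for child in root.children)

# "reviews of whose books": extracting "whose books" is fine in situ...
whose_books = Node("whose books")
reviews = Node("reviews of", [whose_books])
clause = Node("clause", [reviews])
print(can_extract(whose_books, clause))   # True

# ...but once "reviews of t" has itself been fronted (moved), it freezes.
reviews.moved = True
print(can_extract(whose_books, clause))   # False, mirroring example (d) above
```

The point is not that this little function is the theory; it is only that the principle is crisp enough to be stated in a few lines, which is exactly the kind of clarity one would want MLMs to improve upon.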

Linguistic theory offers a clear principle of language here, and there’s nothing else quite like it in cognitive science. David Marr was asked shortly before he died whether there was anything like this principle in vision, whereby ‘when a structure undergoes a non-structure-preserving transformation the interior of the structure can’t be analyzed further’. Marr said that you couldn’t tell, because in his view all transformations in vision were linear, i.e. structure-preserving.

Models and Architectures

A lot of recent research working with transformer models will say things like ‘this particular point of the architecture (i.e., everything from tokens, to embeddings, to positional embeddings, to multi-head attention, to the modified vectors) looks a lot like binding’, or ‘looks like filler-role independence’, or ‘looks like Merge’. But a lot of things can look like binding, and a lot of things look like reading tea leaves too, so how can we draw a more principled connection to the stuff of language without falling prey to redescription of linguistic theory rather than re-explanation? Raphael Milliere, whose work negotiates beautifully between linguistic theory and machine learning, thinks that transformer models can implement a kind of non-classical constituent structure (i.e. something that isn’t straightforward concatenation), and that they also have a kind of ‘fuzzy’ variable binding, not strict algebraic variable binding but graded, probabilistic bindings of fillers to roles. Raphael also thinks we see ‘shades’ of role-filler independence via overlapping subspaces during the multi-head attention phase. At least for now, it seems that these artificial systems might be building structured representations, but so far it is all sub-symbolic.
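For readers who want the mechanics behind the phrase ‘overlapping subspaces’, here is a bare-bones numpy sketch of multi-head attention; all shapes and values are illustrative. Each head projects the same token vectors into its own lower-dimensional subspace before mixing them, and it is in these head-specific (and potentially overlapping) subspaces that the claims about fuzzy binding and role-filler independence are being made.

```python
# A bare-bones sketch of multi-head attention in numpy, just to make concrete
# what "subspaces" refers to: each head projects the same token vectors into
# its own lower-dimensional space before mixing them. Shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 32, 4
d_head = d_model // n_heads

x = rng.normal(size=(seq_len, d_model))            # token representations

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

outputs = []
for _ in range(n_heads):
    # Each head has its own projections, i.e. its own subspace of d_model.
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(Q @ K.T / np.sqrt(d_head))    # attention weights per head
    outputs.append(scores @ V)                     # head output in its subspace

combined = np.concatenate(outputs, axis=-1)        # back to (seq_len, d_model)
print(combined.shape)
```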

Some recent work by Shikhar Murty, Christopher Manning and colleagues has argued that transformers can learn to become tree-like when trained on language data; they looked at a set of sequence transduction tasks. But even Murty and colleagues, after showing possible tree-like computations, conclude that “our results suggest that making further progress on human-like compositional generalization might require inductive biases that encourage the emergence of latent tree-like structure.” And a lot of similar claims about tree-structured computation in transformers and language models are based on ‘extrinsic’ assessments of performance rather than intrinsic assessments that directly estimate how well a parametric tree-structured computation approximates the model’s computation.

Other recent work that takes the insights of syntactic theory seriously shows that linguistic theory helps with model scalability rather than hindering it: Sartran and colleagues add inductive biases (syntactic priors) to their transformers and show substantial improvements.

Language, Thought and Communication

Anna Ivanova said a few weeks ago, at a talk she gave at NYU during a symposium on deep learning, that there’s a fallacy whereby some people assume that if a system is bad at thought, then it must be bad at language. She gave Chomsky as an example, quoting him saying ‘what do LLMs tell us about language? Zero’. But there is a problem here: Chomsky wasn’t marshalling evidence that LLMs are ‘bad at thought’ in order to argue that they tell us nothing about language; in fact, I don’t think he ever gave examples of ‘bad thought’. His reason for saying that LLMs tell us nothing about language was an architectural point about their ability to learn impossible languages.

In other recent work, Ivanova argues that the language network needs to interact with other brain networks in different regions, like social cognition in lateral parietal regions, situation model construction in medial parietal regions, world knowledge in highly distributed cortices, general cognitive tasks in middle frontal cortex and superior parietal cortex, and semantic processes in various frontotemporal sites. But all of this is highly compatible with the minimalist architecture of a core language system interfacing with extra-linguistic systems. It’s also compatible with certain interpretations of the concept of autonomy of syntax.

Chomsky has never claimed that combinatorial rules are blind to the content/meaning of elements; we have selectional requirements, we have intricate relations between feature-checking operations in minimalist syntax that directly determine what kind of Merge operation you can execute, and when you can execute it. All autonomy of syntax means is that there are syntactic mechanisms that aren’t semantic, not that semantics is irrelevant. And sure enough, there are some syntactic mechanisms that are not semantic.

Ivanova, Piantadosi and many others commonly cite work reporting aphasic patients who show no deficits in complex reasoning, and use this to undermine generative grammar. But this is exactly what we expect under a non-lexicalist framework of generative syntax: meaning, syntax and ‘form’ are separate systems, with separate representations in long-term memory, and ‘the lexicon’ is not a thing but a process of combining these three feature types. Other, more complex objects can of course be stored in lexical memory for efficient retrieval, depending on the language, and indeed the person. So impairments in syntactic features and syntactic structure do not lead to the prediction that conceptual features will be impaired.

Cognitive Plausibility

GPT-4 and its predecessors don’t have long-term memory; they don’t build up a sense of understanding or of self. They do, however, have a working memory of thousands of items, whereas much research in generative syntax these days concerns the interfaces mapping structure to distinct workspaces, the order in which certain portions of structure are interpreted, and the important role that memory constraints play in this. Even Christiansen and Chater, no supporters of generative grammar, suggest that humans face a now-or-never bottleneck whereby memory limitations impose radical constraints on grammaticality and grammaticalization. These memory constraints and considerations are not part of the discourse on MLMs.

So it may be, then, that language models need to be actively impaired and disrupted in some way to more accurately model human performance, as Andrew Lampinen at DeepMind has suggested.

LLMs will very likely form part of some ultimate AGI system, if AGI can even be achieved. LLMs don’t seem likely to radically innovate beyond their training data, nor do they seem capable of carrying out long and subtle chains of reasoning. But they can inform and interface with other artificial systems that could do these things (e.g., the Wolfram Alpha plugin for ChatGPT), building a modular architecture not unlike the one the generative framework posits for the human mind.

Robert Long recently gave a talk in which he listed all the ways in which cognitive science has contributed to AI progress with LLMs, and the list was empty. That is perhaps debatable, but even so, this kind of result just reinforces the disconnect that people like Chomsky have been trying to highlight. These are two separate fields of study. Even something like attention in transformer models has nothing to do with human-like attention; it’s a post-hoc metaphor. And convolutions come from very behaviorist, black-box kinds of frameworks. Everywhere you look you see divergence between these fields.

The Syntax of Screenshots

One example Steven uses in his paper, and in some other places, comes from Carnie’s 2002 syntax textbook. In the textbook (note: not a monograph or polemical piece), Carnie says: Premise 1, syntax is a productive, recursive and infinite system; Premise 2, rule-governed infinite systems are unacquirable; Conclusion, syntax is therefore an unacquirable system. Since we nevertheless have such a system, it follows that at least parts of syntax are innate.

Steven screenshots this section of text. But then in the immediately following paragraph, Carnie says: ‘There are parts of this argument that are very controversial. In the challenge problem sets at the end of this chapter you are invited to think very critically about the form of this proof. Problem set X considers the possibility that premise 1 is false (but hopefully you will conclude that despite the argument given in the problem set the idea that language is productive and infinite is correct). Premise 2 is more dubious, and is the topic of problem set Y. You are invited to be skeptical and critical of these premises when you do the problem set.’

Wolfram Beta

Stephen Wolfram’s Rule 30 showed beautifully that a simple rule can lead to computational complexity. Something similar arises with Chomsky’s minimalist program: you have two operations, internal Merge and external Merge (unified in various ways in recent work), and from the interfaces with different external systems, each with its own domain-specific conditions on interpretation and externalization, you get the computational complexity of human language.
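For the uninitiated, Rule 30 is easy to state and to run; here is a short sketch (the grid width and number of steps are arbitrary). Each cell’s next value depends only on the cell and its two neighbours, yet the resulting pattern looks irreducibly complex.

```python
# Rule 30 in a few lines: a one-dimensional cellular automaton whose next
# cell value depends only on the cell and its two neighbours, yet whose
# evolution looks irreducibly complex.
RULE = 30
rule_bits = [(RULE >> i) & 1 for i in range(8)]   # lookup table for the 8 patterns

width, steps = 31, 15
row = [0] * width
row[width // 2] = 1                               # single live cell in the middle

for _ in range(steps):
    print("".join("#" if c else "." for c in row))
    row = [rule_bits[(row[i - 1] << 2) | (row[i] << 1) | row[(i + 1) % width]]
           for i in range(width)]                 # wrap-around boundary
```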

One might think that a potential sympathy is developing here. However, Stephen Wolfram said in an essay he wrote in February about ChatGPT that “my strong suspicion is that the success of ChatGPT implicitly reveals an important “scientific” fact: that there’s actually a lot more structure and simplicity to meaningful human language than we ever knew—and that in the end there may be even fairly simple rules that describe how such language can be put together.” Wolfram’s argument here is very familiar to linguists.

Wolfram says: “ChatGPT provides perhaps the best impetus we’ve had in two thousand years to understand better just what the fundamental character and principles might be of that central feature of the human condition that is human language and the processes of thinking behind it.” But the fundamental character and principles of human language are not obscure. Wolfram skips over the entire tradition of modern post-war linguistics and everything that has been discovered since the 1950s.

Still, even Wolfram says that we may never actually be able to figure out what ChatGPT is doing except by tracing each step, because it might be computationally irreducible: “it’s not clear that there’s a way to summarize what it’s doing in terms of a clear narrative description”.

Wolfram only gives one example of a rule that could be learned by ChatGPT, and that concerns basic logic: “Thus, for example, it’s reasonable to say “All X are Y. This is not Y, so it’s not an X”. And just as one can somewhat whimsically imagine that Aristotle discovered syllogistic logic by going through lots of examples of rhetoric, so too one can imagine that in the training of ChatGPT it will have been able to “discover syllogistic logic” by looking at lots of text on the web, etc.”

Yet the difference here is that Aristotle explicitly discovered syllogistic logic and rhetorically described some of its apparent properties, whereas even the slave boys of ancient Greece had the competence for syllogistic reasoning; they just didn’t formalize it or give it a name. It’s very different for ChatGPT: ChatGPT did not know syllogistic reasoning ab initio, and while it may look as though it is executing some basic logical operations, a few difficult probe questions later it looks like it can’t do it after all.

Chomsky Hierarchy

A recent paper called ‘Neural networks and the Chomsky hierarchy’ from Deletang and colleagues shows that “grouping tasks for NNs according to the Chomsky hierarchy allows us to forecast whether certain architectures will be able to generalize to out-of-distribution inputs. Our results show that, for our subset of tasks, RNNs and Transformers fail to generalize on non-regular tasks, LSTMs can solve regular and counter-language tasks, and only networks augmented with structured memory (such as a stack or memory tape) can successfully generalize on context-free and context-sensitive tasks.”

Deletang and colleagues discuss how RNNs are not Turing-complete; they lie lower on the Chomsky hierarchy. Previous work has shown that RNNs and LSTMs are capable of learning simple context-sensitive languages, but only in a very limited way, i.e. they generalize only to lengths close to those seen during training (Bodén & Wiles, 2000, 2002; Gers & Schmidhuber, 2001). Transformers are capable of learning complex and highly structured generalization patterns, but they cannot overcome the limitation of not having an extendable memory. This might imply hard limits for scaling laws (Kaplan et al., 2020), because even significantly increasing the amount of training data and the size of a Transformer is insufficient for it to climb the Chomsky hierarchy.
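To make the kind of test at issue concrete, here is a sketch of an out-of-distribution length split for the context-sensitive language aⁿbⁿcⁿ; the length cutoffs below are illustrative and are not Deletang and colleagues’ actual setup.

```python
# A sketch of the kind of length-generalization test at issue: train on short
# strings of a context-sensitive language (here a^n b^n c^n) and evaluate on
# strictly longer ones. Cutoffs are illustrative, not Deletang et al.'s.
def sample(n: int) -> str:
    return "a" * n + "b" * n + "c" * n

def is_member(s: str) -> bool:
    n = len(s) // 3
    return len(s) % 3 == 0 and s == sample(n)

train_lengths = range(1, 11)     # lengths seen during training
test_lengths = range(11, 31)     # strictly out-of-distribution lengths

train_set = [sample(n) for n in train_lengths]
ood_test_set = [sample(n) for n in test_lengths]

print(all(is_member(s) for s in train_set), all(is_member(s) for s in ood_test_set))
```

A model that merely interpolates over the training lengths will pass the in-distribution check and fail on the longer strings; climbing the hierarchy means generalizing to those as well.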

But at the same time, the Chomsky hierarchy is irrelevant to Merge-based systems, and in particular to their evolution. Merge-based systems operate over structures, not strings. The Chomsky hierarchy adapts to language Post’s general theory of computability, which is based on “rewriting systems”: rules that replace linear strings of symbols with new linear strings. All of the “formal languages” generated at the various levels of this hierarchy involve linear order. Merge-based systems, by contrast, have (hierarchical) structure but no linear order; the absence of linear order is an essential property of the binary sets formed by Merge. Merge-based systems do not even appear in the Chomsky hierarchy, and anything concluded from the study of the Chomsky hierarchy is irrelevant to the evolution of Merge-based systems.
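The point about the absence of linear order is easy to render as a toy example: if Merge simply forms a two-membered set, then the objects it builds are nested but unordered. (The use of Python frozensets here is just an illustration of the set-theoretic point, not a model of syntax.)

```python
# A toy rendering of the point: Merge forms a two-membered set, so the object
# it builds has hierarchy (nesting) but no linear order.
def merge(x, y):
    return frozenset({x, y})

# {read, {the, book}}: nested structure, but no left-to-right order.
obj1 = merge("read", merge("the", "book"))
obj2 = merge(merge("book", "the"), "read")   # same sets, built "in reverse"
print(obj1 == obj2)                          # True: order plays no role
```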

Conclusion

Whatever the technological innovations of ChatGPT are, it doesn’t seem to be telling us anything about human language. This seems self-evident for at least two reasons: (1) no child is exposed to the kind of data that ChatGPT is trained on; (2) there are impossible grammatical rules, e.g. mirror-image rules, which ChatGPT could easily acquire but humans never do.
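For concreteness, here is a toy rendering of a mirror-image rule of the kind at issue: an operation defined purely over linear order, trivially learnable by a string-based system, but not the kind of structure-dependent rule attested in human grammars. The example sentence is just illustrative.

```python
# A toy "mirror image" rule: form one construction from another by reversing
# the linear order of the words. Trivial for a string-based learner, but not
# the kind of rule found in human grammars, which are structure-dependent
# rather than linear-order-dependent.
def mirror_rule(sentence: str) -> str:
    return " ".join(reversed(sentence.split()))

print(mirror_rule("the boy who is tall is happy"))
# -> "happy is tall is who boy the"
```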

Chomsky’s main point is that distributional statistics alone will not capture language – and that remains to be refuted. The difference between humans and non-human primates is not simply a matter of scale – there must be some kind of fundamental algorithmic difference to get you human-like compositionality. And that’s the essence of Chomsky’s work, and the generative enterprise.

Even so, it is not impossible that the traditional methods of linguistics may have been exhausted for now, and insights may emerge from other areas, such as research into the neural representation of language.

An ancient alchemical dictum, In Sterquilinis Invenitur, translates as “in filth it will be found”. One reading of this is “what you are searching for the most will be found in the place you least want to look”. This forms the thematic basis of a number of ancient myths, but it may also be that both sides of the current debate Steven and I are engaged in, modern language model research and traditional linguistic theory, could mutually benefit from searching in the places we least want to look, for answers that are, in the end, continuing to elude both sides.
