Brain activity aligns with artificial contextual embeddings: What next?

Psycholinguistic theory has traditionally pointed to the importance of explicit symbol manipulation in humans, to explain our unique facility for symbolic forms of intelligence such as language and mathematics. A recent report utilizing intracranial recordings in a cohort of three participants argues that language areas in the human brain rely on a continuous vectorial embedding space to represent language.

In a study published recently in Nature Communications, Ariel Goldstein and colleagues recorded activity from the inferior frontal cortex while participants listened to a podcast. Focusing on word-by-word vectorial representations of single words, the authors show that contextual embeddings derived from deep language models exhibit geometric patterns similar to certain neural activity profiles in frontal cortex.
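To make concrete what comparing the “geometric patterns” of an embedding space and a neural space can look like, here is a minimal representational-similarity-style sketch. The toy data, shapes, and the choice to correlate pairwise distance profiles are my own illustrative assumptions, not the authors’ actual analysis pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Toy data: one contextual embedding and one neural response pattern per word.
# Shapes and the distance-profile comparison are illustrative assumptions only.
rng = np.random.default_rng(0)
n_words, emb_dim, n_electrodes = 200, 768, 50
embeddings = rng.standard_normal((n_words, emb_dim))    # e.g., hidden states from a deep language model
neural = rng.standard_normal((n_words, n_electrodes))   # e.g., per-word activity across electrodes

# Pairwise cosine distances summarize the "geometry" of each space.
emb_geometry = pdist(embeddings, metric="cosine")
neural_geometry = pdist(neural, metric="cosine")

# If the two spaces share geometric structure, their distance profiles correlate.
rho, p = spearmanr(emb_geometry, neural_geometry)
print(f"geometry alignment: rho={rho:.3f}, p={p:.3g}")
```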

This finding means that the inferior frontal cortex is generally sensitive to the kinds of statistics and cross-lexical similarities also picked up by deep language models. This is revealing and rather intriguing work that should force linguists to consider how to carefully migrate their models of language into genuine neural dynamics.

These important results help reveal the role of the inferior frontal cortex in linguistic prediction, and shed much-needed light on the rapid temporal dynamics of lexico-semantic processing and prediction in this central language region.

The authors are surely correct to note that there is likely a “vector-based neural code for natural language processing” unveiled by these types of results. I would like to briefly add some additional points of discussion, with the goal of empirical expansion and conceptual refinement. That lexico-semantic and conceptual spaces are organized, parsed, accessed and stored along the types of dimensions outlined by deep language models probably only gets us an approximate representation of the true richness of linguistic meaning, since there is much more to meaning (e.g., complex polysemy) than what can be read off from matrix multiplications and co-occurrence statistics.

A more fundamental qualification is that this type of research is distinct from figuring out the nature and format of intrinsic semantic features. It indicates that the inferior frontal cortex is sensitive to how similar clusters of words are in semantic space, and building on these findings will hopefully extend these types of mechanisms to subsequent levels of the linguistic hierarchy.

What exactly comes next, though? Language processing is about the use and deployment of statistical information not as an end in itself, but in order to reach an underlying state of sensitivity to structure. Statistics seeking structure, not statistics driving structure. The authors note that self-supervised deep learning can “capture subtle statistical dependencies reflecting syntactic, semantic, and pragmatic relationships among words”. This is certainly true, although the bulk of processing here relies on linearity criteria: the kinds of relations one can readily plug into MATLAB and Python. However, as many linguists have hypothesized, syntactic and semantic processing also calls upon certain biases that are independent of statistics, such as hierarchical, structure-dependent rules based on simple computational procedures such as Merge and its various formal restrictions. The algorithmic, parsing-level implementation of this architecture has been attempted via formalisms such as minimalist grammars and tree-adjoining grammars, which aim to capture the expressive power of natural language rather than the frequency profiles of lexical co-occurrences.
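For readers outside theoretical linguistics, the core structure-building operation at issue can be stated very compactly. The sketch below renders Merge as recursive, binary set formation; the toy lexicon and the absence of labels, features, and constraints are deliberate simplifications for illustration, not a serious implementation of a minimalist grammar.

```python
def merge(x, y):
    """Combine two syntactic objects into an unordered set {x, y}."""
    return frozenset([x, y])

# Lexical items are atoms; phrases arise from recursive application of merge.
dp = merge("the", "dog")      # {the, dog}
clause = merge(dp, "barked")  # {{the, dog}, barked} -- hierarchical, not linear
print(clause)
```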

Because so much of ordinary language processing places demands on semantics, but typically less so on syntax (e.g., building phrase after phrase is a more routine and constrained process than searching one’s conceptual space to assemble the semantic content called upon by listening to podcasts), neural signatures of semantic integration and lexico-semantic content have been, unsurprisingly, easier to detect than the subtle differences between when the parser needs to wrap up three open syntactic nodes compared to two nodes and then an adjunct.

The difficulty of discovering a clear neural signature of syntax has surprised and confused many linguists and cognitive neuroscientists, some of whom have quietly abandoned the task over recent years, to the point where many now believe the only hope we have for easily detectable language signatures is to use deep language models to tell us what happens when the brain processes fruit-related words compared to animal-related words.

To connect these ideas more specifically with Goldstein and colleagues’ work, the authors note that both the human brain and deep language models (DLMs) “[incorporate] prior context into the meaning of individual words, spontaneously [predict] forthcoming words, and [compute] post-word-onset prediction error signals”.

Symbolic and neuro-symbolic architectures are nevertheless still quite compatible with the fact that the brain predicts upcoming words and incorporates past data into current lexical input. If vector-based neural codes can sufficiently capture cortical dynamics for single words, this is of great interest to those aiming to carve out the full neural code for lexico-semantics. But the applicability of vector-based codes for individual words does not preclude the feasibility of symbolic and/or neuro-symbolic architectures for supra-lexical information. ‘Language’ is surely not well captured if we leave aside phrasal and sentential processing – especially given that these are virtually synonymous with linguistic forms of intelligence. Building on Goldstein and colleagues’ intriguing set of results may involve figuring out ways to reach these higher planes of thought.

To further emphasize the clear prospects for productively building on Goldstein and colleagues’ general results, it is worth noting that my own neurocomputational architecture for syntax, ROSE, invokes vector codes for intra-lexical representations. These are clearly an important component of a cognitively plausible neural architecture for global language processing – but they may not be sufficient.

Turning to the authors’ results, Goldstein and colleagues’ evidence that symbolic models are less successful comes from the use of “75 symbolic (binary) features for every word within the text”. These include “part of speech (POS) with 11 features, stop word, word shape with 16 features, types of prefixes with 19 dimensions, and types of suffixes”. The evidence here against symbolic features being relevant to cortical computation may need to be properly contextualized; e.g., with respect to prefixes and suffixes, surely the brain at some level of neural complexity cares about affixation? Other work in cognitive neuroscience has pointed to quite striking electrophysiological sensitivity to morphological details of the kind relevant here – perhaps there are multiple relevant neural codes that differ across scales of organization?
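For concreteness, a binary symbolic feature vector of this general kind can be sketched as follows. The feature inventories here are my own illustrative guesses, far smaller than (and not a reconstruction of) the authors’ 75-dimensional set.

```python
# Illustrative binary "symbolic" features for a word: POS, stop word, word shape, affixes.
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ", "PRT", "X"]
PREFIXES = ["un", "re", "in", "dis"]   # truncated for brevity
SUFFIXES = ["ing", "ed", "ly", "s"]
STOP_WORDS = {"the", "a", "of", "to", "and"}

def symbolic_features(word: str, pos: str) -> list[int]:
    feats = [int(pos == tag) for tag in POS_TAGS]                 # one-hot part of speech
    feats.append(int(word.lower() in STOP_WORDS))                 # stop word
    feats.append(int(word[0].isupper()))                          # crude word-shape feature
    feats += [int(word.lower().startswith(p)) for p in PREFIXES]  # prefix types
    feats += [int(word.lower().endswith(s)) for s in SUFFIXES]    # suffix types
    return feats

print(symbolic_features("Rethinking", "VERB"))
```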

Furthermore, animacy versus non-animacy, stubby versus smooth surfaces, concrete versus abstract (with complex polysemy in copredication permitting joint concrete-abstract meanings, amongst many other types of common structures found in natural language), and even syntactic category itself are not strictly binary in the way that some symbolic representations might suppose.

Goldstein et al.’s evaluation of their symbolic model is based on how well it predicted newly introduced words that were not part of the training data (note: the model did in fact successfully predict words, just not during zero-shot inference). While predictive processing is surely important, the function of linguistic semantics and structure-building may not rest on prediction alone (e.g., what about the role of language in inference generation, cognitive/generative model updating, consolidation of experience, endogenous planning and monitoring, aiding directed attention, reflection on personal experience and values, social and moral judgment, and so forth?).
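The logic of this kind of zero-shot test can also be sketched briefly. Everything below (the linear encoding map, the cosine-based ranking, the toy data) is an assumption for illustration and is not the authors’ actual procedure.

```python
import numpy as np

# Schematic zero-shot evaluation: fit a linear map from neural activity to word
# embeddings on training words, then see whether held-out (unseen) words can be
# identified by matching predicted embeddings to candidate embeddings.
rng = np.random.default_rng(1)
n_train, n_test, n_elec, emb_dim = 300, 50, 40, 64
X_train = rng.standard_normal((n_train, n_elec))   # neural activity, training words
Y_train = rng.standard_normal((n_train, emb_dim))  # embeddings, training words
X_test = rng.standard_normal((n_test, n_elec))     # neural activity, held-out words
Y_test = rng.standard_normal((n_test, emb_dim))    # embeddings, held-out words

W, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)  # least-squares encoding map
Y_pred = X_test @ W

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# For each held-out word, rank the true word among all held-out candidates.
ranks = []
for i, pred in enumerate(Y_pred):
    sims = np.array([cosine(pred, y) for y in Y_test])
    ranks.append(int((sims > sims[i]).sum()) + 1)
print("mean rank of the true word among candidates:", float(np.mean(ranks)))
```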

Still, the authors settle on a more careful and nuanced conclusion: their results cannot in fact be used to undermine symbolic architectures (“We are not suggesting that classical psycholinguistic grammatical notions should be disregarded”); they indicate strictly that the inferior frontal cortex is sensitive to contextual lexical information.

Taking a further step back here, what can we say of broader trends? The kind of cognitive science implicitly endorsed by most modern machine learning models may be in many ways obsolete and obtuse. Tomasello has argued that “the functional dimension”, or the vector dimension in machine learning, “enables certain kinds of abstraction processes, such as analogy, that can only be effected when the elements to be compared play similar functional (communicative) roles in larger linguistic expressions and/or constructions. [T]he adult endpoint of language acquisition comprises nothing other than a structured inventory of linguistic words and constructions”.

This claim may require some revision. For example, consider:

(1) John ate an apple

(2) John ate

(3) John is too stubborn to talk to Bill

(4) John is too stubborn to talk to

Sentence 2 means that John ate some arbitrary thing. But applying the same ‘inductive’ procedure to 3 and 4, we get a false reading on which 4 means that John is too stubborn to talk to some arbitrary person. The interpretation actually assigned to 4, however, is that John is too stubborn for anyone to talk to him – and this is the reading reflexively inferred by children without extensive training or even relevant evidence, contrary to what surface analogy with 1–3 would suggest.

For Gallistel and King, the “essential properties of good symbols [are] distinguishability, [unbounded] constructability, compactness, and efficacy” which inhere in discrete systems but not effectively in continuous systems (e.g., the real-valued variables of machine learning systems).

It is much more difficult to ground the language of Turing computability, symbolic representations, and recursive functions in neurobiology than it is the language of modern machine learning; matrix multiplication, statistics, Bayesian probabilities, and so forth are much more amenable to immediate injection into neural analyses – but they also may not help as much as many think when it comes to explaining linguistic computation.

Much brighter minds have considered similar limitations: Oxford physicist Roger Penrose even goes as far as to argue that human-style intelligence transcends all forms of computation. Yet even if we do not reach this conclusion, it still seems clear to me that any computational system that is limited by Gödelian logic cannot attain human-level intelligence. And machine learning systems are limited in this manner.

Of course, if one simply defines linguistic competence as assembling a memorized list of words and constructions, then we afford ourselves a neat escape trick: All that neurobiology of language needs to do is approximate unanalyzed data, a la modern machine learning methods.

The challenges ahead here are enormous. It is much easier to learn many diverse associative facts about the world than it is to derive a more constrained intensional algorithm for recursively constructing a limited type of symbolic representations, but the former tactic is the one that (mostly) dominates current GPT-style models used to explore the neurobiology of language. Approximating surface statistics can certainly be achieved with this method, but going beyond statistics and into the domain of structure is another matter entirely.

