LLMs—the great pretenders—or how I learned to stop worrying and build safely

Large Language Models (LLMs) are hailed everywhere as game changers. They have vastly shortened development time for NLP applications, and if you’re not using them you’re losing out. We’ve seen reports of some fantastic feats by LLMs, such as GPT-4 passing the bar exam and medical examinations. This seems to pave the way for a bright new future. However, if we are to apply LLMs to sensitive areas such as law, security and medicine, it makes sense to pause for thought and to understand more about their underlying abilities.

The following text is a compendium of some of the problems we have encountered when building applications with LLMs, and of the warnings practitioners can glean from research on the linguistic and cognitive abilities of LLMs. But first:

What do we mean by underlying abilities? 

Consider for a moment a pigeon that has been trained to peck at a display of the moon on a navigator. Every time it hits the moon, it receives a grain. The navigator has been fitted with sensors which adjust the rocket’s course depending on the pigeon’s pecks, thus allowing the pigeon to ‘steer’ the rocket towards the moon. Although, viewed from the outside, there is no discernible difference between a rocket flown by an astronaut and one flown by such a pigeon, we would not be happy to say the two pilots are using the same knowledge to do so.

Similarly, Hans was a horse in late 19th century Germany who could seemingly perform complex arithmetic by tapping out responses to problems. He was paraded in front of learned societies and was a sensation until examiners wanted to know if he could still answer questions when asked by someone other than his handler. When this was tested, Hans’ abilities evaporated and his keeper was left distraught. Hans did have an amazing ability to read his handler and perform just the right behaviour for a given situation, but this did not constitute arithmetic. Instead, he was responding to subtle surface cues from his human and tailoring his own response to them. Both examples are regularly used in introductory Cognitive Science courses to illustrate the point that we cannot assume that identical behaviour in two organisms originates from the same computations. Put more snappily: we can’t equate behaviour with cognition.

So what are some of the indicators we’ve seen that reasoning in LLMs may not be quite as human-like as it appears?

Difficulties in abstracting meaning beyond the lexical items of the text

LLMs can get singularly ‘stuck’ on text. On several of our tasks, we observed that the models (mostly ChatGPT 3.5, but also 4) would respond positively as long as the same lexical items that were used in the query were also present in the text to be reviewed. It was almost as if we were using a highly powerful regex that was triggered irrespective of context and meaning. So what does research tell us about this ‘lexical fixation’?

A task that is often used to test models’ ability to interpret language is ‘Natural Language Inference’ (NLI), where we ask what a reader is able to infer from a piece of text. For example, if I tell you that ‘She ate most of the cake’, you also know that some of the cake is left. There are several benchmarks (e.g. SNLI and Multi-NLI) that test this, typically by asking: ‘given the premise (first sentence), tell me if the hypothesis (second sentence) is true (entailment), false (contradiction) or undetermined (neutral)’. Below are some examples from Multi-NLI:

Progress on these benchmarks has been impressive for LLMs, which reach human-like performance on some. However, McCoy et al. (2019) point out that the vast majority of examples in the Multi-NLI corpus could be solved by using surface strategies, i.e. paying attention to the text without extracting its meaning. They give the three examples below, where, if you simply compared the content words of the two sentences and checked whether they appeared in a similar order, you would wrongly conclude that the first entailed the second. This is exactly what the models the authors tested did.
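To make the surface strategy concrete, here is a minimal Python sketch of a lexical-overlap heuristic. It is an illustration only, not the method or data of McCoy et al. (2019): the function names, the tiny stop-word list and the sentence pairs are invented for this example, and word order is ignored for simplicity.

```python
import re

# Tiny, hypothetical stop list: just enough for the toy sentences below.
STOP_WORDS = {"the", "a", "an", "was", "by", "if", "and"}

def content_words(sentence):
    """Lowercase the sentence and return its words, minus the stop list."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

def overlap_heuristic(premise, hypothesis):
    """Predict 'entailment' whenever every hypothesis content word occurs in the premise."""
    premise_words = set(content_words(premise))
    covered = all(w in premise_words for w in content_words(hypothesis))
    return "entailment" if covered else "neutral"

# Invented pairs: in each one the premise does NOT entail the hypothesis.
pairs = [
    ("The lawyer was advised by the judge", "The lawyer advised the judge"),
    ("The nurse near the pilot laughed", "The pilot laughed"),
    ("If the chef slept, the waiter left", "The chef slept"),
]
for premise, hypothesis in pairs:
    # The heuristic answers 'entailment' for all three, i.e. it is wrong every time.
    print(overlap_heuristic(premise, hypothesis), "|", premise, "=>", hypothesis)
```

A system that scores well on a benchmark while behaving like this function has learned the benchmark, not the inference.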

Getting the model to abstract beyond the text in our prompts was not easy, and with ever more elaborate prompts, I started to question whether these were again simply harnessing Clever Hans strategies. As a result, an important principle that guides our work can be summarised as: “Do not build with the assumption that your model abstracts the same information from text as you do”. The model may outwardly succeed, but not for the same reasons a human would. You therefore need to build in safety checks and design tasks in a way that makes responses amenable to evaluation. (See example prompt below.)

Attentional gaps

We noticed that ChatGPT (3.5 and 4) seems to have attention and memory problems. When we asked it to count certain characteristics of a text, it did an excellent job of identifying them. It did a bad job of keeping count and adding up the number of occurrences, though, as if it overlooked or forgot some of them. This was in spite of us using a Chain-of-Thought (CoT) methodology. In the end, traditional text analytics had to be applied in addition, and the combination of the two did the best job.

On a different task, when it was asked to reason about the content of text, a more worrying attentional gap was displayed: it would at times ignore the submitted text and instead tell us about what it knew on the topic, without taking into account whether this was actually relevant for the text under review.

I don’t have clear research references for the first phenomenon, although the work of Melanie Mitchell and colleagues (e.g. Lewis & Mitchell, 2024) is potentially of interest. If readers know of other relevant research, feel free to reach out.

The second example is, however, a hot topic in current research: it is a known problem that LLMs have a strong tendency to prioritise their internal knowledge when asked to reason over text. As an example, Basmov et al. (2024) cite the following:

Context given to model: Elon Musk is a business magnate and investor. He is the owner and CEO of Twitter.

Question asked of model: "Who is the CEO of Twitter?"

Model’s answer: "Jack Dorsey"

The model was trained before Musk took over, and although it is explicitly told who the new owner is, it disregards this information and retrieves the answer from its own stored knowledge. Basmov and colleagues argue that to truly test a model’s reasoning abilities, it is necessary to create imagined scenarios to circumvent the danger of the information being part of the training set. On these novel examples, we can see that the models struggle to pay attention to or use the text they are asked to review to formulate an answer: 

Context given to model: If they were part of a progressive society, the Zogloxians would have fought for women’s rights.

Question asked of model: “What did the Zogloxians fight for?”

Model’s answer: “Women’s rights”

In this example the premise is counterfactual: it implies that the Zogloxians did not fight for women’s rights, yet the model answers as if they had. Failures like these inform our next principle: “Do not build systems where you can’t access the source of the model’s response”. On a given source, the model may have come to an incorrect conclusion; if you don’t have access to the original, you can’t evaluate how well you are doing. This is of course a problem for ChatGPT, as this is precisely how it functions at the moment. However, if you are using it in conjunction with, for example, Retrieval-Augmented Generation (RAG), you can surface the source and design prompts so that the model outputs its reasoning, again allowing evaluation.

Below is a synthetic extract of how this can be done via prompting:
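One possible shape for such a prompt is sketched below in Python. Everything in it is an assumption made for illustration: the template wording, the JSON keys and the helper names are hypothetical, and the two model replies are hand-written stand-ins rather than real output. The point is that asking for a verbatim quote lets you check mechanically whether an answer is anchored in the source.

```python
import json

# Hypothetical prompt template: asks for an answer, a verbatim quote and the reasoning.
PROMPT_TEMPLATE = """Answer using ONLY the source text below.
Return JSON with the keys "answer", "evidence" and "reasoning".
"evidence" must be an exact quote from the source; use "" if the source does not say.

Source: {source}
Question: {question}"""

def build_prompt(source, question):
    """Fill the template with the text under review and the question."""
    return PROMPT_TEMPLATE.format(source=source, question=question)

def check_response(source, raw_response):
    """Parse the model's JSON and verify that the quoted evidence occurs in the source."""
    data = json.loads(raw_response)
    evidence = data.get("evidence", "")
    data["evidence_found"] = bool(evidence) and evidence in source
    return data

source = "Elon Musk is a business magnate and investor. He is the owner and CEO of Twitter."
prompt = build_prompt(source, "Who is the CEO of Twitter?")

# Hand-written stand-ins for model replies (not real model output).
grounded = '{"answer": "Elon Musk", "evidence": "He is the owner and CEO of Twitter.", "reasoning": "The source names him as CEO."}'
ungrounded = '{"answer": "Jack Dorsey", "evidence": "Jack Dorsey is the CEO of Twitter.", "reasoning": "Recalled from prior knowledge."}'

print(check_response(source, grounded)["evidence_found"])
print(check_response(source, ungrounded)["evidence_found"])
```

An exact-substring check is deliberately strict. It will reject paraphrased evidence, which is usually the safer failure mode when the goal is to audit the model against the source.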

On the problem of overlooking and forgetting items to track, we found that using CoT in a manner where the model outputs all its intermediate steps did not fix the problem, but it at least allowed us to pinpoint where in the chain the error was occurring, and to decide which parts of the task the model could or couldn’t be used for.

Synthetic example of CoT output allowing localisation of errors:
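The sketch below shows one way this could look, assuming the prompt forces the model into a fixed "Step N: ... -> count K" line format. That format, the function name and the output itself are assumptions for this example; the output is hand-written, not real model output. Because every step reports both its finding and the running count, a few lines of Python can point to the step where the tally went wrong.

```python
import re

# Hand-written stand-in for a model's step-by-step output (assumed format).
cot_output = """Step 1: sentence 1 contains the term 'shall' -> count 1
Step 2: sentence 2 contains no match -> count 1
Step 3: sentence 3 contains the term 'shall' -> count 2
Step 4: sentence 4 contains the term 'shall' -> count 2
Final answer: 2"""

def locate_tally_errors(output):
    """Return the step numbers whose running count does not follow from the step's own finding."""
    errors = []
    expected = 0
    for line in output.splitlines():
        match = re.match(r"Step (\d+): .*? -> count (\d+)", line)
        if not match:
            continue  # skip lines such as 'Final answer: ...'
        step, reported = int(match.group(1)), int(match.group(2))
        if "contains the term" in line:
            expected += 1  # the step says it found an occurrence
        if reported != expected:
            errors.append(step)
            expected = reported  # resynchronise so only the faulty step is flagged
    return errors

print(locate_tally_errors(cot_output))  # -> [4]
```

Here the model found the occurrence in step 4 but failed to increment the count, so the detection part of the task can be kept while the tallying is handed to ordinary code.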

A problem with ‘No’

We have noticed that LLMs are much more likely to agree than to disagree on most questions they are asked. This can work to your advantage, resulting in high recall if you need to detect information, but often to the detriment of precision.

Conversely, they also seem to have difficulties with the concept of absence, or ‘it is not the case that’. When asked to summarise a fairly simple regulatory text which contained a lot of negations, the model produced an over-generalisation which was so broad that it was factually incorrect and turned a lot of the negations into positive sentences. 

Problems with negation were established by Ettinger (2020) as far back as BERT. Recent research has found that newer and larger models still have similar difficulties. Truong et al. (2023) report specific failings on benchmarks that test negation. NLI tasks which contain negation are problematic; in the following example, the model failed to realise that the first sentence entails the second:

Premise: They watched me constantly for weeks.

Hypothesis: They did not leave me on my own for weeks.

Label: Entailment

(Notice also how, for this example, a Clever Hans strategy would not work.)

On sentence completion tasks (the model is asked to complete the sentence with a word for the mask), models' completions seemed to overlook the negation and produce the most frequent sentence instead:

Query: Ibuprofen isn't a kind of [MASK]. 

Wrong completions: NSAID, painkiller, drug, medicine.

Finally, Dentella et al. (2023) report the same yes-response bias we observed: models were below chance at identifying ungrammatical sentences, whereas their ‘yes’ responses to grammatical sentences had good accuracy.

(Figure reproduced from Dentella et al., 2023)

From these findings and our experience, we would urge caution when your task involves the interpretation of negatives. An analysis of the linguistic constructions a task relies on may be required. Generally, it is not a good idea to devolve decision-making and complex reasoning to the AI. Instead, break tasks down into small constituents so that every step can be tested and breakdowns can be identified. Factor in the strong yes-bias when designing your prompts.
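One way to factor in the yes-bias is to measure it on a small labelled set that contains as many ‘no’ items as ‘yes’ items. The Python sketch below is a hypothetical helper, not part of any library: the function name, the metric names and the toy labels are assumptions for illustration.

```python
def yes_bias_report(gold, predicted):
    """Summarise yes/no behaviour; gold and predicted are equal-length lists of 'yes'/'no'."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == "yes" and p == "yes" for g, p in pairs)  # correct 'yes'
    fp = sum(g == "no" and p == "yes" for g, p in pairs)   # 'yes' where 'no' was right
    fn = sum(g == "yes" and p == "no" for g, p in pairs)   # missed 'yes'
    tn = sum(g == "no" and p == "no" for g, p in pairs)    # correct 'no'
    return {
        "yes_rate": (tp + fp) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy_on_no_items": tn / (tn + fp) if tn + fp else 0.0,
    }

# Toy, balanced evaluation set: half the gold answers are 'no'.
gold = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
predicted = ["yes", "yes", "yes", "yes", "yes", "yes", "yes", "no"]

report = yes_bias_report(gold, predicted)
print(report)
```

On a balanced set, a yes_rate well above 0.5 combined with perfect recall and a low accuracy_on_no_items is the signature of the bias described above: the model looks good only as long as nobody checks the negative cases.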

Unstable responses depending on input features

Probably the most troubling aspect of LLM responses is their sensitivity to input features which carry no semantic information: we observed unstable responses depending on whether questions contained words which were typed with American or UK spellings (American spelling achieved slightly better results).

Sclar et al. (2023) examined the effect of spurious formatting features which did not alter the semantic content of the question on the accuracy of response. The figure below (reproduced from their article) gives examples of the small changes in formatting and associated changes in performance, which varied as widely as 0.036 to 0.804 accuracy.

We believe GenAI projects should routinely test for the stability of results. Moreover, methods for quantifying robustness of knowledge should not remain in the domain of research, but need to start being incorporated into development work. This information is essential if we are to go beyond the hype, and use GenAI in controlled applications.
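A routine stability test can be as simple as scoring the same items under several prompt variants that mean the same thing, and reporting the gap between the best and the worst. The Python sketch below makes several assumptions: ask_model stands for whatever function calls your model, the variant templates and questions are invented, and toy_model is a deliberately brittle stand-in so that the example runs without any API access.

```python
def accuracy_spread(ask_model, variants, items):
    """Score each prompt variant on the same items and return (scores, max-min gap).

    ask_model: callable taking a prompt string and returning an answer string (assumed).
    variants:  dict mapping a variant name to a template containing '{question}'.
    items:     list of (question, gold_answer) pairs.
    """
    scores = {}
    for name, template in variants.items():
        correct = sum(
            ask_model(template.format(question=question)).strip().lower() == gold.lower()
            for question, gold in items
        )
        scores[name] = correct / len(items)
    return scores, max(scores.values()) - min(scores.values())

# Formatting variants that carry no semantic difference.
variants = {
    "colon_space": "Question: {question}\nAnswer:",
    "colon_tight": "Question:{question}\nAnswer:",
    "uppercase": "QUESTION: {question}\nANSWER:",
}

def toy_model(prompt):
    # Stand-in for a real LLM call: only 'understands' one exact format.
    return "yes" if prompt.startswith("Question: ") else "no"

# UK and US spellings of the same question, both with the gold answer 'yes'.
items = [
    ("Is the colour field mandatory?", "yes"),
    ("Is the color field mandatory?", "yes"),
]

scores, spread = accuracy_spread(toy_model, variants, items)
print(scores, spread)
```

Tracking the spread alongside the headline accuracy makes instability visible early, before a harmless-looking change to a template or to the spelling of a question reaches production.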

In conclusion, we would argue that this short review should dispel the notion that LLMs possess human-like reasoning abilities. Instead, they look more like reasoning emulators. (They clearly do compute some latent representations of text, but these are likely to be quite different in nature from humans’, even if they share some aspects.) We should take a critical approach when building applications and should not assume that LLMs have necessarily overcome the problems of older NLP techniques. We have found surprisingly few direct comparisons between the classical NLP pipeline and a pipeline using GenAI, which in itself is telling.

References

Basmov, V., Goldberg, Y., & Tsarfaty, R. (2024). LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements. arXiv preprint arXiv:2404.06283.

Dentella, V., Günther, F., & Leivada, E. (2023). Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias. Proceedings of the National Academy of Sciences, 120(51), e2309583120.

Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 34-48.

Feather, J., Leclerc, G., Mądry, A., & McDermott, J. H. (2023). Model metamers reveal divergent invariances between biological and artificial neural networks. Nature Neuroscience, 26(11), 2017-2034.

Lewis, M., & Mitchell, M. (2024). Using counterfactual tasks to evaluate the generality of analogical reasoning in large language models. arXiv preprint arXiv:2402.08955.

McCoy, R. T., Pavlick, E., & Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007.

Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2023). Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324.

Truong, T. H., Baldwin, T., Verspoor, K., & Cohn, T. (2023). Language models are not naysayers: An analysis of language models on negation benchmarks. arXiv preprint arXiv:2306.08189.
