For anyone currently building knowledge retrieval systems, a recent study (Magesh et al., under review) evaluating proprietary legal LLM systems augmented with RAG merits attention.
The authors set out to evaluate AI-driven legal research tools from three major providers of proprietary legal software: specifically, they investigated the claim that adding Retrieval-Augmented Generation (RAG) prevents hallucinations by the LLM. In a RAG system, the capabilities of an LLM are enhanced by giving it access to a querying system over a vector database of text embeddings: passages relevant to each question are retrieved and supplied to the model alongside the prompt.
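For readers who have not built one, the pattern is simple to sketch. The following is a minimal illustration only, not the architecture of any of the tools tested: TF-IDF similarity stands in for a real embedding model and vector database, the documents are made up, and `call_llm` is a placeholder for whichever LLM API a given provider uses.

```python
# Minimal retrieve-then-generate sketch (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Case A: the court held that the statute applies to digital records.",
    "Case B: the appeal was dismissed for lack of standing.",
    "Case C: the contract clause was found to be unenforceable.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)  # the stand-in "vector database"

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    top_k = scores.argsort()[::-1][:k]
    return [documents[i] for i in top_k]

def call_llm(prompt: str) -> str:
    """Placeholder for the provider's actual LLM call."""
    raise NotImplementedError

def answer(query: str) -> str:
    # Retrieved passages are injected into the prompt so the model can
    # (in principle) ground its answer in them.
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the sources below, and cite them.\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```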
The RAG tools tested by Magesh et al. (under review) pair a current-generation LLM with retrieval over the providers' proprietary legal data. To establish the effect of RAG, responses from a general-purpose LLM (GPT-4) without access to an additional vector database of legal reference documents were also examined.
The authors explain that while previous legal benchmarking work assessed the AI’s legal knowledge or its capacity for legal reasoning, their study examines the system’s ability to identify and retrieve content relevant to a question and to use that content to provide a correct answer. To assess this, the authors devised 202 questions designed to reflect queries that are likely to arise during legal research, broadly divided into four categories (Magesh et al., under review, p. 11).
To explore the effect of RAG, responses were classified along two dimensions: correctness and groundedness. Correctness indicates whether a response is factually true. Groundedness indicates whether that response is supported by the text the system retrieved, i.e. whether a valid inference was made from a relevant source. The distinction matters because a model can return the right answer while basing it on irrelevant text or interpreting the retrieved text inaccurately; even though the answer is correct, this still counts as a hallucination. A response therefore only counts in a system’s favour if it is both correct and grounded; if the system retrieves irrelevant information or misinterprets the retrieved text, the answer is a hallucination. Using these criteria, human raters scored the models’ responses.
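To make the scoring scheme concrete, here is how I would express it in code. This is my reading of the criteria as summarised above, not the authors’ implementation:

```python
from dataclasses import dataclass

@dataclass
class RatedResponse:
    correct: bool   # human rater: is the answer factually true?
    grounded: bool  # human rater: is it a valid inference from relevant retrieved text?

def is_hallucination(r: RatedResponse) -> bool:
    # A factually correct answer still counts as a hallucination if it is not
    # grounded in relevant, correctly interpreted text.
    return not r.grounded

def is_success(r: RatedResponse) -> bool:
    # Only responses that are both correct and grounded count in a system's favour.
    return r.correct and r.grounded
```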
A refreshing aspect of this study is that the questions and test data are publicly available, and that the analyses were pre-registered in line with open science principles. This guards against selective analysis, or reporting only partial results to create a particular impression of the tools’ abilities. In addition, agreement between the human raters was examined, with results indicating good agreement (a Cohen’s kappa of 0.77 and raw inter-rater agreement of 85.4%), giving some confidence that the ratings, and the findings based on them, are stable.
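For anyone unfamiliar with the statistic: Cohen’s kappa measures agreement between two raters after correcting for the agreement you would expect by chance, so it is a stricter figure than raw percentage agreement. A quick illustration with made-up labels (not the study’s data):

```python
# Illustrative only: two raters labelling ten responses.
from sklearn.metrics import cohen_kappa_score

rater_a = ["hallucination", "correct", "correct", "incorrect", "correct",
           "hallucination", "correct", "correct", "incorrect", "correct"]
rater_b = ["hallucination", "correct", "correct", "correct", "correct",
           "hallucination", "correct", "incorrect", "incorrect", "correct"]

# Raw agreement: share of items where the two raters gave the same label.
raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Cohen's kappa corrects that figure for agreement expected by chance.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```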
So how did the different providers do? The good news is that RAG does have a positive effect: the systems with RAG hallucinated less often than the LLM alone. The bad news is that hallucinations still occur in both set-ups, at a rate of 17-33% for the proprietary systems with RAG and 43% for GPT-4 on its own, without access to an additional database.
What do some of the hallucinations look like? The paper provides interesting examples. Some look like pure fabrications. Others are more insidious, for instance citing the correct source but claiming the opposite of its actual outcome. Others again clearly misinterpret language, leading to an incorrect application of the law: in one example, the source text states that a particular situation need not be present for the law to apply, and the LLM renders this as the law applying only if that situation is the case!
The compendium of errors reported in the paper makes for fascinating reading, and I would direct interested parties to the original for the full set. While the authors do not elaborate further on the nature and causes of the errors, future research into the linguistic characteristics of the texts that triggered misinterpretation by the LLM would be illuminating, given the ongoing doubts about how much genuine understanding lies behind LLM text generation.
What can be concluded from this? Firstly, independent evaluations of proprietary systems are important and non-trivial, since many SaaS GenAI solutions do not acknowledge hallucination as an issue at all, instead asking for blind trust. Indeed, the companies building these systems may lack the incentive, and sometimes the domain knowledge, to evaluate them comprehensively: in terms of profitability, a quick launch may be preferred over a longer investigative period spent understanding a tool’s advantages and drawbacks. Secondly, a system may achieve very high performance on a particular type of tightly constrained question while failing quite spectacularly on the wider variety of documents and queries it will be exposed to in real-world applications. These limits can only be understood through thorough testing and end-user input.
Finally, these figures should give anyone working with RAG systems pause for thought. It is not only the hallucination rates: the accuracy figures themselves suggest these tools should not be relied on alone. For the time being, RAG isn’t your cure-all, and hallucinations look like they are here to stay. Instead, we need to rethink how we build: we need to be able to quantify how often hallucinations are likely to occur for a particular task, which means extensive testing before release. And systems should always quote the source on which their answer is based so that it can be verified; we simply cannot let them run by themselves.
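To make the quantification point concrete: given a labelled evaluation set of the kind Magesh et al. built, and reusing the hypothetical RatedResponse and is_hallucination helpers sketched earlier, a task-specific hallucination rate with a rough uncertainty estimate is only a few lines of code.

```python
import math

def hallucination_rate(ratings: list[RatedResponse]) -> tuple[float, float]:
    """Return (rate, 95% margin of error) over a labelled evaluation set."""
    n = len(ratings)
    p = sum(is_hallucination(r) for r in ratings) / n
    # Normal approximation to the binomial; fine for rough sizing of the risk.
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, margin
```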
This does of course mean that claims of increased or broader access to legal knowledge through LLMs are premature. Instead, it seems that to draw benefit from these systems you already need good legal knowledge. The authors of the study also note that this makes efficiency gains harder to argue for: if every statement by the LLM needs to be verified, the checking potentially counteracts the time saved by the application. At Datasparq we would argue this effect can be alleviated, or even prevented, by building in such a way that verification is instantly available to the user. Moreover, if the tool is never designed to be autonomous in the first place, but only as an aid to the professional, this trap may be avoided.
The question around efficiency makes me wonder whether it may be useful to introduce the efficacy/effectiveness distinction used in clinical research to the assessment of AI systems: efficacy trials measure whether a drug or intervention produces the desired result under ideal conditions (generally the lab), whereas effectiveness studies measure the effect in real-world settings (e.g. the clinic). In the same way, it may be necessary to first assess whether an AI tool can produce the desired behaviour, and in a second step whether that behaviour can be successfully embedded within a system that actually produces value for its users.