For anyone currently building knowledge retrieval systems, a recent study (Magesh et al., under review) [1] evaluating proprietary legal LLM systems augmented with RAG merits attention.
The authors set out to evaluate AI-driven legal research tools from three major providers of proprietary legal software: specifically, they investigated the claim that the addition of Retrieval-Augmented Generation (RAG) prevents hallucinations by the LLM. In RAG systems, the capabilities of an LLM are enhanced by giving it access to a querying system over a vector database of text embeddings [2].
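For readers less familiar with RAG, the sketch below shows the bare bones of such a pipeline: embed the query, retrieve the most similar passages from a vector store, and hand them to the LLM as context. It is purely illustrative and not the architecture of the tools under test; `embed` and `llm_complete` are hypothetical placeholders for an embedding model and an LLM call.

```python
# Minimal RAG sketch (illustrative only, not any vendor's implementation).
# `embed` and `llm_complete` are hypothetical placeholders.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a dense embedding vector for `text`."""
    raise NotImplementedError

def llm_complete(prompt: str) -> str:
    """Placeholder: call an LLM and return its completion."""
    raise NotImplementedError

def retrieve(query: str, doc_texts: list[str], doc_vectors: np.ndarray, k: int = 3) -> list[str]:
    """Return the k passages whose embeddings are closest to the query (cosine similarity)."""
    q = embed(query)
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]
    return [doc_texts[i] for i in top]

def answer(query: str, doc_texts: list[str], doc_vectors: np.ndarray) -> str:
    """Augment the prompt with retrieved passages before asking the LLM."""
    context = "\n\n".join(retrieve(query, doc_texts, doc_vectors))
    prompt = f"Answer using only the sources below.\n\nSources:\n{context}\n\nQuestion: {query}"
    return llm_complete(prompt)
```

The hope, tested by the study, is that constraining the model to retrieved sources in this way suppresses hallucination.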
The RAG tools tested by Magesh et al. (under review) consist of a current LLM supplemented with proprietary legal data. To establish the effect of RAG, responses from a general-purpose LLM (GPT-4) without access to an additional vector database of legal reference documents were also examined.
Testing on questions that have real-world equivalents
The authors explain that while previous legal benchmarking work assessed the AI’s legal knowledge or its capacity for legal reasoning, their study examines the system’s ability to determine and retrieve relevant content for a question and to use that content to provide a correct answer. To assess this, the authors devised 202 questions designed to reflect queries that are likely to arise during legal research. The questions were broadly divided into four categories:
- General legal research questions: common-law doctrine questions, holding questions, or bar exam questions.
- Jurisdiction or time-specific questions: questions about circuit splits, overturned cases, or new developments.
- False premise questions: questions where the user has a mistaken understanding of the law.
- Factual recall questions: queries about facts of cases not requiring interpretation, such as the author of an opinion, and matters of legal citation.
(Magesh et al., under review, p.11)
Teasing apart the origin of errors
To explore the effect of RAG, responses were first classified for correctness and then assessed for groundedness. Correctness indicates whether a response is factually true; groundedness indicates whether the response is a valid inference from relevant retrieved text. Note that a model can return the right answer while basing it on irrelevant text, or while interpreting the retrieved text inaccurately; even though the answer is factually true, this still counts as a hallucination. To count as correct, a response therefore needs to be both factually true and grounded: if the system retrieves irrelevant information or misinterprets the retrieved text, the answer is a hallucination. Using these criteria, human raters scored the models' responses.
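To make the rubric concrete, here is a minimal sketch of how the two dimensions combine into a label. This is my paraphrase of the scheme described in the paper, not the authors' code, and the field names are my own.

```python
# Sketch of the scoring logic described above (a paraphrase of the paper's
# rubric, not the authors' code): a response only escapes the "hallucination"
# label if it is both factually correct and grounded in relevant, correctly
# interpreted retrieved text.
from dataclasses import dataclass

@dataclass
class RatedResponse:
    factually_correct: bool   # is the stated answer true?
    relevant_retrieval: bool  # did the system retrieve pertinent sources?
    valid_inference: bool     # does the answer follow from those sources?

def label(r: RatedResponse) -> str:
    grounded = r.relevant_retrieval and r.valid_inference
    if r.factually_correct and grounded:
        return "correct and grounded"
    if r.factually_correct:
        return "hallucination (right answer, wrong or misread sources)"
    return "hallucination (incorrect)"

# Example: a true statement backed by an irrelevant citation still
# counts as a hallucination under this rubric.
print(label(RatedResponse(True, False, False)))
```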
Transparent methods and analysis
A refreshing aspect of this study is that the questions and test data are publicly available, and that the analyses were pre-registered according to open science principles [3]. This prevents selective analysis, or the reporting of only partial results, to create a particular impression of a tool's abilities. In addition, agreement between the human raters was examined, with results indicating good agreement (Cohen's kappa of 0.77 and inter-rater agreement of 85.4%) and suggesting that these findings are stable.
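For readers unfamiliar with the statistic, Cohen's kappa measures agreement between two raters after correcting for the agreement expected by chance. The toy sketch below shows the calculation on made-up labels, not the study's data.

```python
# Cohen's kappa for two raters (toy labels, not the study's data).
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a | freq_b)
    return (observed - expected) / (1 - expected)

a = ["hallucination", "correct", "correct", "incomplete", "correct"]
b = ["hallucination", "correct", "incomplete", "incomplete", "correct"]
print(round(cohens_kappa(a, b), 2))  # prints 0.69
```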
Results
So how did the different providers do? The good news is that RAG does have a positive effect: the systems with RAG produced fewer hallucinations than the LLM alone. The bad news is that hallucinations still occur in both set-ups, at a rate of 17-33% for the proprietary systems with RAG and 43% for GPT-4 on its own, without access to an additional database.
A look at some egregious examples
What do some of the hallucinations look like? The paper provides interesting examples. Some look like pure fabrications. Others are more insidious, for example citing correct sources but claiming the opposite of the actual outcome. Others still clearly misinterpret language, leading to an incorrect application of law: in one example, the source text states that a particular situation need not be present for the law to apply, and the LLM renders this as the law only applying if that situation is the case!
The compendium of errors reported in the paper is fascinating to read, and I would direct interested parties to the original paper. While the authors do not further elaborate on the nature and cause of the errors, future research on the linguistic characteristics of text that triggered misinterpretations by the LLM would be illuminating, as there are doubts about how much true understanding lies behind the text generation of LLMs.
Conclusions
The problem of skewed incentives
What can be concluded from this? Firstly, independent evaluations of proprietary systems are important and non-trivial: many SaaS GenAI solutions do not acknowledge hallucination as an issue, instead asking for blind trust. Indeed, the companies building these systems may not have the incentive or domain knowledge to evaluate them comprehensively: in terms of profitability, a quick launch may be preferred over a longer investigative period spent understanding the advantages and drawbacks of a tool. Secondly, a system may achieve very high performance on a particular type of tightly constrained question whilst failing quite spectacularly on the wider variety of documents and queries it will be exposed to in real-world applications. These limits can only be understood through thorough testing and end-user input.
Finally, these figures should give anyone working with RAG systems pause for thought. Not only the hallucination rates but the accuracy figures themselves suggest these tools should not be relied on alone. For the time being, RAG isn't a cure-all: hallucinations look like they are here to stay. Instead, we need to rethink how we build. We need to be able to quantify how often hallucinations are likely to occur for a particular task, which means extensive testing before release. Systems should always quote the source on which their answer is based to allow verification; we simply cannot let them run by themselves.
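As a sketch of what "quote the source" could look like in practice (my illustration, building on the hypothetical helpers from the RAG sketch above, not any vendor's design): return the exact passages the model saw alongside its answer, so every claim can be checked against its source.

```python
# Sketch: return the answer together with the passages it was based on,
# so the user can verify every claim against its source. `retrieve` and
# `llm_complete` are the hypothetical helpers from the earlier RAG sketch.
from dataclasses import dataclass

import numpy as np

@dataclass
class CitedAnswer:
    text: str            # the generated answer
    sources: list[str]   # the exact passages shown to the model

def answer_with_sources(query: str, doc_texts: list[str], doc_vectors: np.ndarray) -> CitedAnswer:
    passages = retrieve(query, doc_texts, doc_vectors)
    prompt = ("Answer using only the numbered sources and cite them as [n].\n\n"
              + "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
              + f"\n\nQuestion: {query}")
    return CitedAnswer(text=llm_complete(prompt), sources=passages)
```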
So how do we build responsibly?
This does of course mean that claims of increased or broader access to legal knowledge through LLMs are premature. Instead, it seems that to draw benefit from these systems you already need good legal knowledge. The authors of the study also note that this makes efficiency gains harder to argue for: since every statement of the LLM needs to be verified, this potentially cancels out the time saved by the application. At Datasparq we would argue this effect can be alleviated, or even prevented, by building in such a way that verification is instantly available to the user. Moreover, if the tool is never designed to be autonomous in the first place, but only as an aid to the professional, this trap may be avoided.
A new distinction to assess GenAI tools?
The question around efficiency makes me wonder whether it may be useful to introduce the efficacy/effectiveness distinction [4] used in clinical research to the assessment of AI systems: efficacy trials measure whether a drug or intervention produces the desired result under ideal conditions (generally the lab), whereas effectiveness studies measure the effect in real-world settings (e.g. the clinic). In the same way, it may be necessary first to assess whether an AI tool can produce the desired behaviour, and in a second step whether this behaviour can be successfully embedded within a system that actually produces value for its users.
⎯
1. Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., & Ho, D. E. (2024). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. arXiv preprint arXiv:2405.20362.
2. For an expanded explanation, see e.g. Advanced RAG on Hugging Face documentation using LangChain - Hugging Face Open-Source AI Cookbook.
3. This may seem trivial, but is in fact a big step towards transparency when considering that proprietary systems generally only publish the results of their own internal assessments without the materials or analyses which led to their figures. This situation has prompted industry groups to launch their own benchmarking initiative for GenAI tools, e.g. Industry Collaboration around Benchmarking/Tracking Gen AI Accuracy etc — Litig - Legal IT Innovators Group.
4. Gartlehner, G., Hansen, R. A., Nissman, D., Lohr, K. N., & Carey, T. S. (2010). Criteria for distinguishing effectiveness from efficacy trials in systematic reviews.