Dive Brief:
- Large language models produce hallucinations between 69% and 88% of the time when queried about a legal matter, a Stanford University study finds.
- “Legal hallucinations are pervasive and disturbing,” authors of the study say. What’s more, “these models often lack self-awareness about their errors and tend to reinforce incorrect legal assumptions and beliefs.”
- The more complex the legal query, such as assessing the precedential relationship between two cases, the more likely the models are to produce a hallucination. “Most LLMs do no better than random guessing,” the report finds. “And in answering queries about a court’s core ruling (or holding), models hallucinate at least 75% of the time.”
Dive Insight:
Hallucinations have been in the news since a high-profile incident last year in which two lawyers were reprimanded for submitting a court filing that contained made-up cases. Most recently, former Donald Trump lawyer Michael Cohen submitted a filing that contained made-up citations as part of his effort to end his at-home confinement.
The Stanford study tested raw LLMs, not the purpose-built legal applications that sit on top of them. Applications built specifically for legal work can be expected to perform better than raw LLMs because they draw on more targeted data and use more refined search parameters, among other things.
The authors of the report are specialists in data science, computer engineering, law and the social sciences affiliated with Stanford’s Regulation, Evaluation and Governance Lab. They partnered with the Institute for Human-Centered AI to examine the most popular generative AI tools that use LLMs to produce natural-language responses to user queries: GPT-3.5, Llama 2 and PaLM 2.
After conducting extensive research on the tools, they conclude that use of the technology in a legal context must be closely monitored, at least in the technology’s current state.
For raw LLMs — not necessarily purpose-built legal tools that sit on top of the LLMs — users should have “significant concerns about the reliability of LLMs in legal contexts, underscoring the importance of careful, supervised integration of these AI technologies into legal practice,” they say.
Among the findings:
- Case law from lower courts is subject to more frequent hallucinations than case law from higher courts such as the Supreme Court.
- LLMs struggle with localized legal knowledge, which is often most important in lower court cases.
- The models have problems with Supreme Court rulings, too. Hallucinations are most common for the court’s oldest and newest cases, while the models perform best on later 20th-century cases. “This suggests that LLMs’ peak performance may lag several years behind current legal doctrine, and that LLMs may fail to internalize case law that is very old but still applicable and relevant law,” the report says.
The risks of using LLMs for legal research are especially high for:
- Litigants in lower courts or in less prominent jurisdictions.
- Individuals seeking detailed or complex legal information.
- Users formulating questions based on incorrect premises.
- Those uncertain about the reliability of LLM responses.
“In essence, the users who would benefit the most from legal LLMs are precisely those who the LLMs are least well-equipped to serve,” the report says.
A summary of the findings is available, as is the full report: Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models.