Will AI Hallucinate Case Law? How to Use Legal AI Safely
Lawyers were sanctioned $5,000 when ChatGPT invented fake cases. Here is what AI hallucination means for your firm and how to use legal AI safely.
In 2023, a federal judge in the Southern District of New York published an opinion that became required reading for anyone thinking about AI in legal practice. The attorneys representing Roberto Mata in Mata v. Avianca had used ChatGPT to research a filing opposing a motion to dismiss. The model produced six cases to support their arguments. None of those cases existed. The judge fined the two attorneys and their firm $5,000, having written earlier in the proceedings that the court was presented with an unprecedented circumstance.
That opinion settled one question definitively: AI can and does hallucinate legal citations, and the consequences fall on the attorney, not the software company.
The question worth spending time on now is not whether AI hallucination is real, but how it works, which tools manage it better than others, and what your firm needs to do to use AI for research without ending up in a sanctions opinion.
What "Hallucination" Actually Means
The word is imprecise, but the mechanism is specific. General-purpose large language models, such as GPT-4 and the other models behind ChatGPT, are trained by ingesting enormous amounts of text and learning statistical patterns across that text. They generate responses by predicting what text is likely to follow a given prompt. That works well for summarizing, drafting, and many other tasks. For legal citations, it breaks.
When a model has no exact citation in its training data to draw on, it does not say "I do not know." It generates what a citation would plausibly look like: a realistic-sounding case name, a plausible court, a credible year, a quote that fits the argument. Every element is statistically coherent. None of it is real.
This is not a bug in the sense of something developers missed. It is a structural property of how these systems work. Models that are not explicitly designed to suppress hallucination, or that are not grounded in an authoritative source at query time, will produce fabricated citations in a significant share of legal research tasks.
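The mechanism described above can be illustrated with a deliberately simplified sketch. It is a toy, not how a real language model is built: real models predict text token by token rather than filling named slots, and every name, citation, and list below is invented for illustration. What it shows is the structural point: nothing in the generation step checks whether the assembled citation exists.

```python
import random

# Hypothetical "training data". Every entry is invented for illustration only.
KNOWN_CITATIONS = {
    "Doe v. Example Air, 100 F.3d 10 (2d Cir. 1999)",
    "Roe v. Sample Lines, 200 F.3d 20 (9th Cir. 2005)",
    "Poe v. Placeholder Jet, 300 F.3d 30 (5th Cir. 2011)",
}

# The "patterns" absorbed from that data: which pieces tend to fill each slot.
PLAINTIFFS = ["Doe", "Roe", "Poe"]
DEFENDANTS = ["Example Air", "Sample Lines", "Placeholder Jet"]
VOLUMES = [100, 200, 300]
PAGES = [10, 20, 30]
COURTS = ["2d Cir.", "9th Cir.", "5th Cir."]
YEARS = [1999, 2005, 2011]


def generate_citation(rng):
    """Fill each slot with a plausible piece. Nothing checks that the result exists."""
    return (
        f"{rng.choice(PLAINTIFFS)} v. {rng.choice(DEFENDANTS)}, "
        f"{rng.choice(VOLUMES)} F.3d {rng.choice(PAGES)} "
        f"({rng.choice(COURTS)} {rng.choice(YEARS)})"
    )


def sample(n, seed=0):
    """Generate n citations from a seeded generator so runs are repeatable."""
    rng = random.Random(seed)
    return [generate_citation(rng) for _ in range(n)]


if __name__ == "__main__":
    for citation in sample(5):
        status = "in training data" if citation in KNOWN_CITATIONS else "fabricated"
        print(f"{citation}  [{status}]")
```

Every output is well formed, and most combinations never appeared in the toy training set. The format is learned; the existence of the case is never consulted.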
What the Research Shows
A study published in the Journal of Empirical Legal Studies tested the major AI-assisted legal research tools specifically on their hallucination rates. The results matter for any firm considering these tools.
Among the tools examined, Lexis+ AI had a hallucination rate of approximately 17 percent and answered around 65 percent of queries accurately. Westlaw AI-Assisted Research had a higher hallucination rate of around 33 percent and an accuracy rate of 42 percent. GPT-4, tested without any legal grounding layer, hallucinated on roughly 43 percent of queries.
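To put those per-query rates in working terms, a back-of-the-envelope calculation helps. The sketch below assumes, purely for illustration, that each query is independent and carries the same hallucination rate; the study does not make that claim, and real research sessions will differ. Under that assumption, a 17 percent per-query rate implies roughly an 84 percent chance of at least one hallucinated response across ten queries.

```python
def chance_of_at_least_one(rate, queries):
    """Probability that at least one of `queries` responses is hallucinated.

    Assumes every query is independent with the same per-query rate,
    which is a simplification made for illustration.
    """
    return 1 - (1 - rate) ** queries


# Approximate per-query hallucination rates as reported in the text above.
RATES = {
    "Lexis+ AI": 0.17,
    "Westlaw AI-Assisted Research": 0.33,
    "GPT-4": 0.43,
}

if __name__ == "__main__":
    for tool, rate in RATES.items():
        p = chance_of_at_least_one(rate, queries=10)
        print(f"{tool}: {p:.0%} chance of at least one hallucination in 10 queries")
```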
The study also surfaced a more troubling category: misgrounded responses. A misgrounded response cites a real case, but that case does not actually support the claim the AI attributed to it. In some ways this is more dangerous than an obviously fabricated citation, because it passes a quick existence check. An attorney verifying that a case exists might not read carefully enough to notice that the case does not say what the brief claims it says.
The takeaway is not that you should avoid all AI legal research tools. It is that the tools vary considerably in how well they manage this problem, and that even the best ones require human verification for anything consequential.
The Difference Grounded AI Makes
The term grounded AI describes systems that do not rely solely on the model's learned patterns. Instead, they retrieve actual documents at query time and anchor their responses to what those documents say.
The underlying architecture is called Retrieval-Augmented Generation, or RAG. Here is how it works in plain terms. When you ask a RAG-powered legal research tool about a case or a doctrine, the system does not just ask the language model what it remembers. It first queries a database of real legal documents, pulls the relevant passages, and feeds those passages to the model along with your question. The model then generates a response grounded in the actual retrieved text and cites back to those documents.
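The practical result is that the model is far less free to invent a citation, because its answer is built from documents that actually exist in the database. Retrieval does not guarantee that the model describes those documents correctly, so a real case can still be cited for a point it does not support. The quality ceiling is now set by what is in that database and how faithfully the model uses it, not by what the model happens to have learned.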
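The retrieve-then-generate flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's implementation: the three documents are placeholder passages, retrieval is plain keyword overlap rather than a production search index, and `generate` is a hypothetical stand-in for whatever model call a real system would make.

```python
import re

# Hypothetical in-memory "database" of placeholder passages.
DOCUMENTS = {
    "DOC-001": "Placeholder passage discussing carrier liability for passenger injury.",
    "DOC-002": "Placeholder passage discussing tolling of a limitations period.",
    "DOC-003": "Placeholder passage discussing the standard for a motion to dismiss.",
}

STOPWORDS = {"a", "an", "the", "is", "of", "for", "to", "when", "what", "how", "does"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def retrieve(question, k=2, min_overlap=2):
    """Rank documents by shared keywords and keep up to k that clear the threshold."""
    q = tokenize(question)
    scored = [(len(q & tokenize(text)), doc_id) for doc_id, text in DOCUMENTS.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score >= min_overlap]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def build_prompt(question, doc_ids):
    """Put the retrieved passages in front of the model along with the question."""
    passages = "\n".join(f"[{d}] {DOCUMENTS[d]}" for d in doc_ids)
    return (
        "Answer using ONLY the passages below and cite their IDs.\n"
        f"{passages}\n\nQuestion: {question}"
    )


def answer(question, generate):
    """Retrieve first; if nothing relevant is found, decline instead of guessing."""
    doc_ids = retrieve(question)
    if not doc_ids:
        return {"text": "No supporting source found.", "sources": []}
    return {"text": generate(build_prompt(question, doc_ids)), "sources": doc_ids}


if __name__ == "__main__":
    stub_model = lambda prompt: "(model output would go here)"
    print(answer("When is a limitations period subject to tolling?", stub_model))
    print(answer("What is the weather tomorrow?", stub_model))
```

The design choice worth noticing is the early return: when retrieval finds nothing, the model is never asked, so there is nothing for it to invent.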
Tools like CoCounsel, built by Casetext and now part of Thomson Reuters, use this approach. When CoCounsel cites a case, it is citing a document from Westlaw's proprietary legal database. Harvey, which serves larger firms, uses a similar RAG architecture with citation verification designed to flag any output it cannot trace to a source. The result is meaningfully better than asking a general-purpose model a legal research question, though even these systems require attorney review.
The distinction worth keeping in mind: a tool that says "here is an answer with citations drawn from our verified database" behaves very differently from a tool that says "here is an answer from a model trained on general web text." Both might give you a citation. Only one of them has any real accountability to that citation.
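One way to picture that accountability is a check that traces every citation in an answer back to a document the system actually retrieved. The sketch below is an assumption-laden illustration, not a description of how any named product works: the bracketed `[DOC-…]` identifier format and the function name are hypothetical.

```python
import re

# Hypothetical citation format: bracketed document IDs such as [DOC-002].
CITATION_PATTERN = re.compile(r"\[(DOC-\d+)\]")


def trace_citations(answer_text, retrieved_ids):
    """Split the citations found in an answer into traced and untraced lists.

    A citation is "traced" only if its ID is among the documents that were
    actually retrieved for this answer.
    """
    cited = CITATION_PATTERN.findall(answer_text)
    retrieved = set(retrieved_ids)
    traced = [c for c in cited if c in retrieved]
    untraced = [c for c in cited if c not in retrieved]
    return traced, untraced


if __name__ == "__main__":
    text = "Tolling is discussed in [DOC-002] and [DOC-417]."
    traced, untraced = trace_citations(text, ["DOC-002"])
    print("traced:", traced)
    print("flag for review:", untraced)
```

A check like this only confirms that a cited document was retrieved. It says nothing about whether the document supports the proposition it is cited for, which is why a misgrounded citation still needs a human reader.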
What to Ask Before You Use Any Legal AI Tool
Before your firm commits to an AI research tool, these questions narrow the field quickly.
Where do the citations come from? If the vendor cannot tell you exactly which database their citations are drawn from, or if the answer is "the model learned from a large corpus of legal text," treat that as a high-risk answer. The grounded tools can be specific: Westlaw, Lexis, the firm's own document library.
What happens when the model is uncertain? Some tools are designed to refuse or flag rather than guess. Ask what the tool does when it cannot find a real supporting citation. A tool that says "I could not find relevant case law on that question" is far more useful than one that generates a plausible-sounding citation it cannot verify.
Can you see the source text, not just the citation? Good tools surface the actual passage from the underlying document alongside the citation. This lets you read the source in context rather than trusting the AI's characterization of it.
What are the data handling terms? Before putting any client matter into an AI tool, you need to know whether the vendor trains their models on your queries and documents, and what their data retention and access policies are. This is not optional under ABA Model Rule 1.6 on confidentiality. Most enterprise legal AI tools offer explicit contractual terms on data isolation and prohibit training on customer data. Verify this in writing before proceeding.
What ABA Formal Opinion 512 Requires
The American Bar Association issued Formal Opinion 512 on July 29, 2024, its first substantive ethics guidance on generative AI in legal practice. The opinion does not ban AI use, but it sets out what lawyers must do to use it ethically.
The opinion identifies six areas of ethical concern: competence under Model Rule 1.1, communication with clients under Model Rule 1.4, confidentiality under Model Rule 1.6, candor toward the tribunal under Model Rules 3.1 and 3.3, supervisory obligations under Model Rules 5.1 and 5.3, and fee practices under Model Rule 1.5.
On competence, the opinion states that lawyers do not need to become AI engineers, but they do need to understand the capabilities and limitations of any AI tool they use. Submitting a brief containing AI-generated citations without independent verification fails the competence standard under Rule 1.1.
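On candor toward the tribunal, the obligation is straightforward. Under Rule 3.3, a lawyer may not knowingly make a false statement of law to a court and must correct one once it comes to light. A fabricated citation is a false statement of law. Not knowing it was fabricated does not make the filing safe: the duty of competence, and in federal court the reasonable-inquiry requirement of Rule 11 of the Federal Rules of Civil Procedure, require checking authorities before they are filed. Ignorance of the model's limitations is not a defense. This is exactly why the Mata v. Avianca sanctions fell on the attorneys, not on OpenAI.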
On confidentiality, Opinion 512 requires lawyers to evaluate whether a vendor's data practices are consistent with their obligations to clients. The opinion indicates that lawyers should conduct reasonable due diligence before using any AI tool on client matters. That means reviewing data handling agreements, not just clicking through terms of service.
The full opinion is available on the ABA's website and is worth reading in full if you are making purchasing or policy decisions for your firm.
A Practical Workflow for AI-Assisted Legal Research
This is not an argument against using AI for legal research. Used correctly, these tools save real time on the exploratory phase of a research task. The point is building a workflow where the AI does the draft work and a human closes the loop before anything reaches a filing.
Start with a grounded tool. Use a legal research platform that retrieves from a known database, such as CoCounsel or a Westlaw AI-assisted product, rather than a general-purpose model for citation work.
Treat every citation as a draft, not a finding. Pull the actual case, read the relevant passage, and confirm that the case says what the AI claims it says. This is the step the Mata v. Avianca attorneys skipped. It takes two to three minutes per citation and is non-negotiable.
Limit general AI tools to non-citation tasks. ChatGPT, Claude, and similar general-purpose models can be useful for drafting arguments, summarizing documents you have already pulled, or brainstorming angles. They should not be the source of citations you file.
Never paste a client matter into a general AI tool without checking the data terms. Many general-purpose tools default to using conversations for model improvement unless you opt out or use an enterprise plan with different terms. Check before using.
Document your verification. Some firms are beginning to keep a brief verification log for AI-assisted research, noting which citations were independently confirmed and by whom. This is partly risk management and partly a response to supervisory obligations under Rule 5.3, which Opinion 512 suggests may extend to AI tools used as nonlawyer assistants.
Use AI earlier in the process, not at the finish line. AI is most useful during the exploratory phase: surfacing potentially relevant doctrines, identifying leading cases to pull, drafting a first pass at an argument structure. It is least safe as the final step, because that is when fabricated citations are most likely to slip through without a second look.
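The verification log mentioned in the steps above can be as simple as a structured record per citation. The sketch below is one possible shape, offered as an assumption rather than a standard: the field names, the `CitationCheck` class, and the rule that a citation is cleared only when a named person has confirmed both existence and support are all illustrative choices.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CitationCheck:
    """One row in a hypothetical AI-research verification log."""
    citation: str
    exists: bool = False            # the case was pulled from a primary source
    supports_claim: bool = False    # a person read the passage and it says what the draft says
    verified_by: Optional[str] = None
    verified_on: Optional[date] = None

    def is_cleared(self) -> bool:
        """Cleared only if a named reviewer confirmed both existence and support."""
        return self.exists and self.supports_claim and self.verified_by is not None


def blocking_citations(log: List[CitationCheck]) -> List[str]:
    """Return the citations that should stop a filing until someone verifies them."""
    return [check.citation for check in log if not check.is_cleared()]


if __name__ == "__main__":
    log = [
        CitationCheck("Placeholder Citation A", True, True, "Reviewer 1", date(2025, 1, 15)),
        CitationCheck("Placeholder Citation B", True, False, "Reviewer 1", date(2025, 1, 15)),
        CitationCheck("Placeholder Citation C"),
    ]
    print("Not cleared for filing:", blocking_citations(log))
```

Keeping existence and support as separate fields is deliberate: a citation that exists but does not support the claim is the misgrounded case, and the log should make that gap visible rather than hide it behind a single "checked" box.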
The Practical Question Your Firm Needs to Answer
The Mata v. Avianca opinion, the JELS hallucination study, and ABA Formal Opinion 512 are not arguments against AI in legal practice. They are a map of where the real risks sit and what responsible use actually requires.
Firms that run into trouble are typically the ones that hand AI output to a paralegal or junior associate without a clear protocol for verifying citations, or that assume a legal-branded tool cannot hallucinate. The firms that use this well are the ones that choose tools designed to be grounded in real legal sources, build verification into the workflow before anything reaches a filing, and treat AI output as a starting point rather than a finished product.
At Futureman Labs, we help small and mid-size firms work through the practical questions: which tools fit your existing workflow, what a defensible AI use policy looks like, and where the highest-leverage automation opportunities are. The free Law Firm AI Readiness Scorecard is a good starting point if you want a clear picture of where your firm stands today.