Robots v. Rube Goldberg Machines: How AI Helps Solve The Precision and Recall Gap


Artificial intelligence (AI) and technology-assisted review (TAR) have emerged as powerful tools in the legal industry, offering better document precision and recall than traditional keyword searches. However, some lawyers remain skeptical about the use of AI and TAR in document reviews, fearing the potential for court sanctions or increased costs. In this blog, we will explore the reasons why AI-led document reviews are superior to keyword searches.

📚 Related reading: Alternate Legal History: If Enron Had DISCO

How do you measure the effectiveness of information retrieval in the discovery process?

First, let’s level-set on some terminology that will be important to understand the relative effectiveness of keywords v. technology-assisted review: precision and recall. "Recall is the fraction of relevant documents identified during a review; precision is the fraction of identified documents that are relevant. Thus, recall is a measure of completeness, while precision is a measure of accuracy or correctness."

Precision: the proportion of relevant documents among the total number of documents that were retrieved in a search or review process. This image illustrates a 72% precision rate.

A high precision score means that the majority of documents identified as relevant are indeed relevant, while a low precision score indicates that many irrelevant documents were also identified as relevant. In other words, precision measures how accurate a given review process is in identifying relevant documents.

Recall: the proportion of relevant documents retrieved compared to the total number of relevant documents in the collection. This image shows 100% recall.

On the other hand, recall measures how many of the relevant documents in a collection were actually identified by the review. A high recall score means that most relevant documents were found, while a low recall score indicates that many relevant documents were missed.
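Both metrics reduce to simple ratios. Here is a minimal Python sketch; the counts are invented for illustration and do not come from any particular matter:

```python
def precision_recall(retrieved_relevant, retrieved_total, relevant_total):
    """Compute precision and recall for a document review.

    retrieved_relevant: relevant documents the review actually identified
    retrieved_total:    all documents the review flagged as relevant
    relevant_total:     all relevant documents in the whole collection
    """
    precision = retrieved_relevant / retrieved_total   # accuracy of what was flagged
    recall = retrieved_relevant / relevant_total       # completeness of the review
    return precision, recall

# Example: a review flags 1,000 documents, 720 of which are truly relevant,
# out of 900 relevant documents in the entire collection.
p, r = precision_recall(720, 1000, 900)
print(f"precision = {p:.0%}, recall = {r:.0%}")  # precision = 72%, recall = 80%
```

In practice, the true counts are never known exactly and must themselves be estimated by sampling, which is why the testing discussed below matters.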

Do I really need to do testing to show the effectiveness of my search techniques?

Yes, you really do. Testing and validation techniques, such as measuring precision and recall, are widely recognized as not only beneficial but necessary across all forms of information retrieval, whether keyword searches or technology-assisted review.

For example, the court in City of Rockford v. Mallinckrodt ARD Inc. cited dozens of pieces of legal literature and prior case law discussing the purposes of testing, its reasonableness in providing quality, proportional results, and its historical use in keyword-based reviews. In William A. Gross Constr. Assocs., Inc. v. Am. Mfrs. Mut. Ins. Co. the court went so far as to say that “what is required is something other than a lawyer's guesses . . . without any quality control testing.” The need for testing in the use of keywords is echoed by the Sedona Conference as well as commentators across the legal space, such as Craig Ball, Raymond Biederman and Sean Burke, and Doug Austin.

When it comes to TAR, parties have had great success using precision and recall metrics to demonstrate the proportionality and defensibility of their chosen review methodology. For example, in Lawson v. Spirit AeroSystems the court sided with Spirit that cutting off its review at an 80% recall rate was proportional, refusing to order Spirit to produce the remaining unreviewed documents. Courts have found TAR processes achieving a 75% recall rate to be appropriate. And a 2017 Pocket Guide to TAR for federal judges states, "[A] recall or precision of 80% may be appropriate for one particular review, this does not mean that 80% is a benchmark for all other reviews," highlighting the need for an individualized proportionality analysis in each case, but also underscoring the importance of testing.
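A recall figure like the 80% in Lawson is typically an estimate, not an exact count: rather than reviewing every set-aside document, the team reviews a random sample of the discard pile and projects how many relevant documents were missed. The sketch below illustrates that arithmetic with hypothetical counts; real validation protocols also report sample-based confidence intervals, which are omitted here:

```python
def estimated_recall(found_relevant, sample_size, sample_relevant, discard_size):
    """Estimate recall by sampling the discard (unreviewed) pile.

    found_relevant:  relevant documents identified by the review
    sample_size:     size of the random sample drawn from the discard pile
    sample_relevant: relevant documents found in that sample
    discard_size:    total documents in the discard pile
    """
    # Project the sample's relevance rate across the whole discard pile.
    missed = discard_size * (sample_relevant / sample_size)
    return found_relevant / (found_relevant + missed)

# Example: 8,000 relevant documents found; a 500-document sample of a
# 100,000-document discard pile turns up 10 relevant documents.
recall = estimated_recall(8000, 500, 10, 100000)
print(f"estimated recall = {recall:.0%}")  # estimated recall = 80%
```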

Exploring the Precision and Recall Gap: Why AI Outperforms Search Terms

Search terms can be useful in identifying relevant documents, but they have limitations. Humans are not very good at crafting keywords to find relevant information, particularly when dealing with data like text messages, emails, or social media, where abbreviations and typos are highly likely. There are several reasons for this:

  1. Bias and subjectivity: Humans tend to have biases and subjective opinions that can influence the choice of keywords they use. What one person considers relevant may not be relevant to another person.
  2. Inability to anticipate language variations: Humans are not able to anticipate all the possible variations in the language that may be used in the data. For example, they may not know all the abbreviations or acronyms that are commonly used in a particular industry, company, or even a particular department.
  3. Typos and misspellings: Humans may not be able to account for the varying typos or misspellings when crafting keywords. These errors can cause relevant data to be missed, or irrelevant data to be included in the search results.
  4. Time constraints: When under time pressure, humans may not have enough time to carefully craft the keywords and may end up using generic or vague terms that do not effectively narrow down the search results.

Noted legal scholar and former US Magistrate Judge, Andrew Peck, has been a vocal critic of relying solely on keywords to identify relevant documents. In his article, "Search, Forward," he compared the process of relying on keywords to playing "Go Fish." Instead, Peck recommends parties use more sophisticated search techniques, like technology-assisted review, to increase the efficiency and accuracy of their document review. He’s joined in his critique of keywords by other former magistrate judges, Facciola and Grimm.

In a 2009 article discussing the limitations and problems with keyword searches, Gregory Fordham analogized keyword searching to a device drawn by Rube Goldberg, the 20th-century cartoonist best known for depicting complex machines that accomplish simple tasks in convoluted ways:

 “After reviewing the preceding cases and realizing that a properly designed keyword search methodology could include features like iterative testing, sampling, Boolean logic, proximity locators, stemming, fuzzy logic, thesauri, synonyms, statistical analysis, etc., the litigator may well feel like a cog in one of Goldberg’s devices.” 

In other words, keyword searches can work, but they can also create an unnecessarily complex system to reach results that could be accomplished through simpler means.

In contrast, TAR 2.0 workflows use statistical models that are trained on the entire dataset, allowing them to identify relevant documents based on patterns and relationships that may not be apparent to human reviewers. This approach is particularly effective when dealing with large and complex datasets, where search terms may miss important documents or produce too many false positives.

Maura Grossman and Gordon Cormack, whose work centers on the intersection of information retrieval, technology, and the law, conducted a series of experiments comparing different review methods. Their work demonstrated that “technology-assisted review can achieve at least as high recall as manual review, and higher precision, at a fraction of the review effort.” A subsequent study, leveraging work from the TREC Legal Track and a meta-analysis of other precision and recall studies, concluded:

“The literature reports that TAR methods using relevance feedback can achieve considerably greater than the 65% recall and 65% precision reported . . . as the practical upper bound on retrieval performance . . . since that is the level at which humans agree with one another.”

What is TAR 2.0?

TAR 2.0 is an advanced form of document review that leverages artificial intelligence algorithms to augment the decision-making of lawyers and legal professionals, bringing automation to the document review process. It uses statistical models to classify documents based on input from the case team regarding their relevance to the case, allowing reviewers to prioritize their efforts on the most relevant documents.

TAR 2.0 workflows also provide a more consistent and transparent approach to document review. Human reviewers may have biases or inconsistencies in their review process, leading to errors or inconsistencies in their findings. In contrast, TAR 2.0 workflows provide a standardized and repeatable approach, allowing for greater consistency and accuracy in the review process.

How does TAR 2.0 work?

TAR 2.0 workflows start by training a machine learning model on a sample of the documents, using feedback from human reviewers. The model learns to classify documents based on their relevance to the case and then applies this classification to the rest of the documents in the dataset.

The reviewers can then use the model's predictions to prioritize their review efforts, focusing on the documents that are most likely to be relevant. As the reviewers provide feedback on the model's predictions, the model continues to learn and refine its predictions.
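The prioritize-review-retrain cycle described above can be sketched in miniature. To be clear, this toy is not DISCO's implementation or any vendor's: production TAR 2.0 systems use proper statistical classifiers over rich text features, but the loop has the same shape. Here, `label_fn` stands in for the human reviewer, and the "model" is just a weight per token, nudged toward each coding decision:

```python
import random
from collections import defaultdict

def tar_loop(documents, label_fn, batch_size=2, rounds=3, seed=0):
    """Toy prioritization loop: score, review a batch, retrain, repeat.

    documents: {doc_id: set of tokens}; label_fn stands in for the reviewer.
    """
    rng = random.Random(seed)
    weights = defaultdict(float)              # per-token relevance weights
    reviewed, review_order = {}, []

    def score(doc_id):
        # A document's score is the sum of its tokens' learned weights.
        return sum(weights[t] for t in documents[doc_id])

    for _ in range(rounds):
        unreviewed = [d for d in documents if d not in reviewed]
        if not unreviewed:
            break
        if reviewed:
            # Prioritize the documents the current model scores highest.
            batch = sorted(unreviewed, key=score, reverse=True)[:batch_size]
        else:
            # Seed round: no model yet, so start from a random sample.
            batch = rng.sample(unreviewed, min(batch_size, len(unreviewed)))
        for doc_id in batch:
            label = label_fn(doc_id)          # the human coding decision
            reviewed[doc_id] = label
            review_order.append(doc_id)
            # Perceptron-style update: nudge this document's token
            # weights toward the reviewer's call.
            delta = 1.0 if label else -1.0
            for token in documents[doc_id]:
                weights[token] += delta
    return review_order, reviewed

# A four-document "collection"; d1 and d3 are the truly relevant ones.
docs = {
    "d1": {"merger", "pricing"},
    "d2": {"lunch", "menu"},
    "d3": {"pricing", "forecast"},
    "d4": {"menu", "recipe"},
}
order, labels = tar_loop(docs, label_fn=lambda d: d in {"d1", "d3"})
print(order, labels)
```

The point of the design is visible even at this scale: after the seed round, documents sharing vocabulary with previously coded relevant documents float to the top of the queue, so reviewer effort concentrates where relevance is most likely.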

What’s the upshot?

TAR 2.0 workflows offer significant advantages over traditional review methods that rely on search terms. By leveraging machine learning and artificial intelligence algorithms, TAR 2.0 workflows provide a more accurate and efficient approach to document review, with greater precision and recall of relevant documents. As a result, TAR 2.0 workflows can save time and reduce costs, while improving the quality of the review process.

Interested in learning more about how AI and TAR 2.0 workflows can enhance your discovery efforts? Book time with one of our AI consultants for expert advice and custom workflow designs.

Caitlin Ward

Caitlin Ward is a product marketing manager at DISCO. She has more than a decade of experience leading ediscovery initiatives and advocating for the adoption of legal tech as an attorney. Since joining DISCO, she focuses on helping lawyers innovate to overcome their ediscovery and case management challenges.