Dawid Robert Kotur is co-founder and CEO and Nick Long is CTO at Curvestone. Views are the authors’ own.
In light of last year's OpenAI drama, many companies will start questioning the firm's market dominance in 2024 and come to the realization that it might be too risky to bet on a single provider for LLM infrastructure.
"But they're the best and biggest, right?" they might ask. Actually, when it comes to large language models (LLMs), there's a common misconception: that bigger is always better.
While expensive models undeniably have their merits, success with LLMs in the business environment hinges much more on matching a model's capabilities with the specific requirements of the task.
In parallel, the industry we operate in, legal services, is highly regulated and notoriously inflexible, so adopting new software takes a very long time. And swapping that software for a new one is even more problematic: neither in-house counsel nor law firms like iterative change. In short, they can't adapt or evolve at the same pace as the technology itself.
But at the same time, legal professional services is one of the sectors where LLMs will have the most impact. The upshot: the cost of getting your LLM choice wrong is high.
For law firms and in-house legal departments that have been betting heavily on OpenAI, we wanted to share some insights into how to cost out and choose the right LLM and, most importantly, how to avoid getting tied into a single one, especially if its fluctuating leadership direction ends up affecting its lifespan.
The real cost of running LLMs
What’s important to understand here is that deploying commercial LLMs is not just about computational power and performance. Behind the scenes, it also involves significant expenses, such as research and training costs.
Training models like GPT, Llama, or Alpaca from scratch is a multi-million-dollar endeavor. A case in point is a French company that recently secured $100 million in funding, with the lion's share of that money dedicated to training. Such substantial investments by providers inevitably trickle down, with businesses typically absorbing the added costs.
Many CTOs and CIOs at legal firms, and tech-savvy in-house legal professionals with hands-on experience deploying LLMs, can attest to these costs. They've navigated the intricate, and often expensive, waters of integrating LLMs through APIs and leveraging them for complex workflows.
But others may not have that knowledge — and as businesses scale, the financial implications of these decisions amplify. An understanding of how to choose the most budget-efficient approach to using LLMs before building any kind of infrastructure will save huge resources in the long run.
LLM types for legal applications: it’s all about the right configuration
The first essential step is to evaluate the volume of queries or documents you intend to process with an LLM. Even if it's slower or more expensive, you might want to start with the highest-performing model while you test out your use cases.

If you are planning to process high volumes of queries or documents, you can then explore whether a smaller model might get the result more quickly or cost-effectively, which is often the case.
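As a rough illustration of how volume turns into spend, here is a back-of-the-envelope comparison. The query volumes and per-token prices below are hypothetical placeholders, not real vendor rates; substitute your provider's actual pricing.

```python
# Back-of-the-envelope monthly cost comparison for two models.
# All volumes and prices are illustrative placeholders.

def monthly_cost(queries_per_month, tokens_per_query, price_per_1k_tokens):
    """Estimated monthly spend in dollars."""
    return queries_per_month * (tokens_per_query / 1000) * price_per_1k_tokens

QUERIES = 50_000   # e.g. contract clauses reviewed per month (hypothetical)
TOKENS = 2_000     # average prompt + completion tokens per query (hypothetical)

large = monthly_cost(QUERIES, TOKENS, price_per_1k_tokens=0.03)   # hypothetical large-model rate
small = monthly_cost(QUERIES, TOKENS, price_per_1k_tokens=0.002)  # hypothetical small-model rate

print(f"Large model: ${large:,.0f}/month")   # $3,000/month
print(f"Small model: ${small:,.0f}/month")   # $200/month
```

Even with made-up numbers, the point stands: at high volumes, a per-token price difference of one order of magnitude compounds into a very different annual bill.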
Be mindful, however, that not every query demands an extensive LLM "context window" (the model's working memory). So resist the urge to save money by completely filling a model's context window with a very large amount of information.
Some newer models, like GPT-4 Turbo, have very large context windows, and it's tempting to get more done with a single query by filling them to their limit. But we've observed that this isn't the best strategy: models usually perform better with less context, and breaking a complex task down into multiple smaller tasks will usually yield better results.
On top of this, models with very large context windows sometimes compromise on response quality. For instance, the recent GPT-4 Turbo has been shown to perform worse on certain tasks than its predecessor, which had a smaller context window; this is suspected to be because the model was trained on a smaller knowledge base.
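The splitting approach can be sketched in a few lines. Here `call_llm` is a hypothetical stand-in for whichever provider API you use, and the fixed-size splitter is deliberately naive; in practice you would split on clause or section boundaries.

```python
# Sketch: split one oversized task into several small calls instead of
# filling a large context window in a single call.
# `call_llm` is a hypothetical placeholder for a real provider API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # replace with a real API call

def chunk(text: str, max_chars: int = 8_000) -> list:
    """Naive fixed-size split; real systems split on clause/section boundaries."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarise(document: str) -> str:
    # Step 1: summarise each chunk independently, keeping each call's context small.
    partials = [call_llm("Summarise this contract excerpt:\n" + c)
                for c in chunk(document)]
    # Step 2: combine the partial summaries in a final, much smaller call.
    return call_llm("Combine these partial summaries into one:\n" + "\n".join(partials))
```

Each call sees only as much context as it needs, which in our experience tends to produce better answers than one maximally stuffed prompt.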
It's also crucial for the chosen LLM to rely solely on the provided context, to prevent misleading 'hallucinations'. So configure it so that it can say "I don't know" when it has low confidence in its own response.
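One common way to do this is in the prompt itself: restrict the model to the supplied context and give it an explicit escape hatch. The wording below is illustrative, not a guaranteed fix for hallucination; validate it against your own test cases.

```python
# Sketch: a prompt template that grounds the model in the supplied
# context only, with an explicit "I don't know" escape hatch.

def grounded_prompt(context: str, question: str) -> str:
    """Build a prompt that forbids answering beyond the given context."""
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, reply exactly "
        "\"I don't know\" rather than guessing.\n\n"
        "Context:\n" + context + "\n\n"
        "Question: " + question
    )
```

The resulting string is then passed to whichever model client you use; combined with checking the provider's confidence signals where available, this noticeably reduces confident-sounding fabrication.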
For legal use cases, the anxiety that absolutely no mistakes can be made often leads legal teams to manually check the LLM's output with a fine-tooth comb, removing much of the time saving. Here's what you can do to avoid this:
- Break the task down into smaller steps and use the LLM for less critical parts, for example summarisation, suggesting wording, or templating. For these tasks, you can usually use a smaller and less sophisticated model.
- Have a human craft the response, but use the LLM as an additional layer of error detection, reducing the potential for costly mistakes.
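The second pattern can be sketched as a prompt builder: the human writes the advice, and the model is asked only to flag discrepancies. `review_prompt` is a hypothetical helper; pass its output to whichever model client you use.

```python
# Sketch: the LLM as a checking layer, not the author.
# The human drafts the text; the model only flags discrepancies.

def review_prompt(human_draft: str, source_contract: str) -> str:
    """Build a prompt that asks the model to review, not to write."""
    return (
        "You are a second pair of eyes, not the author. List any statement "
        "in the draft that is unsupported by, or contradicts, the contract. "
        "If there are none, reply 'No issues found.'\n\n"
        "Contract:\n" + source_contract + "\n\n"
        "Draft:\n" + human_draft
    )
```

Because the model's output here is a list of flags rather than client-facing text, a false positive costs a minute of review rather than a professional-liability problem.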
What you need to remember is that LLMs excel in certain areas, less so in others.
For legal services, they are great at:

- Processing vast volumes of information beyond human capacity, such as finding relevant information in a large set of documents where keyword search is not sufficient.
- Highlighting potential issues in minor claims that you might otherwise approve automatically because a review is not economically viable.
- Redacting sensitive information, such as medical details, in documents where redaction requires an understanding of context.
- Classifying documents.
The imperative of flexibility
Another mistake legal professionals often make is restricting themselves to inflexible infrastructure. In-house operational leaders may not see this now, because they are experimenting with lots of models, but once they start prioritizing one, that's a dangerous path.
This is especially so when a new model emerges, showing superior capabilities or niche specializations.
Adapting to these changes with an inelastic system would mean a painful and expensive overhaul. The solution?
Choose a multi-model strategy from the outset and opt for an infrastructure that facilitates model swaps with minimal customisation. Any LLM provider that can't give you this, or at least a roadmap for it, isn't going to serve you well in the long term.
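In practice, this means hiding every vendor behind one thin interface so that switching models is a configuration change, not a rewrite. A minimal sketch, with illustrative class names and stubbed-out API calls:

```python
# Sketch: a provider-agnostic layer so model swaps are a one-line change.
# Class names are illustrative; wire each adapter to the real vendor SDK.

from abc import ABC, abstractmethod

class LLMClient(ABC):
    """The only interface the rest of the codebase is allowed to depend on."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class OpenAIClient(LLMClient):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError  # call the OpenAI API here

class LlamaClient(LLMClient):
    def complete(self, prompt: str) -> str:
        raise NotImplementedError  # call a self-hosted Llama endpoint here

REGISTRY = {"openai": OpenAIClient, "llama": LlamaClient}

def get_client(name: str) -> LLMClient:
    """Pick the model from configuration, not from scattered vendor imports."""
    return REGISTRY[name]()
```

When a better model appears, you write one new adapter and change one config value; nothing else in the workflow needs to know.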
Anticipating the evolution of LLMs
In the coming years, we think the market will stop living and dying by OpenAI’s board movements. We can expect diverse LLM methodologies to emerge, driving down costs.
Market leaders will change; while it's GPT today, Alpaca might lead tomorrow. And so the most important thing businesses can do is to build in the knowledge and flexibility to switch between these options instead of focusing on size.
The concept of "forking" in LLMs will also become prominent. This involves adapting a base model like Llama for specific applications, such as Scottish law or unique financial challenges. As the field evolves, we might see hundreds of thousands of forks, each crafted for distinct roles.
To prepare for this, law firms and in-house legal teams should champion testing with various models and prioritize modularity, ensuring compatibility with a diverse range of models.
They should also align the capabilities of today's models with both current and foreseeable demands, and project costs on that basis: for right now, but also for two years' time, when it's rolled out to the whole company.
These are the only ways legal professionals can both jump on the LLM wave and avoid excruciating costs.