Can we detect LLM hallucinations? — A quick review of our experiments

Since the beginning of NannyML, our approach to tackling new problems has always relied on doing extensive literature research first, understanding the current state-of-the-art (SoTA), and implementing a handful of the most promising methods. Then, if the researched methods don’t fully solve the issue, we try to come up with our own solution.

We followed this approach when addressing the multivariate data drift detection problem and when considering how to quantify data drift's impact on model performance accurately. This approach led to the development of best-in-class performance estimation methods such as DLE, CBPE, and M-CBPE.

This time, we are looking at the hottest topic in town: LLMs.

In the past few months, we have been delving (hehe 🤭) into LLM hallucination detection. The team has read dozens of papers on the topic. We picked four of the most promising ones, and here, we’ll share some of our results and the methodology for two of them.

Hallucination detection algorithm types

In this post, we will divide LLM hallucination detection methods into two categories: LLM-based and Uncertainty-based.

LLM-based: When we use an LLM to evaluate the output response of another LLM. We can do this by measuring the consistency of the generated output given a prompt or by asking an LLM to rate a generated output on an arbitrary scale.

Uncertainty-based: When we use the notion of uncertainty to evaluate the quality of an LLM's output. Typically, we use the predicted probabilities of the generated output tokens and compute a summary metric that gives a sense of how uncertain the model is about the generated answer.

Let’s explore these families of methods further. For each, we will explain two SoTA methods, share the implementation of one of them, and review some preliminary results.

Research on LLM-based methods

SelfCheckGPT — Measuring consistency between answers

Paper: SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

SelfCheckGPT is a black-box method for hallucination detection that strongly correlates with human annotations.

Image taken from the SelfCheckGPT paper: https://arxiv.org/pdf/2303.08896

The idea is to sample several additional answers for the same prompt and then ask GPT, or any other capable LLM, whether each sample supports the statements in the original output. The judge answers yes or no for every statement, and we measure how often the samples support the original statements; statements that are rarely supported are likely hallucinations.

This method, developed by researchers from the University of Cambridge, is currently one of the best-performing LLM-based hallucination detection methods. However, it has a downside: We have to make many LLM calls to evaluate a single generated response.
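The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `supports` callable is a hypothetical stand-in for the yes/no LLM call, and the score is simply the fraction of samples that do not support a statement.

```python
from typing import Callable, List, Sequence


def selfcheck_scores(
    sentences: Sequence[str],
    samples: Sequence[str],
    supports: Callable[[str, str], bool],
) -> List[float]:
    """Return one inconsistency score per sentence, in [0, 1].

    `supports(sample, sentence)` is an assumed judge: in practice it would
    wrap an LLM call that answers yes/no. 0.0 means every sample supports
    the sentence, 1.0 means none of them do.
    """
    if not samples:
        raise ValueError("at least one sampled answer is required")
    scores = []
    for sentence in sentences:
        verdicts = [supports(sample, sentence) for sample in samples]
        scores.append(1.0 - sum(verdicts) / len(verdicts))
    return scores


if __name__ == "__main__":
    # Toy judge for demonstration only: a plain substring check.
    def toy_judge(sample: str, sentence: str) -> bool:
        return sentence in sample

    samples = [
        "paris is the capital of france.",
        "i think paris is the capital.",
    ]
    print(selfcheck_scores(["paris is the capital"], samples, toy_judge))
```

The loop makes the cost visible: with N samples and M statements, the judge is called N × M times, on top of the N calls needed to draw the samples.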

LLM-Eval — Asking an LLM to rate the response

Paper: LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models

The general idea of this paper is to design an evaluation scheme in which we make an extra call to an LLM and ask it to rate, on an arbitrary scale (e.g., 0 to 5), how good a generated text is on a set of different criteria.

In the paper, researchers from the National Taiwan University compared how this method applied to different closed-source LLMs correlates with human judgments on various datasets and criteria.

Table taken from the LLM-Eval paper: https://arxiv.org/abs/2305.13711

We can see how the scores that Anthropic Claude gives for the APP (Appropriateness) criterion correlate nicely with human judgments.

Internal experiments: LLM-Eval results

We evaluated how this method performs on different datasets. I’ll present the specific results from the WikiQA dataset, a publicly available set of question/answer pairs collected for open-domain question-answering research.

A single record of the WikiQA dataset looks like this:

question_id	question	document_title	answer (generated)	label
Q4	how a water pump works	Pump	A large, electrically driven pump (electropump) for waterworks near the Hengsteysee , Germany .	0

where the label column is 1 if the answer is factual and 0 otherwise.

Unfortunately, a single binary label is not enough if we want to test the LLM-Eval method on several criteria, such as Correctness, Relevance, and Informativeness. We need human scores for each criterion so that we can correlate the LLM-Eval results against them, which means labeling the WikiQA records again by hand.

So, the first task was to create a golden dataset. Given a question from the WikiQA dataset, we asked gpt-3.5-turbo to generate an answer; we then created three evaluation categories (Correctness, Relevance, and Informativeness) and provided a human score from 0 to 5 on each. In total, we labeled 243 question/answer pairs.

After that, we used gpt-3.5-turbo once again, but this time, we asked it to provide evaluations from 0 to 5 on each of the previously generated answers for each of the three evaluation categories.

That gives us a dataset that looks like this:

question_id	question	document_title	answer (gpt-3.5-turbo)	gt_correctness	gt_relevance	gt_informativeness	llm_correctness	llm_relevance	llm_informativeness
Q4	how a water pump works	Pump	A water pump is a device that is primarily used to circulate water in various systems, such as car engines, power plants, fountains, and cooling systems. There are different types of water pumps, but the most common type is called a centrif	3.5	4.0	5.0	4.0	5.0	4.0

Columns with the ‘gt_’ prefix refer to ground truth (human evaluations), and those with the ‘llm_’ prefix contain the scores given by the evaluator LLM.

We used the following function to generate the prompts that asked the LLM to rate each generated answer:

    def _get_prompt(self, generated_text):

        prompt = f'''

        The output should be formatted as a JSON instance that conforms
        to the JSON schema below.

        Each criterion is defined as follows:

        correctness: the quality of being in agreement with the true facts or 
        with what is generally accepted. How correct is the information in the
        text?

        relevance: the degree to which something is related or useful to what 
        is happening or being talked about. How relevant is the generated text
        when compared with the reference text?

        informativeness: a piece of information is considered informative when
        it provides new insights, knowledge, or details that contribute to a
        better understanding of a subject or topic.

        example of the output format:
        {{
            "correctness": value,
            "relevance": value,
            "informativeness": value,
        }}

        Score the following and generated text: {generated_text} on a continuous 
        scale from 0 to 5 on each of the following categories: correctness,
        relevance, and informativeness. Use the previous definitions and your
        knowledge as an NLP researcher to make the best judgment.
        '''
        return prompt
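The reply to this prompt still has to be turned into numbers. A minimal sketch of that parsing step could look like the following; `parse_scores` and `CRITERIA` are hypothetical names introduced for illustration and are not part of the original code.

```python
import json
from typing import Dict

CRITERIA = ("correctness", "relevance", "informativeness")


def parse_scores(raw_reply: str, low: float = 0.0, high: float = 5.0) -> Dict[str, float]:
    """Extract the three criterion scores from the evaluator's reply.

    Assumes the reply contains exactly one JSON object, possibly
    surrounded by extra text, with one numeric value per criterion.
    """
    start = raw_reply.find("{")
    end = raw_reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in the reply")
    payload = json.loads(raw_reply[start:end + 1])

    scores = {}
    for name in CRITERIA:
        value = float(payload[name])
        if not low <= value <= high:
            raise ValueError(f"{name} score {value} is outside [{low}, {high}]")
        scores[name] = value
    return scores


if __name__ == "__main__":
    reply = 'Sure! {"correctness": 4, "relevance": 5.0, "informativeness": 3.5}'
    print(parse_scores(reply))
```

Note that `json.loads` rejects trailing commas, so a reply that copies the trailing comma from the example format in the prompt would raise an error here and need to be retried or repaired.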

Once we had ground truth and LLM evaluations for each of the 243 generated answers, we could move on to investigate how well the LLM evaluations correlated with human judgments.

Pearson and Spearman correlation between ground truth and LLM scores for each criterion.

These results confirm the tendency reported in the paper: a moderate positive correlation between the ground truth scores and those produced by the LLM. Given the simplicity of this method, these are encouraging results.

A way to expand this further would be to experiment with different and more elaborate prompts for each criterion.
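For readers who want to reproduce the correlation step, here is a minimal sketch using SciPy. It assumes the data is available as a mapping from column names to equal-length lists, using the `gt_` and `llm_` column names from the dataset layout shown earlier; the helper name `correlate_criteria` is made up for this example.

```python
from typing import Dict, Mapping, Sequence

from scipy.stats import pearsonr, spearmanr

CRITERIA = ("correctness", "relevance", "informativeness")


def correlate_criteria(
    table: Mapping[str, Sequence[float]],
) -> Dict[str, Dict[str, float]]:
    """Pearson and Spearman correlation between human and LLM scores."""
    results = {}
    for name in CRITERIA:
        human = table[f"gt_{name}"]
        model = table[f"llm_{name}"]
        pearson_r, _ = pearsonr(human, model)
        spearman_r, _ = spearmanr(human, model)
        results[name] = {
            "pearson": float(pearson_r),
            "spearman": float(spearman_r),
        }
    return results


if __name__ == "__main__":
    # Made-up scores, only to show the call; not our experimental data.
    toy = {
        "gt_correctness": [3.5, 1.0, 4.0, 2.0],
        "llm_correctness": [4.0, 2.0, 5.0, 2.5],
        "gt_relevance": [4.0, 2.0, 5.0, 3.0],
        "llm_relevance": [5.0, 3.0, 4.0, 3.5],
        "gt_informativeness": [5.0, 1.0, 3.0, 2.0],
        "llm_informativeness": [4.0, 2.0, 3.0, 1.0],
    }
    print(correlate_criteria(toy))
```

Pearson measures linear agreement between the two sets of scores, while Spearman only looks at their rank order, which is why it is worth reporting both for scores on a 0 to 5 scale.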

Research on Uncertainty-based methods

Uncertainty Quantification — Model confidence may be all you need

Paper: Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation

A simple average of the predicted probabilities of the output tokens seems to be an excellent benchmark for hallucination detection.

As the authors put it, this simple method works well because LLMs tend to know what they don’t know: when a model is inventing content, the out-of-place tokens usually receive low predicted probabilities. We can use those token probabilities to create a sequence-level summary metric that quantifies the level of uncertainty in a generated sentence.

One of the sequence-level summary metrics that the authors experimented with is the mean of the log-probabilities of the generated tokens.
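Written out, and using notation we introduce here for illustration (x is the input, y the generated sequence of L tokens, and p the model's predicted probability), this metric, which the paper calls Seq-Logprob, is:

```latex
\text{Seq-Logprob}(y \mid x) = \frac{1}{L} \sum_{k=1}^{L} \log p\left(y_k \mid y_{<k},\, x\right)
```

Exponentiating this quantity gives the geometric mean of the token probabilities, so the two metrics rank generated answers in the same order.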

Taken from Uncertainty Quantification paper https://arxiv.org/abs/2208.05309 and annotated by me.

The main advantage of this method is its simplicity. It doesn’t require additional LLM calls, so it is cheap to compute. The main downside is that we need access to the predicted token probabilities, which not all closed-source LLMs return. For example, gpt-4 doesn’t provide them, while gpt-3.5-turbo does.

Internal experiments: Uncertainty Quantification results

For these experiments, we again used the WikiQA dataset that we had manually labeled before. This time, instead of comparing ground truth scores against LLM-generated ones, we need access to the predicted token probabilities of the generated answer.

We used the following function to get the output token predicted probabilities for each gpt-3.5-turbo generated answer.

import numpy as np
from scipy.stats import gmean


def get_generated_token_probs(gpt_responses):
    """Summarise the token-level (log-)probabilities of each chat completion."""
    answers = []
    average_logprobs = []
    average_probs = []
    max_logprobs = []
    max_probs = []
    gmean_probs = []
    gpt_probs = []
    for gpt_response in gpt_responses:
        choice = gpt_response.choices[0]
        # Assumes the completion was requested with log-probabilities enabled.
        logprobs = np.array([entry.logprob for entry in choice.logprobs.content])
        probs = np.exp(logprobs)

        answers.append(choice.message.content)
        # The log-prob metrics are negated, so higher values mean more uncertainty.
        average_logprobs.append(float(np.mean(-logprobs)))
        max_logprobs.append(float(np.max(-logprobs)))
        # The probability metrics are not negated: higher values mean more confidence.
        average_probs.append(float(np.mean(probs)))
        max_probs.append(float(np.max(probs)))
        gmean_probs.append(float(gmean(probs)))
        gpt_probs.append(probs.tolist())

    result = {
        'generated_answer': answers,
        'avg_log_prob': average_logprobs,
        'max_log_prob': max_logprobs,
        'avg_prob': average_probs,
        'max_prob': max_probs,
        'gmean_prob': gmean_probs,
        'gpt_probs': gpt_probs
    }

    return result

As you can see, we experimented with the average and maximum of the negative log-probabilities, the simple average and maximum of the probabilities, and their geometric mean. However, for the sake of simplicity, here we discuss only the results for the simple average of the probabilities, since it performed best on the WikiQA dataset.

Pearson and Spearman correlation between ground truth and average predicted token probabilities.

Looking at the results, we see lower correlation values than the ones we got from the LLM-Eval results. There is a slight positive correlation between the uncertainty metric and the ground truth results, especially for the Correctness criterion.

An interesting result appears when we compare uncertainty quantification against the LLM-Eval results. The two methods correlate only very weakly with each other, which suggests that combining them in a bagging or voting approach might yield better results.

Pearson and Spearman correlation between LLM-generated scores and average predicted token probabilities.
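To make the idea of combining the two signals concrete, here is one very simple possibility: rescale each score to [0, 1] and take a weighted average. This is only a sketch of an untested idea, not something we evaluated; the function names and the equal default weighting are assumptions.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant input maps to 0.5 everywhere."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.5 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def combine_scores(
    llm_scores: Sequence[float],
    avg_probs: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Weighted average of the rescaled LLM-Eval score and token probability.

    Higher values mean both signals agree the answer looks trustworthy.
    `weight` is the share given to the LLM-Eval score (an assumed default).
    """
    if len(llm_scores) != len(avg_probs):
        raise ValueError("both score lists must have the same length")
    scaled_llm = min_max_scale(llm_scores)
    scaled_prob = min_max_scale(avg_probs)
    return [
        weight * a + (1.0 - weight) * b
        for a, b in zip(scaled_llm, scaled_prob)
    ]


if __name__ == "__main__":
    # Made-up values, only to show the call.
    print(combine_scores([4.0, 1.0, 5.0], [0.91, 0.62, 0.75]))
```

Whether such a combination actually improves on either method would have to be checked against the human labels.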

Semantic Uncertainty — taking meaning into account

Paper: Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

Text generation tasks are challenging because a sentence can be written in multiple ways while preserving its meaning.

For instance, "France's capital is Paris" means the same as "Paris is France's capital." 🇫🇷

In LLM uncertainty quantification, we often look at token-level probabilities to quantify how "confident" an LLM is about its output. In this paper, however, the authors look at uncertainty at the level of meaning.

Their motivation is that meanings are essential for LLMs' trustworthiness; a system can be reliable even with many different ways to say the same thing, but answering with inconsistent meanings shows poor reliability.

To estimate semantic uncertainty, they introduce an algorithm for clustering sequences that mean the same thing based on the principle that two sentences mean the same thing if we can infer one from the other. 🔄🤝

Then, they compute the likelihood of each meaning by summing the probabilities of the sequences that share that meaning, and estimate the semantic entropy over the resulting distribution of meanings.
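The clustering and entropy steps can be sketched as follows. This is a simplified illustration rather than the paper's implementation: `entails` is a hypothetical callable standing in for an entailment model, each answer is compared only with the first member of each cluster, and the sequence probabilities are simply renormalised over the sampled answers.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_meaning(
    answers: Sequence[str],
    entails: Callable[[str, str], bool],
) -> List[List[int]]:
    """Group answer indices whose texts entail each other in both directions."""
    clusters: List[List[int]] = []
    for index, answer in enumerate(answers):
        for cluster in clusters:
            representative = answers[cluster[0]]
            if entails(representative, answer) and entails(answer, representative):
                cluster.append(index)
                break
        else:
            # No existing meaning matched, so this answer starts a new cluster.
            clusters.append([index])
    return clusters


def semantic_entropy(
    answers: Sequence[str],
    sequence_probs: Sequence[float],
    entails: Callable[[str, str], bool],
) -> float:
    """Entropy over meanings, summing the probabilities within each cluster."""
    if len(answers) != len(sequence_probs):
        raise ValueError("one probability per answer is required")
    clusters = cluster_by_meaning(answers, entails)
    total = sum(sequence_probs)
    meaning_probs = [sum(sequence_probs[i] for i in c) / total for c in clusters]
    return -sum(p * math.log(p) for p in meaning_probs if p > 0)


if __name__ == "__main__":
    # Toy entailment check for demonstration only: same city mentioned.
    def toy_entails(a: str, b: str) -> bool:
        return any(c in a.lower() and c in b.lower() for c in ("paris", "rome"))

    answers = ["France's capital is Paris", "Paris is France's capital", "Rome"]
    print(semantic_entropy(answers, [0.5, 0.3, 0.2], toy_entails))
```

In this toy run the two Paris answers collapse into a single meaning, so the entropy is lower than it would be if the three strings were treated as three different answers.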

Taken from the Semantic Uncertainty paper: https://arxiv.org/abs/2302.09664 and annotated by me.

Final thoughts

Looking back at the results, it is clear that this is only the beginning of research on hallucination detection. The two methods that we have implemented don’t fully solve the problem, but they offer excellent benchmarks for any other method that comes next.

Internally, we are working on a hallucination detection tool that implements LLM-Eval and Uncertainty Quantification out of the box. This will offer AI engineers a simple and quick plug-and-play LLM monitoring tool.

If you want to learn more about it and gain deeper insight into our LLM hallucination research, Wojtek Kuberski, NannyML’s CTO, is hosting a webinar on the topic.

Topic: Strategies to Monitor and Mitigate LLM Hallucinations

Date: 9th May 2024, 2 pm CET

Link: https://www.linkedin.com/events/strategiestomonitorandmitigatel7190764704871436288/comments/