Exploring Quality and Safety for LLM Applications – Lesson 2

Learning is a Blast! 🚀

Hey everyone! I’m back with another exciting deep dive into Quality and Safety for LLM Applications, the free course on Coursera by DeepLearning.AI. Last time, we explored dataset setup, prompt-response relevance, data leakage, and more. If you missed that, check out my previous post! Now, let’s jump into Lesson 2, where we evaluate LLM-generated responses using BLEU scores, BERT scores, and self-similarity metrics.

📌 Setting Up the Playground

First things first, let’s import the necessary libraries and load the dataset:

import helpers
import evaluate
import pandas as pd

pd.set_option('display.max_colwidth', None)

chats = pd.read_csv("./chats.csv")

Now, our chats DataFrame is ready! It contains prompts and responses that we’ll analyze.

🎯 Measuring Prompt-Response Relevance

1️⃣ BLEU Score – How Well Do Responses Match Prompts?

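The BLEU (Bilingual Evaluation Understudy) score measures n-gram overlap between a generated text and a reference. Here we treat the response as the prediction and the prompt as the reference, so a higher score means the response reuses more of the prompt's wording. We compute it as follows: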

bleu = evaluate.load("bleu")

bleu.compute(predictions=[chats.loc[2, "response"]], 
             references=[chats.loc[2, "prompt"]], 
             max_order=2)

This calculates the BLEU score for the response at index 2, comparing it to its corresponding prompt. Setting max_order=2 limits the comparison to unigrams and bigrams. But let’s scale it up!

from whylogs.experimental.core.udf_schema import register_dataset_udf

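# Register a dataset-level UDF that reads the prompt and response columns
# and writes one BLEU score per row to "response.bleu_score_to_prompt".
@register_dataset_udf(["prompt", "response"], "response.bleu_score_to_prompt")
def bleu_score(text):
  scores = []
  for prompt, response in zip(text["prompt"], text["response"]):
    scores.append(
      bleu.compute(
        predictions=[response],  # the response is the text being scored
        references=[prompt],     # the prompt is the reference, as in the single-row call
        max_order=2
      )["bleu"]
    )
  return scores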

This registers the BLEU calculation as a whylogs user-defined function (UDF), so the metric can be computed for every row of the dataset.

Let’s visualize:

helpers.visualize_langkit_metric(
    chats, 
    "response.bleu_score_to_prompt", 
    numeric=True)

And check low-scoring responses:

helpers.show_langkit_critical_queries(
    chats, 
    "response.bleu_score_to_prompt", 
    ascending=True)
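
Before moving on, it helps to see what that number is made of. Here is a minimal pure-Python sketch of single-reference BLEU with max_order=2. It assumes plain whitespace tokenization and no smoothing, so it will not match the evaluate library's output exactly; simple_bleu and ngrams are illustrative names, not part of any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_order=2):
    """Toy single-reference BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_order) times a brevity penalty. No smoothing is applied."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred:
        return 0.0

    precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngrams(pred, n)
        ref_counts = ngrams(ref, n)
        # Clip each n-gram count by how often it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        total = sum(pred_counts.values())
        precisions.append(overlap / total if total else 0.0)

    if min(precisions) == 0:
        return 0.0  # any zero precision zeroes the geometric mean

    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    # Penalize predictions that are not longer than the reference.
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * geo_mean
```

Because the score is built from exact n-gram matches, a response that answers the prompt in different words can score close to zero, which is why low BLEU alone does not prove a response is irrelevant.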

2️⃣ BERT Score – Semantic Similarity Matters!

BLEU only counts exact n-gram overlaps, so it misses paraphrases. BERT Score instead compares contextual token embeddings from a pretrained model, which lets it capture semantic similarity:

bertscore = evaluate.load("bertscore")

bertscore.compute(
    predictions=[chats.loc[2, "prompt"]],
    references=[chats.loc[2, "response"]],
    model_type="distilbert-base-uncased")

We extend this across the dataset:

@register_dataset_udf(["prompt", "response"], "response.bert_score_to_prompt")
def bert_score(text):
  return bertscore.compute(
      predictions=text["prompt"].to_numpy(),
      references=text["response"].to_numpy(),
      model_type="distilbert-base-uncased"
    )["f1"]

Now, let’s visualize it:

helpers.visualize_langkit_metric(
    chats, 
    "response.bert_score_to_prompt", 
    numeric=True)

Identify weak responses:

helpers.show_langkit_critical_queries(
    chats, 
    "response.bert_score_to_prompt", 
    ascending=True)
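
To make the idea behind the score concrete, here is a rough sketch of the greedy matching step: every token takes its best cosine match on the other side, and the averages give precision, recall, and F1. The 2-D vectors are made-up stand-ins for real token embeddings, and greedy_match_f1 is a hypothetical helper, not the bertscore API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(pred_vecs, ref_vecs):
    """Precision: each prediction token takes its best match in the reference.
    Recall: each reference token takes its best match in the prediction."""
    precision = sum(
        max(cosine(p, r) for r in ref_vecs) for p in pred_vecs
    ) / len(pred_vecs)
    recall = sum(
        max(cosine(r, p) for p in pred_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical 2-D "token embeddings", invented for illustration only.
pred = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0]]
precision, recall, f1 = greedy_match_f1(pred, ref)
```

In this sketch, swapping the two arguments swaps precision and recall but leaves F1 unchanged, so an F1-based metric does not depend on which text is passed as the prediction.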

Apply UDFs to annotate the dataset:

from whylogs.experimental.core.udf_schema import udf_schema

annotated_chats, _ = udf_schema().apply_udfs(chats)

Filter for potential hallucinations, here the rows whose BERT score against the prompt is 0.75 or lower:

helpers.evaluate_examples(
  annotated_chats[annotated_chats["response.bert_score_to_prompt"] <= 0.75],
  scope="hallucination")

🔍 Response Self-Similarity – Detecting Repetitive or Contradictory Outputs

We load an extended dataset with multiple response variations:

chats_extended = pd.read_csv("./chats_extended.csv")
chats_extended.head(5)

1️⃣ Sentence Embedding Cosine Distance

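We use sentence embeddings to measure how similar the responses are to each other. Despite the heading, the code below computes cosine similarity, so higher values mean more consistent responses: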

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

Registering a self-similarity function:

from sentence_transformers.util import pairwise_cos_sim

@register_dataset_udf(["response", "response2", "response3"], 
                      "response.sentence_embedding_selfsimilarity")
def sentence_embedding_selfsimilarity(text):
  response_embeddings = model.encode(text["response"].to_numpy())
  response2_embeddings = model.encode(text["response2"].to_numpy())
  response3_embeddings = model.encode(text["response3"].to_numpy())

  cos_sim_with_response2 = pairwise_cos_sim(
    response_embeddings, response2_embeddings
  )
  cos_sim_with_response3 = pairwise_cos_sim(
    response_embeddings, response3_embeddings
  )

  return (cos_sim_with_response2 + cos_sim_with_response3) / 2

Visualize:

helpers.visualize_langkit_metric(
    chats_extended, 
    "response.sentence_embedding_selfsimilarity", 
    numeric=True)

Check critical queries:

helpers.show_langkit_critical_queries(
    chats_extended, 
    "response.sentence_embedding_selfsimilarity", 
    ascending=True)
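
If you want to see the arithmetic without downloading a model, here is a small sketch of the same averaging logic on toy vectors. The 2-D embeddings are invented for illustration, and self_similarity is a hypothetical stand-in for the registered UDF, not a replacement for it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_similarity(resp_vecs, resp2_vecs, resp3_vecs):
    """Per row, average the cosine similarity of the main response with
    each of the two alternative responses."""
    scores = []
    for r1, r2, r3 in zip(resp_vecs, resp2_vecs, resp3_vecs):
        scores.append((cosine(r1, r2) + cosine(r1, r3)) / 2)
    return scores


# Row 0: all three responses point the same way (fully consistent).
# Row 1: one alternative is orthogonal and the other is opposite.
responses = [[1.0, 0.0], [1.0, 0.0]]
responses2 = [[1.0, 0.0], [0.0, 1.0]]
responses3 = [[1.0, 0.0], [-1.0, 0.0]]

scores = self_similarity(responses, responses2, responses3)
```

Sorting in ascending order, as the critical-queries helper does, surfaces rows like the second one first: the responses that disagree most with their own alternatives.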

2️⃣ LLM Self-Evaluation – Can the AI Score Itself?

We use OpenAI’s GPT to self-assess response consistency:

import openai
import helpers

openai.api_key = helpers.get_openai_key()
openai.base_url = helpers.get_openai_base_url()

Self-evaluation function:

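def prompt_single_llm_selfsimilarity(dataset, index):
    # openai>=1.0 call style, consistent with setting openai.base_url;
    # openai.ChatCompletion.create is only supported in releases before 1.0.
    return openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "system",
            "content": f"""You will be provided with a text passage \
            and your task is to rate its consistency with the provided context...
            """
        }]
    )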

Filter low-consistency responses:

chats_extended[
    chats_extended["response.prompted_selfsimilarity"] <= 0.8
]
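
One step the snippets above skip is turning the model's free-text reply into a number you can filter on. Here is a minimal sketch of that step, assuming the model was asked to answer with a score between 0 and 1. parse_consistency_score and the sample replies are hypothetical and only there for illustration.

```python
import re


def parse_consistency_score(reply_text):
    """Return the first number in the reply as a float clamped to [0, 1],
    or None when the reply contains no number."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply_text)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group())))


# Hypothetical replies; a real model may phrase its answers differently.
replies = ["0.95", "Score: 0.4, the passages disagree", "I cannot rate this"]
scores = [parse_consistency_score(r) for r in replies]

# Keep the replies whose parsed score is at or below the 0.8 threshold.
low_consistency = [
    reply for reply, score in zip(replies, scores)
    if score is not None and score <= 0.8
]
```

Returning None for unparseable replies keeps them out of the low-score bucket, so they can be reviewed separately instead of being mistaken for inconsistent responses.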

🔗 Wrapping Up

That was an incredible session! We learned how to evaluate LLM responses for relevance, semantic similarity, and self-consistency. Shoutout to DeepLearning.AI for this amazing course! 🏆

Until next time, happy learning! 🚀
