Exploring Quality and Safety for LLM Applications – Lesson 2


Hi everyone! I'm Mojtaba Maleki, an AI Researcher and Software Engineer at The IT Solutions Hungary. Born on February 11, 2002, I hold a BSc in Computer Science from the University of Debrecen. I'm passionate about creating smart, efficient systems, especially in the fields of Machine Learning, Natural Language Processing, and Full-Stack Development. Over the years, I've worked on diverse projects, from intelligent document processing to LLM-based assistants and scalable cloud applications. I've also authored four books on Computer Science, earned industry-recognized certifications from Google, Meta, and IBM, and contributed to research projects focused on medical imaging and AI-driven automation. Outside of work, I enjoy learning new things, mentoring peers, and yes, I'm still a great cook. So whether you need help debugging a model or seasoning a stew, I’ve got you covered!

Learning is a Blast! 🚀

Hey everyone! I’m back with another exciting deep dive into Quality and Safety for LLM Applications, the free course on Coursera by DeepLearning.AI. Last time, we explored dataset setup, prompt-response relevance, data leakage, and more. If you missed that, check out my previous post! Now, let’s jump into Lesson 2, where we evaluate LLM-generated responses using BLEU, BERTScore, and self-similarity metrics.


📌 Setting Up the Playground

First things first, let’s import the necessary libraries and load the dataset:

import helpers
import evaluate
import pandas as pd

pd.set_option('display.max_colwidth', None)

chats = pd.read_csv("./chats.csv")

Now, our chats DataFrame is ready! It contains prompts and responses that we’ll analyze.


🎯 Measuring Prompt-Response Relevance

1️⃣ BLEU Score – How Well Do Responses Match Prompts?

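The BLEU (Bilingual Evaluation Understudy) score measures n-gram overlap between a candidate text and a reference text. Here the response is the candidate (predictions) and the prompt is the reference, and max_order=2 limits the comparison to unigrams and bigrams. We compute it as follows: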

bleu = evaluate.load("bleu")

bleu.compute(predictions=[chats.loc[2, "response"]], 
             references=[chats.loc[2, "prompt"]], 
             max_order=2)

This calculates the BLEU score for the response at index 2, comparing it to its corresponding prompt. But let’s scale it up!
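To make the number less of a black box, here is a minimal sketch of what BLEU computes, assuming plain whitespace tokenization, a single reference, and no smoothing. The name simple_bleu is hypothetical, and the values will not match evaluate's output exactly because that library uses its own tokenizer.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_order=2):
    """Unsmoothed, single-reference BLEU on whitespace tokens (illustrative only)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0

    log_precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngrams(pred, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often
        # as it appears in the reference.
        overlap = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        if overlap == 0 or total == 0:
            # Without smoothing, one order with no matches zeroes the score.
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: predictions shorter than the reference are penalized.
    ratio = len(pred) / len(ref)
    brevity_penalty = 1.0 if ratio > 1.0 else math.exp(1.0 - 1.0 / ratio)
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_order)


long_vs_short = simple_bleu("the cat sat on the mat", "the cat sat")
short_vs_long = simple_bleu("the cat sat", "the cat sat on the mat")
```

The two calls at the end give different scores, because precision and the brevity penalty both depend on which text is treated as the prediction. BLEU is not symmetric, so it matters which column goes into predictions and which into references.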

from whylogs.experimental.core.udf_schema import register_dataset_udf

@register_dataset_udf(["prompt", "response"], "response.bleu_score_to_prompt")
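def bleu_score(text):
  scores = []
  # Score each response (prediction) against its own prompt (reference),
  # matching the argument order of the single-row example above.
  for prompt, response in zip(text["prompt"], text["response"]):
    scores.append(
      bleu.compute(
        predictions=[response], 
        references=[prompt], 
        max_order=2
      )["bleu"]
    )
  return scores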

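This registers the BLEU calculation as a whylogs dataset UDF: the function receives the prompt and response columns, returns one score per row, and the result is exposed as response.bleu_score_to_prompt so we can analyze it across the dataset.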
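If the decorator pattern is new to you, the idea can be sketched with a toy registry. This is not how whylogs is implemented; register_toy_udf, apply_toy_udfs, and the length_ratio metric are all hypothetical names used only to show the shape of a dataset UDF: named input columns in, one value per row out.

```python
# Toy stand-in for a UDF registry (hypothetical, not the whylogs internals).
REGISTRY = {}


def register_toy_udf(input_columns, output_column):
    """Remember which columns a function needs and where its result goes."""
    def decorator(func):
        REGISTRY[output_column] = (input_columns, func)
        return func
    return decorator


def apply_toy_udfs(table):
    """Run every registered function and add its output as a new column."""
    annotated = dict(table)
    for output_column, (input_columns, func) in REGISTRY.items():
        inputs = {name: table[name] for name in input_columns}
        annotated[output_column] = list(func(inputs))
    return annotated


@register_toy_udf(["prompt", "response"], "response.length_ratio")
def length_ratio(text):
    # One score per row: response length relative to prompt length.
    return [
        len(response) / max(len(prompt), 1)
        for prompt, response in zip(text["prompt"], text["response"])
    ]


table = {"prompt": ["hi", "abcd"], "response": ["hello!", "ab"]}
annotated = apply_toy_udfs(table)
```

After this runs, annotated holds the original two columns plus a response.length_ratio column with one value per row.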

Let’s visualize:

helpers.visualize_langkit_metric(
    chats, 
    "response.bleu_score_to_prompt", 
    numeric=True)

And check low-scoring responses:

helpers.show_langkit_critical_queries(
    chats, 
    "response.bleu_score_to_prompt", 
    ascending=True)

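2️⃣ BERTScore – Semantic Similarity Matters!

BLEU only counts exact n-gram overlaps, so a good paraphrase can still score poorly. BERTScore instead compares contextual token embeddings from a pretrained model (here distilbert-base-uncased), which lets it credit semantic similarity: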

bertscore = evaluate.load("bertscore")

bertscore.compute(
    predictions=[chats.loc[2, "prompt"]],
    references=[chats.loc[2, "response"]],
    model_type="distilbert-base-uncased")
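The core of BERTScore is a greedy matching step over token embeddings. Below is a minimal sketch of that step, assuming the embeddings are already computed; the vectors here are made-up 2D toys rather than real model outputs, and greedy_match_scores is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors, 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_match_scores(pred_vectors, ref_vectors):
    """Return (precision, recall, f1) from greedy best-match cosine similarity."""
    # Precision: each predicted token takes its best match in the reference.
    precision = sum(
        max(cosine(p, r) for r in ref_vectors) for p in pred_vectors
    ) / len(pred_vectors)
    # Recall: each reference token takes its best match in the prediction.
    recall = sum(
        max(cosine(r, p) for p in pred_vectors) for r in ref_vectors
    ) / len(ref_vectors)
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings: two predicted tokens, one reference token.
pred = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0]]
forward = greedy_match_scores(pred, ref)
backward = greedy_match_scores(ref, pred)
```

In this sketch, swapping the two arguments swaps precision and recall but leaves F1 unchanged, which is why the F1 value is a convenient single number to track.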

We extend this across the dataset:

@register_dataset_udf(["prompt", "response"], "response.bert_score_to_prompt")
def bert_score(text):
  return bertscore.compute(
      predictions=text["prompt"].to_numpy(),
      references=text["response"].to_numpy(),
      model_type="distilbert-base-uncased"
    )["f1"]

Now, let’s visualize it:

helpers.visualize_langkit_metric(
    chats, 
    "response.bert_score_to_prompt", 
    numeric=True)

Identify weak responses:

helpers.show_langkit_critical_queries(
    chats, 
    "response.bert_score_to_prompt", 
    ascending=True)

Apply UDFs to annotate the dataset:

from whylogs.experimental.core.udf_schema import udf_schema

annotated_chats, _ = udf_schema().apply_udfs(chats)

Filter for potential hallucinations, here responses whose BERTScore F1 against the prompt is 0.75 or lower:

helpers.evaluate_examples(
  annotated_chats[annotated_chats["response.bert_score_to_prompt"] <= 0.75],
  scope="hallucination")

🔍 Response Self-Similarity – Detecting Repetitive or Contradictory Outputs

We load an extended dataset with multiple response variations:

chats_extended = pd.read_csv("./chats_extended.csv")
chats_extended.head(5)

1️⃣ Sentence Embedding Cosine Distance

We use sentence embeddings to measure response similarity:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

Registering a self-similarity function:

from sentence_transformers.util import pairwise_cos_sim

@register_dataset_udf(["response", "response2", "response3"], 
                      "response.sentence_embedding_selfsimilarity")
def sentence_embedding_selfsimilarity(text):
  response_embeddings = model.encode(text["response"].to_numpy())
  response2_embeddings = model.encode(text["response2"].to_numpy())
  response3_embeddings = model.encode(text["response3"].to_numpy())

  cos_sim_with_response2 = pairwise_cos_sim(
    response_embeddings, response2_embeddings
    )
  cos_sim_with_response3  = pairwise_cos_sim(
    response_embeddings, response3_embeddings
    )

  return (cos_sim_with_response2 + cos_sim_with_response3) / 2
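The arithmetic inside that UDF is easy to check by hand. Here is a dependency-free sketch of the same averaging, assuming each response has already been turned into an embedding; the 2D vectors are invented for illustration and self_similarity is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_similarity(response, response2, response3):
    """Per row, average the similarity of the first response to the other two."""
    scores = []
    for first, second, third in zip(response, response2, response3):
        scores.append((cosine(first, second) + cosine(first, third)) / 2)
    return scores


# Row 0: the three responses agree. Row 1: the third one diverges completely.
response = [[1.0, 1.0], [1.0, 0.0]]
response2 = [[2.0, 2.0], [1.0, 0.0]]
response3 = [[1.0, 1.0], [0.0, 1.0]]
scores = self_similarity(response, response2, response3)
```

A row where all three responses point the same way scores close to 1, while a row with one divergent response is pulled down toward 0.5, which is the kind of row the ascending sort surfaces first.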

Visualize:

helpers.visualize_langkit_metric(
    chats_extended, 
    "response.sentence_embedding_selfsimilarity", 
    numeric=True)

Check critical queries:

helpers.show_langkit_critical_queries(
    chats_extended, 
    "response.sentence_embedding_selfsimilarity", 
    ascending=True)

2️⃣ LLM Self-Evaluation – Can the AI Score Itself?

We can also ask an OpenAI GPT model to rate how consistent a response is with the context it is given:

import openai
import helpers

openai.api_key = helpers.get_openai_key()
openai.base_url = helpers.get_openai_base_url()

Self-evaluation function:

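def prompt_single_llm_selfsimilarity(dataset, index):
    # Note: openai.ChatCompletion is the pre-1.0 openai SDK interface; on that
    # SDK the endpoint attribute is openai.api_base rather than openai.base_url.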
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "system",
            "content": f"""You will be provided with a text passage \
            and your task is to rate its consistency with the provided context...
            """
        }]
    )
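The model's answer comes back as text, so it has to be turned into a number before it can be thresholded. The sketch below shows one way to do that, assuming the prompt asks for a score between 0.0 and 1.0 (the full prompt is elided above, so that range is an assumption). parse_consistency_score is a hypothetical helper, not part of the course code.

```python
import re


def parse_consistency_score(reply_text):
    """Extract the first number from an LLM reply and clamp it to [0.0, 1.0].

    Returns None when the reply contains no number at all.
    """
    match = re.search(r"-?\d*\.?\d+", reply_text)
    if match is None:
        return None
    value = float(match.group())
    # Clamp, since a model can ignore the requested range.
    return min(1.0, max(0.0, value))


examples = ["0.85", "Score: 1", "I'd rate it 0.7.", "no number here"]
parsed = [parse_consistency_score(text) for text in examples]
```

With the pre-1.0 SDK used above, the reply text would typically be read from response["choices"][0]["message"]["content"] before being passed to a helper like this.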

Filter low-consistency responses using the response.prompted_selfsimilarity column:

chats_extended[
    chats_extended["response.prompted_selfsimilarity"] <= 0.8
]

🔗 Wrapping Up

That was an incredible session! We learned how to evaluate LLM responses for relevance, semantic similarity, and self-consistency. Shoutout to DeepLearning.AI for this amazing course! 🏆

Until next time, happy learning! 🚀
