Exploring Quality and Safety for LLM Applications – Lesson 2

Hi everyone! I'm Mojtaba Maleki, an AI Researcher and Software Engineer at The IT Solutions Hungary. Born on February 11, 2002, I hold a BSc in Computer Science from the University of Debrecen. I'm passionate about creating smart, efficient systems, especially in the fields of Machine Learning, Natural Language Processing, and Full-Stack Development. Over the years, I've worked on diverse projects, from intelligent document processing to LLM-based assistants and scalable cloud applications. I've also authored four books on Computer Science, earned industry-recognized certifications from Google, Meta, and IBM, and contributed to research projects focused on medical imaging and AI-driven automation. Outside of work, I enjoy learning new things, mentoring peers, and yes, I'm still a great cook. So whether you need help debugging a model or seasoning a stew, I’ve got you covered!
Learning is a Blast! 🚀
Hey everyone! I’m back with another exciting deep dive into Quality and Safety for LLM Applications, the free course on Coursera by DeepLearning.AI. Last time, we explored dataset setup, prompt-response relevance, data leakage, and more. If you missed that, check out my previous post! Now, let’s jump into Lesson 2, where we evaluate LLM-generated responses using BLEU scores, BERT scores, and self-similarity metrics.
📌 Setting Up the Playground
First things first, let’s import the necessary libraries and load the dataset:
import helpers
import evaluate
import pandas as pd
pd.set_option('display.max_colwidth', None)
chats = pd.read_csv("./chats.csv")
Now, our chats DataFrame is ready! It contains prompts and responses that we’ll analyze.
🎯 Measuring Prompt-Response Relevance
1️⃣ BLEU Score – How Well Do Responses Match Prompts?
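The BLEU (Bilingual Evaluation Understudy) score measures n-gram overlap between a candidate text and a reference text. Here the response is the candidate and the prompt is the reference, so the score tells us how much of the response's wording also appears in the prompt. We compute it as follows: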
bleu = evaluate.load("bleu")
bleu.compute(
    predictions=[chats.loc[2, "response"]],
    references=[chats.loc[2, "prompt"]],
    max_order=2)
This calculates the BLEU score for the response at index 2, comparing it to its corresponding prompt. But let’s scale it up!
from whylogs.experimental.core.udf_schema import register_dataset_udf

@register_dataset_udf(["prompt", "response"], "response.bleu_score_to_prompt")
def bleu_score(text):
    scores = []
    for prompt, response in zip(text["prompt"], text["response"]):
        scores.append(
            bleu.compute(
                # Same direction as the single-row example: response vs. prompt
                predictions=[response],
                references=[prompt],
                max_order=2
            )["bleu"]
        )
    return scores
This function registers the BLEU score calculation as a whylogs UDF (user-defined function), so the metric can be computed for every row in the dataset.
Let’s visualize:
helpers.visualize_langkit_metric(
    chats,
    "response.bleu_score_to_prompt",
    numeric=True)
And check low-scoring responses:
helpers.show_langkit_critical_queries(
    chats,
    "response.bleu_score_to_prompt",
    ascending=True)
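Curious what that number actually measures? Here is a minimal pure-Python sketch of BLEU with max_order=2. It is an illustration under simplifying assumptions (whitespace tokenization, a single reference, no smoothing), not the implementation inside the evaluate library, and simple_bleu and ngrams are hypothetical helper names I made up for this post.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_order=2):
    """Simplified single-reference BLEU with whitespace tokenization."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred:
        return 0.0

    precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngrams(pred, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as the reference has it.
        overlap = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing: one empty order zeroes the whole score
        precisions.append(overlap / total)

    # Geometric mean of the n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    # Brevity penalty: predictions no longer than the reference get scaled down.
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * geo_mean


short_text = "the capital is paris"
long_text = "the capital of france is paris"
forward = simple_bleu(short_text, long_text)   # short text scored against the long one
backward = simple_bleu(long_text, short_text)  # roles swapped
print(round(forward, 3), round(backward, 3))
```

The two printed values differ: because of the brevity penalty and the precision-based counting, BLEU is not symmetric, so it matters which text goes into predictions and which into references.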
2️⃣ BERT Score – Semantic Similarity Matters!
BLEU only counts exact n-gram overlaps, whereas BERTScore compares contextual token embeddings from a pretrained transformer, so a paraphrase can still score well even when the wording differs:
bertscore = evaluate.load("bertscore")
bertscore.compute(
    predictions=[chats.loc[2, "prompt"]],
    references=[chats.loc[2, "response"]],
    model_type="distilbert-base-uncased")
We extend this across the dataset:
@register_dataset_udf(["prompt", "response"], "response.bert_score_to_prompt")
def bert_score(text):
    return bertscore.compute(
        predictions=text["prompt"].to_numpy(),
        references=text["response"].to_numpy(),
        model_type="distilbert-base-uncased"
    )["f1"]
Now, let’s visualize it:
helpers.visualize_langkit_metric(
    chats,
    "response.bert_score_to_prompt",
    numeric=True)
Identify weak responses:
helpers.show_langkit_critical_queries(
    chats,
    "response.bert_score_to_prompt",
    ascending=True)
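If you are wondering how BERTScore gets from embeddings to precision, recall, and F1, here is a toy sketch of the greedy matching idea. Everything in it is an assumption for illustration: the 2-d vectors are made-up stand-ins for real contextual embeddings, greedy_match_scores is a hypothetical name, and it skips optional extras such as idf weighting and baseline rescaling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(pred_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy token matching."""
    # Precision: match each predicted token to its most similar reference token.
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: match each reference token to its most similar predicted token.
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up "token embeddings": two tokens for the prompt, three for the response.
prompt_vecs = [[1.0, 0.0], [0.0, 1.0]]
response_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

p, r, f1 = greedy_match_scores(prompt_vecs, response_vecs)
p_swapped, r_swapped, f1_swapped = greedy_match_scores(response_vecs, prompt_vecs)
print(p, r, f1)
print(p_swapped, r_swapped, f1_swapped)
```

Swapping the two inputs swaps precision and recall but leaves F1 unchanged in this unweighted setup, which is why a UDF that only keeps the "f1" value is not sensitive to which column is passed as predictions.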
Apply UDFs to annotate the dataset:
from whylogs.experimental.core.udf_schema import udf_schema
annotated_chats, _ = udf_schema().apply_udfs(chats)
Filter for potential hallucinations (where scores are low):
helpers.evaluate_examples(
    annotated_chats[annotated_chats["response.bert_score_to_prompt"] <= 0.75],
    scope="hallucination")
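The register, apply, filter flow used above can feel a bit magical, so here is a tiny pure-Python sketch of the pattern. To be clear, this is not how whylogs is implemented: UDF_REGISTRY, register_udf, apply_udfs, and the length-ratio metric are all hypothetical names invented for this example, and the "dataset" is a plain dict of lists instead of a DataFrame.

```python
# Hypothetical stand-in for the idea behind a UDF registry (not whylogs internals).
UDF_REGISTRY = {}


def register_udf(input_columns, output_column):
    """Decorator that remembers which function produces which output column."""
    def decorator(func):
        UDF_REGISTRY[output_column] = (input_columns, func)
        return func
    return decorator


@register_udf(["prompt", "response"], "response.length_ratio_to_prompt")
def length_ratio(columns):
    # Toy metric: response length divided by prompt length, one value per row.
    return [len(r) / len(p) for p, r in zip(columns["prompt"], columns["response"])]


def apply_udfs(table):
    """Return a copy of the table with one new column per registered UDF."""
    annotated = {name: list(values) for name, values in table.items()}
    for output_column, (input_columns, func) in UDF_REGISTRY.items():
        inputs = {col: table[col] for col in input_columns}
        annotated[output_column] = func(inputs)
    return annotated


toy_chats = {
    "prompt": ["Hi!", "Tell me a joke."],
    "response": ["Hello!", "No."],
}
annotated = apply_udfs(toy_chats)

# Keep the row indices whose score falls at or below a threshold.
scores = annotated["response.length_ratio_to_prompt"]
low = [i for i, score in enumerate(scores) if score <= 0.75]
print(scores, low)
```

The nice part of this design is that metrics are declared once, next to their column names, and every later step only needs the output column name.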
🔍 Response Self-Similarity – Detecting Repetitive or Contradictory Outputs
We load an extended dataset with multiple response variations:
chats_extended = pd.read_csv("./chats_extended.csv")
chats_extended.head(5)
1️⃣ Sentence Embedding Cosine Distance
We use sentence embeddings to measure response similarity:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
Registering a self-similarity function:
from sentence_transformers.util import pairwise_cos_sim

@register_dataset_udf(["response", "response2", "response3"],
                      "response.sentence_embedding_selfsimilarity")
def sentence_embedding_selfsimilarity(text):
    response_embeddings = model.encode(text["response"].to_numpy())
    response2_embeddings = model.encode(text["response2"].to_numpy())
    response3_embeddings = model.encode(text["response3"].to_numpy())
    cos_sim_with_response2 = pairwise_cos_sim(
        response_embeddings, response2_embeddings
    )
    cos_sim_with_response3 = pairwise_cos_sim(
        response_embeddings, response3_embeddings
    )
    return (cos_sim_with_response2 + cos_sim_with_response3) / 2
Visualize:
helpers.visualize_langkit_metric(
    chats_extended,
    "response.sentence_embedding_selfsimilarity",
    numeric=True)
Check critical queries:
helpers.show_langkit_critical_queries(
    chats_extended,
    "response.sentence_embedding_selfsimilarity",
    ascending=True)
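To see the arithmetic behind the self-similarity score without downloading a model, here is a small sketch that uses made-up 2-d vectors in place of real sentence embeddings. The self_similarity function is a hypothetical helper; it mirrors the idea of averaging the row-wise cosine similarity of the first response with each of the two alternatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_similarity(response_vecs, response2_vecs, response3_vecs):
    """Per row: mean cosine similarity of response with response2 and response3."""
    scores = []
    for r1, r2, r3 in zip(response_vecs, response2_vecs, response3_vecs):
        scores.append((cosine(r1, r2) + cosine(r1, r3)) / 2)
    return scores


# Made-up embeddings for two rows of a dataset.
response = [[1.0, 0.0], [1.0, 0.0]]
response2 = [[1.0, 0.0], [0.0, 1.0]]
response3 = [[0.6, 0.8], [0.0, 1.0]]

scores = self_similarity(response, response2, response3)
print(scores)
```

The first row scores high because both alternatives point in a similar direction to the original response. The second row scores zero because both alternatives are orthogonal to it, which is the kind of row that would show up at the top of an ascending sort.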
2️⃣ LLM Self-Evaluation – Can the AI Score Itself?
We use OpenAI’s GPT to self-assess response consistency:
import openai
import helpers
openai.api_key = helpers.get_openai_key()
openai.base_url = helpers.get_openai_base_url()
Self-evaluation function:
def prompt_single_llm_selfsimilarity(dataset, index):
    # ChatCompletion.create is the pre-1.0 openai interface;
    # newer releases use openai.chat.completions.create instead.
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "system",
            "content": f"""You will be provided with a text passage \
            and your task is to rate its consistency with the provided context...
            """
        }]
    )
Filter low-consistency responses:
chats_extended[
chats_extended["response.prompted_selfsimilarity"] <= 0.8
]
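One practical detail when an LLM grades itself: the reply comes back as text, so it has to be turned into a number before any threshold can be applied. Here is a hedged sketch of that step. It assumes the model was asked to answer with a number between 0.0 and 1.0; the exact reply format is not shown in this post, and parse_consistency_score is a hypothetical helper of my own.

```python
import re


def parse_consistency_score(reply_text):
    """Extract the first number in a reply and clamp it to [0.0, 1.0].

    Returns None when the reply contains no number at all.
    """
    match = re.search(r"\d*\.?\d+", reply_text)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group())))


# Made-up replies standing in for model output.
replies = ["0.85", "Consistency: 0.4", "I cannot rate this."]
scores = [parse_consistency_score(reply) for reply in replies]

# Flag rows at or below the threshold, skipping replies that could not be parsed.
flagged = [i for i, s in enumerate(scores) if s is not None and s <= 0.8]
print(scores, flagged)
```

Returning None for unparseable replies keeps them out of the threshold filter, so they can be reviewed separately instead of being silently counted as low or high.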
🔗 Wrapping Up
That was an incredible session! We learned how to evaluate LLM responses for relevance, semantic similarity, and self-consistency. Shoutout to DeepLearning.AI for this amazing course! 🏆
Until next time, happy learning! 🚀




