Perplexity Evaluation
This short example demonstrates how to compute Perplexity using the shared Practicus AI eval_metrics helpers.
Perplexity measures how well a language model predicts text — lower is better.
Unlike BLEU / ROUGE / METEOR, Perplexity does not require reference texts.
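Formally, perplexity is the exponentiated average negative log-likelihood of a token sequence:

$$\mathrm{PPL}(x) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$

where $p(x_i \mid x_{<i})$ is the probability the language model assigns to token $x_i$ given the preceding tokens.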
1. Import the metric
The PerplexityMetric uses a causal LM (default: gpt2) to compute token-level negative log-likelihood.
from eval_metrics import PerplexityMetric
# Initialize metric
perp = PerplexityMetric(model_name="gpt2")
perp
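Under the hood, a causal-LM perplexity computation looks roughly like the sketch below. It uses HuggingFace transformers directly and is illustrative only; the actual PerplexityMetric implementation may batch or handle special tokens differently.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only; not the eval_metrics implementation.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def sketch_perplexity(text: str) -> float:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean token-level
        # negative log-likelihood as `loss`.
        out = lm(**inputs, labels=inputs["input_ids"])
    # Perplexity is the exponential of the average NLL.
    return torch.exp(out.loss).item()

sketch_perplexity("The quick brown fox jumps over the lazy dog.")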
2. Define some generated text
We evaluate Perplexity on model outputs (or any text).
predictions = [
    "The quick brown fox jumps over the lazy dog.",
    "Practicus AI helps teams build scalable machine learning systems.",
    "Large language models are transforming enterprise analytics.",
]
predictions
3. Compute Perplexity
Running the metric returns both the average perplexity across all texts and a per-sample perplexity for each one.
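A call along the following lines produces the result object used in the next step. The compute method name is an assumption in this sketch; result.per_sample_scores matches the attribute referenced below.

# `compute` is an assumed method name for this sketch;
# `result.per_sample_scores` matches the attribute used in step 4.
result = perp.compute(predictions)
result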
4. View results as a table
This is useful for inspecting which generated outputs are harder for the LM to model.
Lower scores are better: the model finds the text more predictable.
import pandas as pd
df = pd.DataFrame(
    {
        "text": predictions,
        "perplexity": result.per_sample_scores,
    }
)
df
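To surface the outputs the LM finds hardest to model, sort by the score:

# Highest perplexity (least predictable) first
df.sort_values("perplexity", ascending=False)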
Summary
- Perplexity evaluates how well an LM predicts the text.
- Unlike BLEU/ROUGE/METEOR, no references are required.
- This same metric utility is used by Practicus AI model hosting to automatically compute and log evaluation metrics.