LLM-Evaluation

Perplexity of fixed-length models

PPL (perplexity) is one of the most common metrics for evaluating language models. Given a tokenized sequence $X = (x_0, x_1, \dots, x_t)$, the perplexity of $X$ is \(PPL(X) = \exp\left\{ -\frac{1}{t} \sum_{i=1}^{t} \log p_{\theta}(x_i \mid x_{<i}) \right\}\), where $\log p_{\theta}(x_i \mid x_{<i})$ is the log-likelihood of the $i$-th token conditioned on the preceding tokens $x_{<i}$. In other words, the first $i-1$ tokens are used to predict the $i$-th token.

We can simplify the formula to \(PPL(X) = e^{\text{loss}}\), where loss is the cross-entropy loss \(\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_{<i})\), with $N$ the number of tokens and $P(y_i \mid x_{<i})$ the probability of predicting the token at position $i$ given the previous tokens.
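As a minimal sketch of this relation (assuming PyTorch; the helper name and tensor shapes are illustrative, and the logits are taken as already computed by a causal LM):

```python
import torch
import torch.nn.functional as F

def perplexity_from_logits(logits: torch.Tensor, input_ids: torch.Tensor) -> float:
    """Compute PPL = exp(cross-entropy loss) for one sequence.

    logits:    [seq_len, vocab_size] raw scores from a causal LM
    input_ids: [seq_len] token ids of the same sequence
    """
    # Shift so that position i predicts token i+1 (autoregressive setup).
    shift_logits = logits[:-1, :]
    shift_labels = input_ids[1:]
    # Mean negative log-likelihood over the predicted tokens.
    loss = F.cross_entropy(shift_logits, shift_labels)
    return torch.exp(loss).item()  # PPL = e^{loss}
```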

Using a sliding window

We can set a stride hyperparameter for evaluation. For example, suppose the stride is 512, the total text length is 2000 tokens, and max_length is 1024 (a code sketch of this procedure follows the list below).

First window:

  • input: [0:1024] (tokens 0 to 1023)
  • target: [512:1024] (the first 512 tokens are masked), e.g. target_ids = [-100, -100, …, -100, token_512, token_513, …, token_1023]
  • Model prediction: the model autoregressively predicts the probability distribution over the last 512 tokens using the full 1024-token context.

Second window:

  • input: [512:1536] (tokens 512 to 1535)
  • target: [1024:1536] (the first 512 tokens are masked), and so on for the remaining windows.
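Below is a rough sketch of this windowing, assuming a Hugging Face-style causal LM where label positions set to -100 are ignored by the loss; the model choice ("gpt2"), the placeholder text, and the variable names are illustrative, and the tail shorter than max_length is ignored for simplicity:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # assumption: HF Transformers is available

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()    # "gpt2" is just an illustrative choice
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "..."  # placeholder for the long evaluation text
input_ids = tokenizer(text, return_tensors="pt").input_ids     # shape [1, seq_len]
seq_len = input_ids.size(1)

stride, max_length = 512, 1024
nlls = []
for begin in range(0, max(seq_len - max_length, 0) + 1, stride):
    end = begin + max_length
    window = input_ids[:, begin:end]               # e.g. [0:1024], then [512:1536], ...
    target_ids = window.clone()
    target_ids[:, : max_length - stride] = -100    # mask the first 512 tokens of each window

    with torch.no_grad():
        # label positions equal to -100 do not contribute to the cross-entropy loss
        nll = model(window, labels=target_ids).loss
    nlls.append(nll)

# average negative log-likelihood over windows, then exponentiate
ppl = torch.exp(torch.stack(nlls).mean())
print(f"sliding-window perplexity ≈ {ppl.item():.2f}")
```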

BLEU

For more detail, see the BLEU SCORE page. BLEU is based on n-gram matching; for example:

Assume:

Reference Translation: “The cat is on the mat”

Candidate Translation (Machine Translation): “The cat is on mat”

We’ll calculate the BLEU score using 1-gram and 2-gram precision.

Step 1: Calculate n-gram Precision

First, we count the matching 1-grams and 2-grams (a counting sketch in code follows these two sub-steps).

1. 1-gram Precision

  • 1-gram represents single word matches.
  • 1-grams in Reference: ["The", "cat", "is", "on", "the", "mat"]
  • 1-grams in Candidate: ["The", "cat", "is", "on", "mat"]

We compare the candidate translation with the reference:

  • Matches: “The”, “cat”, “is”, “on”, “mat”

There are 5 matches, and the candidate has 5 words, so the 1-gram precision is:

$p_1 = \frac{5}{5} = 1.0$

2. 2-gram Precision

  • 2-gram represents matches of consecutive pairs of words.
  • 2-grams in Reference: ["The cat", "cat is", "is on", "on the", "the mat"]
  • 2-grams in Candidate: ["The cat", "cat is", "is on", "on mat"]

We compare the candidate and reference:

  • Matches: “The cat”, “cat is”, “is on”

There are 3 matches, and the candidate has 4 2-grams, so the 2-gram precision is:

$p_2 = \frac{3}{4} = 0.75$
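These counts can be reproduced with a small sketch in plain Python (the `ngram_precision` helper is illustrative; it uses clipped counts, which is how BLEU handles repeated n-grams):

```python
from collections import Counter

def ngrams(tokens, n):
    """All consecutive n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is matched at most
    as many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matches / max(sum(cand_counts.values()), 1)

reference = "The cat is on the mat".split()
candidate = "The cat is on mat".split()

print(ngram_precision(candidate, reference, 1))  # 5/5 = 1.0
print(ngram_precision(candidate, reference, 2))  # 3/4 = 0.75
```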

Step 2: Calculate BP (Brevity Penalty)

The Brevity Penalty (BP) is used to penalize overly short translations. It is defined as:

$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}$

Where:

  • c is the length of the candidate translation (number of words).
  • r is the length of the reference translation.

In our example:

  • Length of Candidate (c) = 5
  • Length of Reference (r) = 6

Since $c < r$:

$BP = e^{(1 - \frac{6}{5})} = e^{-0.2} \approx 0.8187$

Step 3: Calculate BLEU Score

Now we use the n-gram precisions and BP to calculate the BLEU score. Let’s use the 1-gram and 2-gram precisions with equal weights (i.e., $w_1 = w_2 = 0.5$):

$\text{BLEU} = BP \cdot \exp \left( w_1 \cdot \log(p_1) + w_2 \cdot \log(p_2) \right)$

Substituting the values:

$\text{BLEU} = 0.8187 \cdot \exp\left(0.5 \cdot \log(1.0) + 0.5 \cdot \log(0.75)\right) = 0.8187 \cdot \exp(0.5 \cdot \log(0.75)) = 0.8187 \cdot \exp(-0.1438) \approx 0.8187 \cdot 0.8660 \approx 0.7090$

Thus, the final BLEU score is approximately 0.709, or about 70.9%.
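Putting the steps together, here is a short sketch that reproduces the arithmetic above (plain Python; equal weights over the 1-gram and 2-gram precisions, as assumed in this example):

```python
import math

p1, p2 = 1.0, 0.75   # n-gram precisions from Step 1
c, r = 5, 6          # candidate and reference lengths from Step 2

bp = 1.0 if c > r else math.exp(1 - r / c)              # brevity penalty
bleu = bp * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))

print(f"BP = {bp:.4f}, BLEU = {bleu:.4f}")  # BP ≈ 0.8187, BLEU ≈ 0.7090
```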

Tags: LLM