
Technical · Medium

What is the BLEU metric, and how does it evaluate the quality of machine-generated text in natural language processing tasks?


How to structure your answer

Define BLEU as a metric for evaluating machine-generated text by comparing it to human references. Explain how it combines n-gram precision (the geometric mean of 1- to 4-gram precisions) with a brevity penalty. Highlight its main application in machine translation and its limitations, such as insensitivity to word order beyond n-grams and to semantic meaning.

Sample answer

The BLEU (Bilingual Evaluation Understudy) metric evaluates the quality of machine-generated text by comparing it to one or more human reference texts. It computes precision over overlapping n-grams (typically sequences of 1–4 words) between the generated text and the references, clipping each n-gram's count at its maximum count in any reference, then combines the n-gram precisions with a geometric mean. Because precision alone would reward overly short outputs, a multiplicative brevity penalty reduces the score when the generated text is shorter than the reference. BLEU is widely used in machine translation, for example to evaluate systems in the WMT shared tasks. However, it has limitations: it captures word order only within n-grams, ignores semantic meaning (synonyms and paraphrases score as mismatches), and corpus-level scores can mask sentence-level quality differences. While BLEU provides a quick, objective measure of lexical similarity, it should be complemented with human evaluation for comprehensive quality assessment.
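The computation described above can be sketched in a few lines of Python. This is an illustrative single-reference, sentence-level version (the function name `bleu` and the whitespace tokenization are simplifying assumptions); real evaluations should use an established implementation such as NLTK's `nltk.translate.bleu_score` or SacreBLEU, which handle multiple references, smoothing, and standardized tokenization.

```python
import math
from collections import Counter


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Minimal single-reference BLEU sketch: clipped n-gram precision,
    geometric mean over n = 1..max_n, and a brevity penalty."""
    cand, ref = candidate.split(), reference.split()

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count at its count in the reference.
        overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precision = overlap / total
        if precision == 0:
            return 0.0  # any zero n-gram precision zeroes the geometric mean
        log_precisions.append(math.log(precision))

    geo_mean = math.exp(sum(log_precisions) / max_n)
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - reference_length / candidate_length).
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

Note how a perfect match scores 1.0, while a candidate missing even one n-gram order entirely scores 0.0 under this unsmoothed formulation; this is why smoothing is used in practice for short sentences.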

Key points to mention

  • BLEU metric
  • n-gram precision
  • brevity penalty
  • machine translation evaluation

Common mistakes to avoid

  • ✗ Confusing BLEU with ROUGE or other evaluation metrics.
  • ✗ Overlooking the brevity penalty component.
  • ✗ Failing to explain how n-grams are used for comparison.