The BLEU score was developed for evaluating the translations produced by automatic machine translation systems. It is not perfect, but it offers five compelling benefits:

  1. It is quick and inexpensive to calculate.
  2. It is easy to understand.
  3. It is language independent.
  4. It correlates highly with human evaluation.
  5. It has been widely adopted.

In addition to translation, the BLEU score can be used for other language generation problems addressed with deep learning methods, such as:

  1. Language generation.
  2. Image caption generation.
  3. Text summarization.
  4. Speech recognition.
  • NLTK experiments for understanding BLEU: The BLEU score calculation in NLTK lets you specify the weight given to each n-gram order. A cumulative n-gram score is calculated by taking the individual n-gram scores at all orders from 1 to n and combining them as a weighted geometric mean (a worked sketch of this relationship follows the examples below).
  1. The weights for BLEU-4 are 0.25 (i.e., 1/4 or 25%) for each of the 1-gram, 2-gram, 3-gram, and 4-gram scores. For example:
# 4-gram cumulative BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(score)
  2. Calculating the cumulative BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores:
# cumulative BLEU scores
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
  3. Calculating the BLEU score for a candidate sentence that differs from the reference in two words, using the default 4-gram weights:
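# default weights: cumulative 4-gram BLEU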
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'sleepy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)
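
To make the weighted geometric mean concrete, the sketch below recomputes the cumulative 2-gram score by hand and compares it with NLTK's result. This is an illustrative reimplementation, not NLTK's own code: the ngrams and modified_precision helpers are local, and the brevity penalty is left out because the candidate and reference here have the same length.

# illustrative sketch: cumulative BLEU as a weighted geometric mean
# of modified n-gram precisions (brevity penalty omitted: equal lengths)
from collections import Counter
from math import exp, log
from nltk.translate.bleu_score import sentence_bleu

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(references, candidate, n):
    # clip each candidate n-gram count by its maximum count in any reference
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']

# cumulative 2-gram score: equal weights on the 1-gram and 2-gram precisions
weights = (0.5, 0.5)
precisions = [modified_precision(reference, candidate, n) for n in (1, 2)]
geo_mean = exp(sum(w * log(p) for w, p in zip(weights, precisions)))
print('manual cumulative 2-gram: %f' % geo_mean)
print('NLTK cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))

Both lines should print 0.500000 for this pair of sentences (NLTK may also warn that there are no matching 3-grams or 4-grams, which is expected here, since those orders carry zero weight).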

Ref:

  1. Papineni, Roukos, Ward, and Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002.