If you use this code, please cite our paper:
@misc{kamigaito2025diversityexplainsinferencescaling,
title={Diversity Explains Inference Scaling Laws: Through a Case Study of Minimum Bayes Risk Decoding},
author={Hidetaka Kamigaito and Hiroyuki Deguchi and Yusuke Sakai and Katsuhiko Hayashi and Taro Watanabe},
year={2025},
eprint={2410.15021},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.15021},
}
The results for each setting are located in ./output/(full|1000)/results.
The generated samples are located in ./output/(full|1000)/samples.
The texts generated by MBR decoding are located in ./output/(full|1000)/decoded.
Due to GitHub's storage limits, these directories are hosted on our Google Drive.
First, create a Python 3.11 virtual environment:
python3.11 -m venv venv
Next, you need to prepare the datasets for each task in model-based-mbr/dataset.
Example: WMT'19 En-De. Use sacrebleu to prepare the benchmark dataset:
cd ./model-based-mbr
mkdir -p ./dataset/wmt19-text
sacrebleu -t wmt19 -l en-de --echo src > ./dataset/wmt19-text/wmt19.en-de.en
sacrebleu -t wmt19 -l en-de --echo ref > ./dataset/wmt19-text/wmt19.en-de.de
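The WMT'19 En-Ru data used in the other settings can presumably be prepared analogously; the output file names below follow the pattern above and are an assumption here:
sacrebleu -t wmt19 -l en-ru --echo src > ./dataset/wmt19-text/wmt19.en-ru.en
sacrebleu -t wmt19 -l en-ru --echo ref > ./dataset/wmt19-text/wmt19.en-ru.ru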
For the CNN/DailyMail and XSum datasets, refer to the instructions in https://github.com/huggingface/transformers/blob/main/examples/legacy/seq2seq/README.md.
Then, set up the required modules:
cd ./model-based-mbr
../venv/bin/pip3.11 install -r requirements.txt
After that, you can generate samples for each setting with the following commands:
cd ./model-based-mbr
# Full samples
bash ./jobs/all/(ancestral|beam|epsilon|nuclues|topk).sh (mscoco-ft|nocaps|samsum|wmt19.en-de|wmt19.en-ru|xsum)
# 1000 samples
bash ./jobs/1000/(ancestral|beam|epsilon|nuclues|topk).sh (mscoco-ft|nocaps|samsum|wmt19.en-de|wmt19.en-ru|xsum)
The generated results will be stored in model-based-mbr/samples/.
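For example, to generate the full set of epsilon-sampled outputs for WMT'19 En-De:
bash ./jobs/all/epsilon.sh wmt19.en-de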
First, you need to set up the following modules (run each block from the repository root):
cd ./mbrs
../venv/bin/pip3.11 install -e .
cd ./transformers
../venv/bin/pip3.11 install -e .
cd ./bert_score
../venv/bin/pip3.11 install -e .
Then, you can run MBR decoding with the following commands:
cd mbrs
bash ./jobs/(1000|full)/(mscoco-ft|nocaps|samsum|wmt19.en-de|wmt19.en-ru|xsum)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064).sh
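For example, MBR decoding over 64 epsilon-sampled candidates on the full WMT'19 En-De set:
bash ./jobs/full/wmt19.en-de/epsilon_064.sh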
First, you need to fine-tune the models:
BERTScore
./scripts/finetune.sh [0-7]
COMET
cd models/comet
./scripts/download.sh
./scripts/finetune.sh [0-7]
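A concrete invocation, assuming the bracketed [0-7] argument is an index selecting one of eight runs (its exact meaning is defined by the scripts):
./scripts/finetune.sh 0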
Then, you can run MAMBR decoding with the following commands:
cd mbrs
bash ./jobs/(1000|full)/(mscoco-ft|nocaps|samsum|wmt19.en-de|wmt19.en-ru|xsum)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064)_(001|002|004|008).sh
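For example, one MAMBR run on the full WMT'19 En-De set:
bash ./jobs/full/wmt19.en-de/epsilon_064_004.sh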
You can reproduce our results by running the following scripts:
# Bias and diversity results in main part
./scripts/evaluate_bias_diversity.sh
# MAMBR results
./scripts/evaluate_mambr.sh
# Bias and diversity results in appendices
./scripts/evaluate_bias_diversity_sup.sh
If you need to obtain individual results on the spot, see the following subsections.
# MBR decoding
./venv/bin/python3.11 ./scripts/evaluate/mt.py --input=./output/(1000|full)/decoded/(wmt19.en-de|wmt19.en-ru)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064).jsonl.gz --output=./output/(1000|full)/results/(wmt19.en-de|wmt19.en-ru)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064).jsonl
# MAMBR decoding
./venv/bin/python3.11 ./scripts/evaluate/mt.py --input=./output/(1000|full)/decoded/(wmt19.en-de|wmt19.en-ru)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064)_(001|002|004|008).jsonl.gz --output=./output/(1000|full)/results/(wmt19.en-de|wmt19.en-ru)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064)_(001|002|004|008).jsonl
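For example, to evaluate a single MBR configuration (64 epsilon samples, full WMT'19 En-De):
./venv/bin/python3.11 ./scripts/evaluate/mt.py --input=./output/full/decoded/wmt19.en-de/epsilon_064.jsonl.gz --output=./output/full/results/wmt19.en-de/epsilon_064.jsonl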
# MBR decoding
./venv/bin/python3.11 ./scripts/evaluate/captioning.py --input=./output/(1000|full)/decoded/(mscoco-ft|nocaps)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064).jsonl.gz --output=./output/(1000|full)/results/(mscoco-ft|nocaps)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064).jsonl
# MAMBR decoding
./venv/bin/python3.11 ./scripts/evaluate/captioning.py --input=./output/(1000|full)/decoded/(mscoco-ft|nocaps)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064)_(001|002|004|008).jsonl.gz --output=./output/(1000|full)/results/(mscoco-ft|nocaps)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064)_(001|002|004|008).jsonl
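For example, for captioning on mscoco-ft with 32 ancestral samples:
./venv/bin/python3.11 ./scripts/evaluate/captioning.py --input=./output/full/decoded/mscoco-ft/ancestral_032.jsonl.gz --output=./output/full/results/mscoco-ft/ancestral_032.jsonl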
# MBR decoding
./venv/bin/python3.11 ./scripts/evaluate/summarization.py --input=./output/(1000|full)/decoded/(samsum|xsum)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064).jsonl.gz --output=./output/(1000|full)/results/(samsum|xsum)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064).jsonl
# MAMBR decoding
./venv/bin/python3.11 ./scripts/evaluate/summarization.py --input=./output/(1000|full)/decoded/(samsum|xsum)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064)_(001|002|004|008).jsonl.gz --output=./output/(1000|full)/results/(samsum|xsum)/(ancestral|beam|epsilon|nuclues|topk)_(004|008|016|032|064)_(001|002|004|008).jsonl
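For example, for summarization on XSum with 16 top-k samples in the 1000-sample setting:
./venv/bin/python3.11 ./scripts/evaluate/summarization.py --input=./output/1000/decoded/xsum/topk_016.jsonl.gz --output=./output/1000/results/xsum/topk_016.jsonl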
Our modifications are based on the following commits of each module:
- mbrs: bdadd3a37ea8d97cb802c80daa57c8bba0f95a7b
- model-based-mbr: 445c6123c1afe8ba10a91c3d91e104f2b0ceb8ea
- bert_score: 19e7f551fe4fa43fdd07b8129ae947015b902b2d
- transformers: c409cd81777fb27aadc043ed3d8339dbc020fb3b