We offer the following tasks, depending on the selected test suite:
| Test suite | Tasks | Description |
| --- | --- | --- |
| Nejumi Leaderboard 3 | Jaster | Measure the model’s ability to understand and process the Japanese language. |
| | JBBQ | Measure social bias in Japanese question answering by LLMs. |
| | JTruthfulQA | Measure the truthfulness of model answers to Japanese questions. |
| LM Evaluation Harness | ARC | Measure scientific reasoning on grade-school questions. |
| | GSM8K | Measure multi-step reasoning in math word problems. |
| | HellaSwag | Measure contextual commonsense reasoning. |
| | HumanEval | Measure Python code generation ability. |
| | IFEval | Measure instruction-following and harmful input rejection. |
| | LAMBADA | Measure long-range context understanding. |
| | MMLU | Measure reasoning across 57 academic and professional subjects. |
| | OpenBookQA | Measure science QA using facts and commonsense. |
| | PIQA | Measure physical commonsense reasoning. |
| | SciQ | Measure science multiple-choice QA at elementary and middle school levels. |
| | TruthfulQA | Measure truthfulness in open-domain question answering. |
| | Winogrande | Measure semantic understanding in pronoun disambiguation tasks. |
| VLM Evaluation Kit | ChartQA | Measure chart-based data interpretation and question answering skills. |
| | DocVQA | Measure question answering performance on document images. |
| | InfoVQA | Measure question answering based on information embedded in images. |
| | MTVQA | Measure multilingual visual-text question answering performance. |
| | OCRBench | Measure optical character recognition accuracy across varied datasets. |
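For reference, tasks in the LM Evaluation Harness suite correspond to task identifiers in EleutherAI's lm-evaluation-harness, which can also be run standalone through its Python API. The sketch below is illustrative only: the checkpoint name, task identifiers, and batch size are placeholders, and the exact invocation here is an assumption rather than the mechanism used by this integration.

```python
# Minimal sketch: evaluating a Hugging Face model on two tasks from the
# table above with the lm-evaluation-harness Python API (v0.4+).
# The model checkpoint, tasks, and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                       # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",   # example checkpoint
    tasks=["arc_challenge", "gsm8k"],                 # task identifiers
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (accuracy, exact match, etc.) are returned under "results".
print(results["results"])
```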