(Others) Tasks
Updated on 01 Oct 2025

The available tasks depend on the selected test suite:

| Test suite | Task | Description |
|---|---|---|
| Nejumi Leaderboard 3 | Jaster | Measure the model’s ability to understand and process the Japanese language. |
| | JBBQ | Measure social bias in Japanese question answering by LLMs. |
| | JtruthfulQA | Measure the truthfulness of model answers to Japanese questions. |
| LM Evaluation Harness | ARC | Measure scientific reasoning on grade-school questions. |
| | GSM8K | Measure multi-step reasoning in math word problems. |
| | HellaSwag | Measure contextual commonsense reasoning. |
| | HumanEval | Measure Python code generation ability. |
| | IFEval | Measure instruction-following and harmful input rejection. |
| | LAMBADA | Measure long-range context understanding. |
| | MMLU | Measure reasoning across 57 academic/professional subjects. |
| | OpenBookQA | Measure science QA using facts and commonsense. |
| | PIQA | Measure physical commonsense reasoning. |
| | SciQ | Measure science multiple-choice QA for elementary & middle school levels. |
| | TruthfulQA | Measure truthfulness in open-domain question answering. |
| | Winogrande | Measure semantic understanding in pronoun disambiguation tasks. |
| VLM Evaluation Kit | ChartQA | Measure chart-based data interpretation and question answering skills. |
| | DocVQA | Measure question answering performance on document images. |
| | InfoVQA | Measure question answering based on information embedded in images. |
| | MTVQA | Measure multilingual visual-text question answering performance. |
| | OCRBench | Measure optical character recognition accuracy across varied datasets. |
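
If you need to work with this list in a script (for example, to check which tasks belong to a suite before starting a run), the table above can be written as a simple lookup. The following is only a minimal sketch: `TASKS_BY_SUITE`, `tasks_for`, and `suite_of` are hypothetical names chosen for illustration and are not part of any product API. The suite and task names are copied from the table.

```python
# Hypothetical lookup mirroring the table above; not an official API.
TASKS_BY_SUITE: dict[str, list[str]] = {
    "Nejumi Leaderboard 3": ["Jaster", "JBBQ", "JtruthfulQA"],
    "LM Evaluation Harness": [
        "ARC", "GSM8K", "HellaSwag", "HumanEval", "IFEval", "LAMBADA",
        "MMLU", "OpenBookQA", "PIQA", "SciQ", "TruthfulQA", "Winogrande",
    ],
    "VLM Evaluation Kit": ["ChartQA", "DocVQA", "InfoVQA", "MTVQA", "OCRBench"],
}


def tasks_for(suite: str) -> list[str]:
    """Return a copy of the task names listed for the given test suite."""
    if suite not in TASKS_BY_SUITE:
        raise ValueError(f"Unknown test suite: {suite!r}")
    return list(TASKS_BY_SUITE[suite])


def suite_of(task: str) -> str:
    """Return the test suite that lists the given task (case-insensitive)."""
    wanted = task.lower()
    for suite, tasks in TASKS_BY_SUITE.items():
        if any(name.lower() == wanted for name in tasks):
            return suite
    raise ValueError(f"Unknown task: {task!r}")


if __name__ == "__main__":
    print(tasks_for("VLM Evaluation Kit"))
    print(suite_of("gsm8k"))
```

Keeping the mapping in one place means a change to the table only needs to be mirrored in a single dictionary.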