Select the test suite appropriate for the model you want to evaluate.
We offer the following test suites:
| Test suite | Purpose | Best for | Tasks |
|---|---|---|---|
| Standard | Evaluate models using your own dataset. | Internal benchmarks, domain-specific tasks (e.g., finance, medical, …) | Test similarity, BLEU, Fuzzy match, ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-LSUM |
| Nejumi Leaderboard 3 | Benchmark LLMs, especially on Japanese-language tasks. Reference: Nejumi Leaderboard 3 | Comparing LLMs on Japanese-language tasks. | Jaster, JBBQ, JtruthfulQA |
| LM Evaluation Harness | General framework to benchmark language models across many standard NLP benchmarks. Reference: LM Evaluation Harness | Evaluating English-centric LLMs and ensuring comparability with the research literature. | ARC, GSM8K, HellaSwag, HumanEval, IFEval, LAMBADA, MMLU, OpenBookQA, PIQA, SciQ, TruthfulQA, WinoGrande |
| VLM Evaluation Kit | Evaluate vision-language models (VLMs) on multimodal tasks. Reference: VLMEvalKit | Testing multimodal models. | ChartQA, DocVQA, InfoVQA, MTVQA, OCRBench |
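To give a feel for what the Standard suite's reference-based metrics measure, the sketch below scores a single prediction against a single reference locally. It is not part of the test suite itself; the package choices (`sacrebleu`, `rouge-score`, `rapidfuzz`) and the sample strings are assumptions made purely for illustration.

```python
# Illustrative sketch only: scoring one prediction against one reference with
# common open-source packages. These packages are assumed for demonstration
# and are not the evaluation backend of the test suite.
from sacrebleu.metrics import BLEU
from rouge_score import rouge_scorer
from rapidfuzz import fuzz

prediction = "The patient should take the medication twice daily."
reference = "The patient should take the medicine two times a day."

# BLEU: n-gram precision of the prediction against the reference.
bleu = BLEU(effective_order=True)
print("BLEU:", round(bleu.sentence_score(prediction, [reference]).score, 2))

# ROUGE-1 / ROUGE-2 / ROUGE-L: unigram, bigram, and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, prediction).items():
    print(name, round(score.fmeasure, 3))

# Fuzzy match: normalized edit-distance similarity on a 0-100 scale.
print("Fuzzy match:", fuzz.ratio(prediction, reference))
```

In an actual run, the test suite applies these metrics across every row of your dataset rather than to a single pair of strings.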