Create new job
Updated on 10 Jun 2025

Access the Model Testing service, open the Test Jobs tab, and click the "Create New Job" button.

Step 1: Select a model


  • Model Source: Select where your model is hosted

    You can choose between two model sources when configuring your setup:

    • Catalog
      • When Catalog is selected as the source, the system will display a list of all available models from the Model Catalog.
      • This option is suitable when you want to browse or select from predefined, public models.
    • Private Model
      • When Private Model is selected, only your organization’s custom or privately uploaded models will be shown.
      • Select a model and make sure to select the correct version you want to test.
      • Use this option if you want to work with models that are not publicly listed in the catalog.
  • Model Name: Select the model that you want to test

Step 2: Test suite settings


A test suite is a collection of test cases designed to evaluate the performance, accuracy, and stability of a machine learning model.

  • Test suite: Select an evaluation benchmark or leaderboard task to auto-fill test settings

    Support is currently limited to the Nejumi Leaderboard 3, which benchmarks Japanese LLMs on language skills and alignment using diverse datasets for performance and safety evaluation.

  • Tasks: Choose the specific tasks within the test suite that you selected.

    Nejumi Leaderboard 3 provides tasks such as llm-jp-eval (jaster), JBBQ, and JTruthfulQA. You can choose to view results for all tasks or select a single task.

    • llm-jp-eval (jaster): A benchmark dataset designed to evaluate Japanese language models on a wide range of general language processing tasks, including reading comprehension, reasoning, and semantic understanding.
    • JBBQ: A dataset focused on detecting and measuring biases in Japanese language models, evaluating fairness and sensitivity to biased or harmful content.
    • JTruthfulQA: A QA dataset designed to measure the truthfulness of a Japanese language model's outputs.

Hover over each option to view detailed information about each task.

Step 3: Set parameters


  • Log samples: Choose whether to save the model's inputs and outputs for review. Default: True
  • Max tokens: Set the maximum number of tokens that the model can generate. Default: 1024
  • Few-shot: Set the number of few-shot examples to place in context. Default: 0
  • Temperature: Set the randomness level for the model's output. Default: 0.00
  • Repetition penalty: Adjust how much to penalize repeated tokens to encourage diverse output. Default: 1.00
  • Seed: Enter a random seed to ensure consistent results across runs. Default: 1308
  • Top-K: Limit token selection to the top-k most likely options. Default: -1
  • Top-P: Set the probability threshold for selecting the next token. Default: 1.00
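These settings correspond to standard LLM sampling parameters. As an illustration only (the field names below are assumptions, not the platform's actual API), the defaults could be collected into a request payload like this:

```python
# Hypothetical sketch of the default test-job parameters as a request payload.
# The key names are assumptions for illustration; they are not the documented
# field names of the Model Testing API.
default_params = {
    "log_samples": True,         # save the model's inputs/outputs for review
    "max_tokens": 1024,          # maximum number of tokens the model can generate
    "num_fewshot": 0,            # few-shot examples placed in context
    "temperature": 0.00,         # randomness level (0 = most deterministic)
    "repetition_penalty": 1.00,  # 1.0 applies no penalty to repeated tokens
    "seed": 1308,                # fixed seed for consistent results across runs
    "top_k": -1,                 # -1 conventionally disables top-k filtering
    "top_p": 1.00,               # probability threshold of 1.0 = no cutoff
}
```

With temperature 0.00 and a fixed seed, repeated runs of the same job should produce consistent outputs, which is usually what you want for benchmarking.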

Step 4: Select GPU configuration


Select the GPU configuration for running your test job.

  • 1 x GPU NVIDIA H100 SXM5 (16CPU - 192GB RAM - 1xH100)
  • 2 x GPU NVIDIA H100 SXM5 (32CPU - 384GB RAM - 2xH100)
  • 4 x GPU NVIDIA H100 SXM5 (64CPU - 768GB RAM - 4xH100)
  • 8 x GPU NVIDIA H100 SXM5 (128CPU - 1536GB RAM - 8xH100)

Step 5: Finish & Review


  • Enter Job Name
    • Default format: ft_[base model]_[timestamp]
    • Editable with a 50-character limit
  • Enter Job Description (Max 200 characters)
  • Notification: Choose how you want to receive run results - email
  • Click “Save”
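The default name follows the ft_[base model]_[timestamp] pattern described above. A minimal sketch of how such a name could be generated (the helper and the exact timestamp format are assumptions, not the platform's documented behavior):

```python
from datetime import datetime

def default_job_name(base_model: str) -> str:
    """Hypothetical helper illustrating the ft_[base model]_[timestamp] format.

    The timestamp format below is an assumption; the platform's actual
    format is not documented here.
    """
    timestamp = datetime.now().strftime("%Y%m%d-%H%M")
    # Job names are editable with a 50-character limit, so truncate.
    return f"ft_{base_model}_{timestamp}"[:50]
```

For example, default_job_name("llama3") yields a name beginning with "ft_llama3_" followed by the timestamp, truncated to 50 characters.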

You can manually run the job by clicking the Run button.
