First of all, prepare the best dataset you can: dataset quality directly impacts how well the model performs on your intended use case.
Here’s how to ensure good dataset quality:
Collect examples to target remaining issues.
Scrutinize existing examples for issues.
Consider the balance and diversity of data.
Make sure your training examples contain all of the information needed for the response.
Look at the agreement and consistency in the training examples.
Make sure all of your training examples are in the same format, as expected for inference (a quick consistency check is sketched after this list).
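Before uploading, it can help to run a quick consistency check over your examples. This is a minimal sketch, assuming an Alpaca-style JSONL file with instruction/input/output keys; the file name and key set are illustrative, so adjust them to the format you actually use:

```python
import json
from collections import Counter

# Assumed Alpaca-style keys; change this to match your chosen data format.
EXPECTED_KEYS = {"instruction", "input", "output"}

def check_examples(path):
    """Flag records that are invalid JSON, missing fields, empty, or inconsistent."""
    key_sets = Counter()
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((lineno, "not valid JSON"))
                continue
            keys = frozenset(record)
            key_sets[keys] += 1
            missing = EXPECTED_KEYS - keys
            if missing:
                problems.append((lineno, f"missing keys: {sorted(missing)}"))
            if not str(record.get("output", "")).strip():
                problems.append((lineno, "empty output"))
    if len(key_sets) > 1:
        problems.append((0, "records do not all share the same key set"))
    return problems

for lineno, issue in check_examples("train.jsonl"):  # hypothetical file name
    print(lineno, issue)
```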
There are two ways to provide the training data and evaluation data:
Upload a file
Upload file is the default option.
Choose a local file from your computer.
(Optional) Click Download sample to see an example of the expected format.
Notice: Ensure the file matches the selected data format. (Example records for each format are sketched after the table below.)
Trainer | Supported data format | Supported file format | Supported file size |
---|---|---|---|
SFT | Alpaca | CSV, JSON, JSONLINES, ZIP, PARQUET | Limit 100MB |
SFT | ShareGPT | JSON, JSONLINES, ZIP, PARQUET | Limit 100MB |
SFT | ShareGPT_Image | ZIP, PARQUET | Limit 100MB |
DPO | ShareGPT | JSON, JSONLINES, ZIP, PARQUET | Limit 100MB |
Pre-training | Corpus | TXT, JSON, JSONLINES, ZIP, PARQUET | Limit 100MB |
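If you are unsure what a record should look like, the Download sample file is the authoritative reference. As a rough illustration only, the sketch below writes one record per data format using the common community conventions for Alpaca, ShareGPT, and a plain-text corpus; the exact field names the trainer expects (especially the DPO preference keys) are assumptions, so verify them against the downloaded sample:

```python
import json

# SFT, Alpaca format: one instruction/input/output triple per record (assumed keys).
alpaca_sft = {
    "instruction": "Summarize the text.",
    "input": "Fine-tuning adapts a base model to a specific task.",
    "output": "Fine-tuning specializes a model for a task.",
}

# SFT, ShareGPT format: a list of conversation turns (assumed keys).
sharegpt_sft = {
    "conversations": [
        {"from": "human", "value": "What is supervised fine-tuning?"},
        {"from": "gpt", "value": "Training a model on prompt-response pairs."},
    ],
}

# DPO, ShareGPT format: a prompt plus a chosen/rejected preference pair (assumed keys).
sharegpt_dpo = {
    "conversations": [{"from": "human", "value": "Explain overfitting briefly."}],
    "chosen": {"from": "gpt", "value": "The model memorizes training data and generalizes poorly."},
    "rejected": {"from": "gpt", "value": "Overfitting means the model is too small."},
}

# Pre-training, Corpus format: raw text, one document per record (assumed key).
corpus_pretraining = {
    "text": "Raw unlabeled text used for continued pre-training.",
}

# Write each example as a JSONLINES file: one JSON object per line.
records = {
    "alpaca_sft": alpaca_sft,
    "sharegpt_sft": sharegpt_sft,
    "sharegpt_dpo": sharegpt_dpo,
    "corpus_pretraining": corpus_pretraining,
}
for name, record in records.items():
    with open(f"{name}.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Whichever format you use, keep each file under the 100MB limit from the table above; ShareGPT_Image data is not sketched here since it is packaged as ZIP or PARQUET.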
Connect to Data Hub
Click Data Hub.
Select a connection or dataset from the Data Hub.
Notice: Ensure the dataset is compatible with the selected format.
(Optional) Click Open Data Hub to preview or manage datasets.
(Optional) Click the Reload icon to refresh the connection and dataset list.
For detailed instructions, see the Data Hub guide.