100+ datasets found

example-generate-preference-dataset
huggingface.co
Updated Aug 23, 2024
+ more versions
Cite
distilabel-internal-testing (2024). example-generate-preference-dataset [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset
Explore at:
Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)
Dataset updated
Aug 23, 2024
Dataset authored and provided by
distilabel-internal-testing
Description
Dataset Card for example-preference-dataset

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/example-preference-dataset/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset.
Dataset example
kaggle.com
zip
Updated Apr 27, 2021
Cite
Javier Vallejos (2021). Dataset example [Dataset]. https://www.kaggle.com/javiervallejos/dataset-example
Explore at:
Available download formats: zip (38691 bytes)
Dataset updated
Apr 27, 2021
Authors
Javier Vallejos
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

This dataset was created only for making examples; every column was generated with random values. If you want to create a similar dataset, review this notebook.

Content

There are five columns: 'Country' (one of 'Bolivia', 'Argentina', 'Paraguay', 'Chile', 'Brazil', 'Peru'), 'Temperature', 'Humidity', 'Pm10', and 'Date'.
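As a rough illustration of the layout described above, the following Python sketch builds a few rows with the five listed columns filled with random values. The value ranges, the date window, and the function name `make_rows` are assumptions made for illustration only; the notebook the author mentions is not reproduced here.

```python
import random
from datetime import date, timedelta

# Country values as listed in the dataset description.
COUNTRIES = ["Bolivia", "Argentina", "Paraguay", "Chile", "Brazil", "Peru"]

def make_rows(n, seed=0):
    """Return n rows of random values for the five described columns."""
    rng = random.Random(seed)
    start = date(2021, 1, 1)  # assumed start date, not taken from the dataset
    rows = []
    for _ in range(n):
        rows.append({
            "Country": rng.choice(COUNTRIES),
            "Temperature": round(rng.uniform(-5.0, 40.0), 1),  # assumed range
            "Humidity": round(rng.uniform(0.0, 100.0), 1),     # assumed range
            "Pm10": round(rng.uniform(0.0, 150.0), 1),         # assumed range
            "Date": (start + timedelta(days=rng.randrange(365))).isoformat(),
        })
    return rows

rows = make_rows(5)
```

Passing a fixed `seed` keeps the generated example reproducible between runs.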
Dataset for fuzzzing data generation based on deep advisial learning
ieee-dataport.org
Updated May 18, 2022
Cite
Zhihui Li (2022). Dataset for fuzzzing data generation based on deep advisial learning [Dataset]. https://ieee-dataport.org/documents/dataset-fuzzzing-data-generation-based-deep-advisial-learning
Explore at:
Dataset updated
May 18, 2022
Authors
Zhihui Li
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was collected from an industrial control system running the Modbus protocol. It is used to train a deep adversarial learning model, which in turn generates fuzzing data in the same format as the real data. The data is a sequence of hexadecimal numbers. The generated data that follows was produced by the already trained model.
Generate Dataset
universe.roboflow.com
zip
Updated May 5, 2025
Cite
YoloProjectIVS (2025). Generate Dataset [Dataset]. https://universe.roboflow.com/yoloprojectivs/generate-3d288
Explore at:
Available download formats: zip
Dataset updated
May 5, 2025
Dataset authored and provided by
YoloProjectIVS
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Objects HRDh Bounding Boxes
Description
Generate

## Overview

Generate is a dataset for object detection tasks - it contains Objects HRDh annotations for 1,172 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
The code for generating and processing the dataset for load-displacement and...
figshare.com
txt
Updated Jan 19, 2018
Cite
Kheng Lim Goh (2018). The code for generating and processing the dataset for load-displacement and stress-strain [Dataset]. http://doi.org/10.6084/m9.figshare.5640649.v2
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.5640649.v2
Dataset updated
Jan 19, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Kheng Lim Goh
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
The code, strainenergy_v4_1.m, was used for generating and processing the dataset for load-displacement and stress-strain. Matlab version 6.1 was used to run the code. The parameters used to generate the current dataset are as follows:
• ip1: input file containing the load-displacement data
• diameter: fascicle diameter
• laststrainpt: an estimate of the strain at rupture, r
• orderpoly: an integer from 2 to 7 giving the order of the polynomial fitted to the data from O to q
• loadat1percent: y/n; whether to determine the load (set at 1% of the maximum load) at which the specimen became taut. ‘y’ denotes yes; ‘n’ denotes no.
The file logfile.txt contains the parameters used for deriving the values of the respective mechanical properties.
generate-quiz-dataset
huggingface.co
Updated Jul 20, 2024
+ more versions
Cite
Fauzan Rizky (2024). generate-quiz-dataset [Dataset]. https://huggingface.co/datasets/fauzanrrizky/generate-quiz-dataset
Explore at:
Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)
Dataset updated
Jul 20, 2024
Authors
Fauzan Rizky
Description
fauzanrrizky/generate-quiz-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Random Numbers
ieee-dataport.org
Updated Mar 14, 2023
Cite
Alexander Outman (2023). Random Numbers [Dataset]. https://ieee-dataport.org/documents/random-numbers
Explore at:
Dataset updated
Mar 14, 2023
Authors
Alexander Outman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset includes random numbers generated through various methods.
Method 1: shuf https://www.mankier.com/1/shuf
Commands used to generate dataset files:
$ shuf -i 1-1000000000 -n1000000 -o random-shuf.txt
$ shuf -i 1-1000000000000 -n1000000 -o random-shuf-1-1000000000000.txt
$ jot -r 1000000 1 1000000000000 > random-jot-1-1000000000000.txt
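The `shuf -i LO-HI -n N` commands above draw N distinct integers from a range. A minimal Python sketch of the same idea, using only the standard library, might look like the following. The function names and the small sample size are assumptions for illustration, and this is not how the dataset files were produced. The second helper allows repeated values, which is assumed here to be closer to what `jot -r` does.

```python
import random

def sample_unique(low, high, count, seed=None):
    """Draw `count` distinct integers from [low, high] (inclusive)."""
    rng = random.Random(seed)
    # range() is lazy, so a billion-wide interval is never materialized.
    return rng.sample(range(low, high + 1), count)

def sample_with_repeats(low, high, count, seed=None):
    """Draw `count` integers from [low, high]; values may repeat."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(count)]

# Small illustrative draw; the dataset itself used 1,000,000 values per file.
numbers = sample_unique(1, 1_000_000_000, 1000, seed=42)
repeats_allowed = sample_with_repeats(1, 1_000_000_000_000, 1000, seed=42)
```

Sampling without replacement requires `count` to be no larger than the size of the range; `random.sample` raises `ValueError` otherwise.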
my-dataset-generate
huggingface.co
Updated Jan 12, 2025
+ more versions
Cite
Bipul Sharma (2025). my-dataset-generate [Dataset]. https://huggingface.co/datasets/Bipul8765/my-dataset-generate
Explore at:
Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)
Dataset updated
Jan 12, 2025
Authors
Bipul Sharma
Description
Dataset Card for my-dataset-generate

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/Bipul8765/my-dataset-generate/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/Bipul8765/my-dataset-generate.
Invoices Dataset
kaggle.com
Updated Jan 18, 2022
Cite
Cankat Saraç (2022). Invoices Dataset [Dataset]. https://www.kaggle.com/datasets/cankatsrc/invoices/discussion
Explore at:
Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)
Dataset updated
Jan 18, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Cankat Saraç
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
The invoice dataset provided is a mock dataset generated using the Python Faker library. It has been designed to mimic the format of data collected from an online store. The dataset contains various fields, including first name, last name, email, product ID, quantity, amount, invoice date, address, city, and stock code. All of the data in the dataset is randomly generated and does not represent actual individuals or products. The dataset can be used for various purposes, including testing algorithms or models related to invoice management, e-commerce, or customer behavior analysis. The data in this dataset can be used to identify trends, patterns, or anomalies in online shopping behavior, which can help businesses to optimize their online sales strategies.
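The description says the data was generated with the Python Faker library. The sketch below does not use Faker; it is a standard-library stand-in that only illustrates the shape of one mock record with the listed fields. The key names, name pools, and value ranges are all hypothetical and are not taken from the dataset.

```python
import random
from datetime import date, timedelta

# Hypothetical pools; a real run would draw names from a library such as Faker.
FIRST = ["Ana", "Ben", "Chen", "Dana"]
LAST = ["Lopez", "Smith", "Wang", "Yilmaz"]
CITIES = ["Springfield", "Riverton", "Lakeside"]

def mock_invoice(rng):
    """Build one randomly generated invoice record (all values are fake)."""
    first, last = rng.choice(FIRST), rng.choice(LAST)
    qty = rng.randint(1, 9)
    day = date(2021, 1, 1) + timedelta(days=rng.randrange(365))
    return {
        "first_name": first,
        "last_name": last,
        "email": f"{first}.{last}@example.com".lower(),
        "product_id": rng.randint(100, 199),
        "qty": qty,
        "amount": round(qty * rng.uniform(1.0, 50.0), 2),
        "invoice_date": day.isoformat(),
        "address": f"{rng.randint(1, 999)} Main St",
        "city": rng.choice(CITIES),
        "stock_code": rng.randint(10_000_000, 99_999_999),
    }

rng = random.Random(7)
invoices = [mock_invoice(rng) for _ in range(3)]
```

Because every value is drawn at random, records like these are suitable for testing pipelines but carry no real signal about customers or products.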
CREATE: Multimodal Dataset for Unsupervised Learning and Generative Modeling...
ieee-dataport.org
Updated Jun 17, 2025
Cite
Simon Brodeur (2025). CREATE: Multimodal Dataset for Unsupervised Learning and Generative Modeling of Sensory Data from a Mobile Robot [Dataset]. https://ieee-dataport.org/open-access/create-multimodal-dataset-unsupervised-learning-and-generative-modeling-sensory-data
Explore at:
Dataset updated
Jun 17, 2025
Authors
Simon Brodeur
Description
The CREATE database is composed of 14 hours of multimodal recordings from a mobile robotic platform based on the iRobot Create.
Search strings used to generate citation counts for three data sets in WoS,...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Mar 26, 2014
Cite
Belter, Christopher W. (2014). Search strings used to generate citation counts for three data sets in WoS, publishers' full text websites, and Google Scholar. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001239723
Explore at:
Dataset updated
Mar 26, 2014
Authors
Belter, Christopher W.
Description
Search strings used to generate citation counts for three data sets in WoS, publishers' full text websites, and Google Scholar.
The big model fine-tuning data set of five key elements of tourism resources...
scidb.cn
Updated Oct 17, 2024
Cite
lu bao qing; Chen Min; Wan Fucheng; Yu Hongzhi (2024). The big model fine-tuning data set of five key elements of tourism resources in the five northwestern provinces in 2024 [Dataset]. http://doi.org/10.57760/sciencedb.j00001.01088
Explore at:
Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)
Unique identifier
https://doi.org/10.57760/sciencedb.j00001.01088
Dataset updated
Oct 17, 2024
Dataset provided by
Science Data Bank
Authors
lu bao qing; Chen Min; Wan Fucheng; Yu Hongzhi
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
With the wide application of large models in various fields, the tourism industry has a growing demand for high-quality datasets. This dataset focuses on textual data in the tourism domain and is designed to support fine-tuning tasks for tourism-oriented large models, aiming to enhance the model's ability to understand and generate tourism-related information. The diversity and quality of the dataset are critical to the model's performance. Therefore, this study combines web scraping and manual annotation techniques, along with data cleaning, denoising, and stopword removal, to ensure high data quality and accuracy. Additionally, automated annotation tools are used to generate instructions and perform consistency checks on the texts. The LLM-Tourism dataset primarily relies on data from Ctrip and Baidu Baike, covering five Northwestern Chinese provinces: Gansu, Ningxia, Qinghai, Shaanxi, and Xinjiang, containing 53,280 pairs of structured data in JSON format. The creation of this dataset will not only improve the generation accuracy of tourism large models but also contribute to the sharing and application of tourism-related datasets in the field of large models.
Rule-based Synthetic Data for Japanese GEC
live.european-language-grid.eu
data.niaid.nih.gov
+1 more
tsv
Updated Oct 28, 2023
Cite
(2023). Rule-based Synthetic Data for Japanese GEC [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7679
Explore at:
Available download formats: tsv
Dataset updated
Oct 28, 2023
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Title: Rule-based Synthetic Data for Japanese GEC.
Dataset Contents: This dataset contains two parallel corpora intended for the training and evaluating of models for the NLP (natural language processing) subtask of Japanese GEC (grammatical error correction). These are as follows:
Synthetic Corpus - synthesized_data.tsv. This corpus file contains 2,179,130 parallel sentence pairs synthesized using the process described in [1]. Each line of the file consists of two sentences delimited by a tab. The first sentence is the erroneous sentence while the second is the corresponding correction.
These paired sentences are derived from data scraped from the keyword-lookup site
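The description states that each line of synthesized_data.tsv holds an erroneous sentence and its correction, separated by a tab. A minimal sketch of reading such a file in Python could look like this. The function name `read_pairs` and the placeholder sample text are assumptions for illustration, and skipping lines that do not split into exactly two fields is a design choice of this sketch, not something the dataset documents.

```python
import csv
import io

def read_pairs(handle):
    """Yield (erroneous, corrected) tuples from tab-delimited lines."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) == 2:  # skip malformed lines rather than guess
            yield row[0], row[1]

# Placeholder content standing in for a real line of the corpus.
sample = io.StringIO("erroneous sentence\tcorrected sentence\n")
pairs = list(read_pairs(sample))
```

With the real file, the same function could be given a handle from `open("synthesized_data.tsv", encoding="utf-8", newline="")`; iterating lazily avoids loading all of the roughly 2.18 million pairs at once.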
Appendix A. Parameter values used to generate expected value data sets.
datasetcatalog.nlm.nih.gov
wiley.figshare.com
Updated Aug 9, 2016
Cite
Bailey, Larissa L.; Kendall, William L.; Converse, Sarah J. (2016). Appendix A. Parameter values used to generate expected value data sets. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001582396
Explore at:
Dataset updated
Aug 9, 2016
Authors
Bailey, Larissa L.; Kendall, William L.; Converse, Sarah J.
Description
Parameter values used to generate expected value data sets.
Connector Generate Dataset Dataset
universe.roboflow.com
zip
Updated Feb 23, 2023
Cite
yang junzhi (2023). Connector Generate Dataset Dataset [Dataset]. https://universe.roboflow.com/yang-junzhi/connector-generate-dataset/dataset/4
Explore at:
Available download formats: zip
Dataset updated
Feb 23, 2023
Dataset authored and provided by
yang junzhi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Defect Bounding Boxes
Description
Connector Generate Dataset

## Overview

Connector Generate Dataset is a dataset for object detection tasks - it contains Defect annotations for 255 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Dataset for: Simulation and data-generation for random-effects network...
wiley.figshare.com
txt
Updated Jun 1, 2023
Cite
Svenja Elisabeth Seide; Katrin Jensen; Meinhard Kieser (2023). Dataset for: Simulation and data-generation for random-effects network meta-analysis of binary outcome [Dataset]. http://doi.org/10.6084/m9.figshare.8001863.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.8001863.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Wiley
Authors
Svenja Elisabeth Seide; Katrin Jensen; Meinhard Kieser
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The performance of statistical methods is frequently evaluated by means of simulation studies. In the case of network meta-analysis of binary data, however, available data-generating models are restricted to either the inclusion of two-armed trials or the fixed-effect model. Based on data generation in the pairwise case, we propose a framework for the simulation of random-effect network meta-analyses including multi-arm trials with binary outcome. The only common data-generating model that is directly applicable to a random-effects network setting uses strongly restrictive assumptions. To overcome these limitations, we modify this approach and derive a related simulation procedure using odds ratios as the effect measure. The performance of this procedure is evaluated with synthetic data and in an empirical example.
Ai Generate Detection Dataset
universe.roboflow.com
zip
Updated Feb 19, 2025
Cite
Blaze Warriors (2025). Ai Generate Detection Dataset [Dataset]. https://universe.roboflow.com/blaze-warriors/ai-generate-detection/dataset/1
Explore at:
Available download formats: zip
Dataset updated
Feb 19, 2025
Dataset authored and provided by
Blaze Warriors
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
AI Human
Description
AI Generate Detection

## Overview

AI Generate Detection is a dataset for classification tasks - it contains AI Human annotations for 9,900 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Dataset Dog Tail After Generate 1 Class Tail Dataset
universe.roboflow.com
zip
Updated Apr 27, 2025
Cite
25425 (2025). Dataset Dog Tail After Generate 1 Class Tail Dataset [Dataset]. https://universe.roboflow.com/25425/dataset-dog-tail-after-generate-1-class-tail
Explore at:
Available download formats: zip
Dataset updated
Apr 27, 2025
Dataset authored and provided by
25425
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Dogs 6dBV Bounding Boxes
Description
Dataset Dog Tail After Generate 1 Class Tail

## Overview

Dataset Dog Tail After Generate 1 Class Tail is a dataset for object detection tasks - it contains Dogs 6dBV annotations for 2,252 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Data used by EPA researchers to generate illustrative figures for overview...
datasets.ai
s.cnmilf.com
+1 more
57
Updated Sep 11, 2024
Cite
U.S. Environmental Protection Agency (2024). Data used by EPA researchers to generate illustrative figures for overview article "Multiscale Modeling of Background Ozone: Research Needs to Inform and Improve Air Quality Management" [Dataset]. https://datasets.ai/datasets/data-used-by-epa-researchers-to-generate-illustrative-figures-for-overview-article-multisc
Explore at:
Available download formats: 57
Dataset updated
Sep 11, 2024
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Authors
U.S. Environmental Protection Agency
Description
Data sets used to prepare illustrative figures for the overview article “Multiscale Modeling of Background Ozone”

Overview

The CMAQ model output datasets used to create illustrative figures for this overview article were generated by scientists in EPA/ORD/CEMM and EPA/OAR/OAQPS.

The EPA/ORD/CEMM-generated dataset consisted of hourly CMAQ output from two simulations. The first simulation was performed for July 1 – 31 over a 12 km modeling domain covering the Western U.S. The simulation was configured with the Integrated Source Apportionment Method (ISAM) to estimate the contributions from 9 source categories to modeled ozone. ISAM source contributions for July 17 – 31 averaged over all grid cells located in Colorado were used to generate the illustrative pie chart in the overview article. The second simulation was performed for October 1, 2013 – August 31, 2014 over a 108 km modeling domain covering the northern hemisphere. This simulation was also configured with ISAM to estimate the contributions from non-US anthropogenic sources, natural sources, stratospheric ozone, and other sources on ozone concentrations. Ozone ISAM results from this simulation were extracted along a boundary curtain of the 12 km modeling domain specified over the Western U.S. for the time period January 1, 2014 – July 31, 2014 and used to generate the illustrative time-height cross-sections in the overview article.

The EPA/OAR/OAQPS-generated dataset consisted of hourly gridded CMAQ output for surface ozone concentrations for the year 2016. The CMAQ simulations were performed over the northern hemisphere at a horizontal resolution of 108 km. NO2 and O3 data for July 2016 were extracted from these simulations to generate the vertically-integrated column densities shown in the illustrative comparison to satellite-derived column densities.

CMAQ Model Data

The data from the CMAQ model simulations used in this research effort are very large (several terabytes) and cannot be uploaded to ScienceHub due to size restrictions. The model simulations are stored on the /asm archival system accessible through the atmos high-performance computing (HPC) system. Due to data management policies, files on /asm are subject to expiry depending on the template of the project. Files not requested for extension after the expiry date are deleted permanently from the system. The format of the files used in this analysis and listed below is ioapi/netcdf. Documentation of this format, including definitions of the geographical projection attributes contained in the file headers, is available at https://www.cmascenter.org/ioapi/

Documentation on the CMAQ model, including a description of the output file format and output model species can be found in the CMAQ documentation on the CMAQ GitHub site at https://github.com/USEPA/CMAQ.

This dataset is associated with the following publication: Hogrefe, C., B. Henderson, G. Tonnesen, R. Mathur, and R. Matichuk. Multiscale Modeling of Background Ozone: Research Needs to Inform and Improve Air Quality Management. EM Magazine. Air and Waste Management Association, Pittsburgh, PA, USA, 1-6, (2020).
Generate Ray Dataset
universe.roboflow.com
zip
Updated Dec 22, 2024
Cite
test (2024). Generate Ray Dataset [Dataset]. https://universe.roboflow.com/test-szbyx/generate-ray/dataset/2
Explore at:
Available download formats: zip
Dataset updated
Dec 22, 2024
Dataset authored and provided by
test
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
0 1 2 3 4 KH9O Bounding Boxes
Description
Generate Ray

## Overview

Generate Ray is a dataset for object detection tasks - it contains 0 1 2 3 4 KH9O annotations for 279 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
