ToolBench is an instruction-tuning dataset for tool use, created automatically with ChatGPT. Specifically, the authors collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub, then prompt ChatGPT to generate diverse human instructions involving these APIs, covering both single-tool and multi-tool scenarios.
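The instruction-generation step described above can be pictured with a minimal sketch, assuming the openai Python package; the prompt wording, the sampling strategy, and the `apis` structure are illustrative placeholders, not ToolBench's actual pipeline code.

```python
import json
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_instructions(apis, n_apis=3, n_instructions=5):
    """Sample a few API descriptions and ask ChatGPT for user instructions
    that would require calling them (single-tool when n_apis == 1)."""
    sampled = random.sample(apis, n_apis)
    prompt = (
        "You are given the following RESTful APIs:\n"
        + json.dumps(sampled, indent=2)
        + f"\n\nWrite {n_instructions} diverse, natural user instructions "
        "that can only be solved by calling one or more of these APIs. "
        "Return them as a JSON list of strings."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```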
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction-tuning (SFT) data to facilitate building powerful LLMs with general tool-use capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs. We achieve this by collecting a high-quality instruction-tuning dataset, constructed automatically using the latest ChatGPT (gpt-3.5-turbo-16k), which has been upgraded with enhanced function-calling capabilities. We provide the dataset, the corresponding training and evaluation scripts, and a capable model, ToolLLaMA, fine-tuned on ToolBench.
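As a minimal sketch of the function-call capability mentioned above, the snippet below passes a single made-up tool schema to the model and reads back the call it proposes. It uses the openai Python package's legacy `functions`/`function_call` parameters (the interface of the gpt-3.5-turbo-16k era) and is not the project's actual construction code.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A single made-up tool schema; real ToolBench tools come from RapidAPI Hub.
functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather for a given city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}]

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user", "content": "What's the weather like in Berlin right now?"}],
    functions=functions,
    function_call="auto",
)

call = resp.choices[0].message.function_call
if call is not None:
    print(call.name, json.loads(call.arguments))  # chosen API and its arguments
```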
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive contains the Jupyter notebook and associated (image) files used in the June 2021 evaluation of the FAIR Workbench.
Our test data has undergone five rounds of manual inspection and correction by five senior algorithm researchers with years of experience in NLP, CV, and LLMs, taking about one month in total. It boasts extremely high quality and accuracy, with tightly connected multi-round missions of increasing difficulty, no unusable or invalid data, and a distribution fully consistent with real human usage. Its evaluation results and conclusions therefore provide a valuable reference for subsequent optimization in the Agent direction.
Specifically, the data quality optimization work went through the following stages:
The initial data was generated using our proposed Multi Agent Data Generation framework, covering all possible action spaces.
The test data was then divided according to the four different types of actions we defined and manually inspected and corrected by four different algorithm researchers. Specifically, since missions generated by the LLM tend to be overly formal and not colloquial enough, and since, especially after the second mission, it is difficult for the model to generate true multi-turn missions, we conducted the first round of corrections based on the criteria of colloquialism and true multi-turn missions. Notably, when designing the third- and fourth-round missions, we added missions requiring long-term memory, a true multi-turn type, to increase the difficulty of the test set.
Note: In the actual construction process, the four algorithm researchers adopted a layer-by-layer approach: first generating one layer of data with the model, then manually inspecting and correcting it, before generating and correcting the next layer (see the sketch below). This avoids the difficulty that arises when all layers are generated at once, where a problem found in one layer requires corrections that often affect both the previous and subsequent layers, making overall correctness and data coherence hard to guarantee. Our layer-by-layer construction therefore ensures strong logical consistency and close relationships between layers, without any unreasonable trajectories.
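A hypothetical sketch of this layer-by-layer loop is given below; `generate_layer` and `human_review` are placeholder names for the multi-agent generation step and the manual inspection/correction step, not functions from our codebase.

```python
def build_test_set(seed_tasks, n_layers=4):
    """Layer-by-layer construction: each layer is generated conditioned on
    already-verified earlier layers and is frozen only after human review."""
    verified_layers = []
    for layer_idx in range(n_layers):
        # Multi-agent generation step (placeholder): propose missions for this
        # layer using only layers that have already been inspected and corrected.
        candidates = generate_layer(seed_tasks, verified_layers, layer_idx)

        # Manual inspection and correction (placeholder): fixes made here can
        # never invalidate later layers, because those are not generated yet.
        corrected = human_review(candidates)

        verified_layers.append(corrected)
    return verified_layers
```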
After the first round of corrections by the four algorithm researchers, one senior expert in the Agent field commented on each piece of data, indicating whether it met the requirements and what problems remained, followed by a second round of corrections by the four algorithm researchers.
After the second round of corrections, we introduced cross-validation, in which the four algorithm researchers inspected and commented on each other's data. The four algorithm researchers and the senior expert in the Agent field then discussed any doubtful data and made a third round of corrections.
After the third round of corrections, the senior expert in the Agent field separately conducted a fourth round of inspection and correction on all data to ensure absolute accuracy.
Finally, since human corrections can themselves introduce errors, we used code to check for possible parameter type errors and unreasonable dependencies caused by manual edits (a sketch of such a check is shown below), with the senior expert making the final, fifth round of corrections.
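The automated final check can be pictured with the minimal sketch below; the record layout (`calls`, `args`, `depends_on`) and the schema format are assumptions for illustration, not our actual data format.

```python
# Map JSON-schema style type names to Python types (assumed schema format).
PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def check_record(record, tool_schemas):
    """Return a list of problems: argument type mismatches and calls that
    depend on steps which do not exist (or occur later)."""
    errors = []
    seen_ids = set()
    for call in record["calls"]:
        params = tool_schemas[call["tool"]]["parameters"]
        # 1) Parameter type errors possibly introduced by manual edits.
        for name, value in call["args"].items():
            if name not in params:
                errors.append(f"{call['id']}: unknown parameter '{name}'")
                continue
            expected = PY_TYPES.get(params[name]["type"], object)
            if not isinstance(value, expected):
                errors.append(f"{call['id']}: '{name}' should be {params[name]['type']}")
        # 2) Unreasonable dependencies: a call may only depend on earlier calls.
        for dep in call.get("depends_on", []):
            if dep not in seen_ids:
                errors.append(f"{call['id']}: depends on missing/later call '{dep}'")
        seen_ids.add(call["id"])
    return errors
```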
Through these five stages of data quality optimization, each piece of data was manually corrected and constructed by multiple algorithm experts, improving our test data's accuracy from less than 60% initially to 100% correctness. The combination of model generation and multiple human corrections also endowed our data with excellent diversity and quality.
At the same time, compared to other benchmarks such as BFCL and T-EVAL, our test data covers all possible action spaces, and in the second through fourth rounds of true multi-turn missions the coverage rate reaches 100%. This makes our data distribution very balanced and able to expose a model's weaknesses without any blind spots.
Ultimately, this high-quality data set we constructed laid the foundation for our subsequent experiments, lending absolute credibility to our conclusions.
Additionally, we provide bilingual support for the test data, with both English and Chinese versions, all of which have undergone the aforementioned manual inspection process. Subsequent leaderboard results will primarily report the English version.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
WorFBench: Benchmarking Agentic Workflow Generation
📄arXiv • 🤗HFPaper • 🌐Web • 🖥️Code • 📊Dataset
🌻Acknowledgement 🌟Overview 🔧Installation ✏️Model-Inference 📝Workflow-Generation 🤔Workflow-Evaluation
🌻Acknowledgement
The code of our training module is referenced and adapted from LLaMA-Factory, and the dataset is collected from ToolBench, ToolAlpaca, Lumos, WikiHow, Seal-Tools, Alfworld, Webshop, and IntercodeSql. Our end-to-end evaluation module is based on IPR… See the full description on the dataset page: https://huggingface.co/datasets/zjunlp/WorFBench_test.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
cpDNA reads were mapped to reference genomes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CAMARADES dataset. (XLSX 17704 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OHAT datasets. (XLSX 4022 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
UNEP EDCs. (XLSX 1432 kb)