9 datasets found
  1. ToolBench Dataset

    • paperswithcode.com
    Updated Aug 1, 2023
    Cite
    Yujia Qin; Shihao Liang; Yining Ye; Kunlun Zhu; Lan Yan; Yaxi Lu; Yankai Lin; Xin Cong; Xiangru Tang; Bill Qian; Sihan Zhao; Lauren Hong; Runchu Tian; Ruobing Xie; Jie zhou; Mark Gerstein; Dahai Li; Zhiyuan Liu; Maosong Sun (2024). ToolBench Dataset [Dataset]. https://paperswithcode.com/dataset/toolbench
    Explore at:
    Dataset updated
    Aug 1, 2023
    Authors
    Yujia Qin; Shihao Liang; Yining Ye; Kunlun Zhu; Lan Yan; Yaxi Lu; Yankai Lin; Xin Cong; Xiangru Tang; Bill Qian; Sihan Zhao; Lauren Hong; Runchu Tian; Ruobing Xie; Jie zhou; Mark Gerstein; Dahai Li; Zhiyuan Liu; Maosong Sun
    Description

    ToolBench is an instruction-tuning dataset for tool use, created automatically with ChatGPT. Specifically, the authors collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub, then prompt ChatGPT to generate diverse human instructions involving these APIs, covering both single-tool and multi-tool scenarios.
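    As a rough illustration of this construction recipe (not the authors' actual pipeline), the sketch below prompts a ChatGPT-style model to draft instructions over a small, made-up set of API descriptions; the API entries, field names, and prompt wording are illustrative assumptions.

    # Sketch only: generate candidate single-/multi-tool instructions from API metadata.
    # Assumes the openai>=1.0 Python SDK and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    apis = [  # hypothetical API metadata; ToolBench draws ~16k such APIs from RapidAPI Hub
        {"name": "weather.current", "description": "Get the current weather for a city"},
        {"name": "flights.search", "description": "Search flights between two airports"},
    ]

    api_text = "\n".join(f"- {a['name']}: {a['description']}" for a in apis)
    prompt = (
        "Given the following APIs, write three diverse user instructions. "
        "At least one instruction should require calling more than one API.\n" + api_text
    )

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)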

  2. ToolBench

    • opendatalab.com
    zip
    Updated Oct 3, 2023
    Cite
    Tsinghua University (2023). ToolBench [Dataset]. https://opendatalab.com/OpenDataLab/ToolBench
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 3, 2023
    Dataset provided by
    Renmin University of China
    Yale University
    Tsinghua University
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction-tuning (SFT) data to facilitate building powerful LLMs with general tool-use capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs. We achieve this by collecting a high-quality instruction-tuning dataset, constructed automatically using the latest ChatGPT (gpt-3.5-turbo-16k) with its enhanced function-call capabilities. We provide the dataset, the corresponding training and evaluation scripts, and ToolLLaMA, a capable model fine-tuned on ToolBench.
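    A minimal sketch of how one might start working with the downloaded archive, assuming it has been unzipped locally and contains JSON files of records; the directory name and record structure are assumptions, not the dataset's documented layout.

    # Sketch only: enumerate JSON files from a hypothetical extraction directory
    # and inspect one record's keys before writing any training code against it.
    import json
    from pathlib import Path

    data_dir = Path("ToolBench/data")  # hypothetical path; adjust to where the zip was extracted

    records = []
    for fp in sorted(data_dir.glob("*.json")):
        with fp.open(encoding="utf-8") as f:
            payload = json.load(f)
            # Files may hold either a list of records or a single dict; normalize to a list.
            records.extend(payload if isinstance(payload, list) else [payload])

    print(f"Loaded {len(records)} records from {data_dir}")
    if records:
        print(sorted(records[0].keys()))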

  3. Evaluation notebook and files for FAIR Workbench user evaluation

    • zenodo.org
    application/gzip
    Updated Jun 30, 2021
    Cite
    Tobias Kuhn; Tobias Kuhn; Remzi Celebi; Remzi Celebi; Robin Richardson; Robin Richardson (2021). Evaluation notebook and files for FAIR Workbench user evaluation [Dataset]. http://doi.org/10.5281/zenodo.5045448
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jun 30, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tobias Kuhn; Tobias Kuhn; Remzi Celebi; Remzi Celebi; Robin Richardson; Robin Richardson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This archive contains the Jupyter notebook and associated (image) files used in the June 2021 evaluation of the FAIR Workbench.

  4. MMTB Dataset

    • paperswithcode.com
    Updated Apr 2, 2025
    Cite
    Peijie Yu; Yifan Yang; Jinjian Li; Zelong Zhang; Haorui Wang; Xiao Feng; Feng Zhang (2025). MMTB Dataset [Dataset]. https://paperswithcode.com/dataset/mmtb
    Explore at:
    Dataset updated
    Apr 2, 2025
    Authors
    Peijie Yu; Yifan Yang; Jinjian Li; Zelong Zhang; Haorui Wang; Xiao Feng; Feng Zhang
    Description

    Our test data has undergone five rounds of manual inspection and correction by five senior algorithm researchers with years of experience in NLP, CV, and LLMs, taking about one month in total. It is of extremely high quality and accuracy, with tight connections between the multiple rounds of missions, increasing difficulty, no unusable invalid data, and a distribution fully consistent with human usage. Its evaluation results and conclusions are of great reference value for subsequent optimization in the agent direction.

    Specifically, the data quality optimization work went through the following stages:

    The initial data was generated using our proposed Multi Agent Data Generation framework, covering all possible action spaces.

    The test data was then divided according to four different types of actions defined by us and was manually inspected and corrected by four different algorithm researchers. Since missions generated by the LLM tend to be too formal and not colloquial enough, and it is difficult to generate true multi-turn missions, especially after the second mission, we conducted the first round of corrections using colloquialism and true multi-turn structure as the criteria. Notably, in designing the third- and fourth-round missions, we added missions requiring long-term memory, a true multi-turn type, to increase the difficulty of the test set.

    Note: In the actual construction process, the four algorithm researchers adopted a layer-by-layer approach: they first generated one layer of data with the model, then manually inspected and corrected it before generating and correcting the next layer. This avoids the difficulty that arises when all layers are generated at once, where a problem in one layer requires corrections that often affect both the previous and subsequent layers, making overall correctness and data coherence hard to maintain. Layer-by-layer construction thus ensures strong logical consistency and close relationships between layers, without any unreasonable trajectories.

    After the first round of corrections by the four algorithm researchers, a senior expert in the agent field commented on each piece of data, indicating whether it met the requirements and what problems remained, followed by a second round of corrections by the four algorithm researchers.

    After the second round of corrections, we introduced cross-validation, in which the four algorithm researchers inspected and commented on each other's data. The four researchers and the senior agent expert then discussed the doubtful data and made a third round of corrections.

    After the third round of corrections, the senior agent expert separately conducted a fourth round of inspection and correction on all data to ensure its accuracy.

    Finally, since human corrections might themselves introduce errors, we used code to check for parameter type errors and unreasonable dependencies introduced by manual edits (a sketch of this kind of check appears after this description), with the senior expert making a final, fifth round of corrections.

    Through these five stages of data quality optimization, each piece of data was manually constructed and corrected by multiple algorithm experts, improving the accuracy of our test data from less than 60% initially to 100%. The combination of model generation and multiple rounds of human correction also gives our data excellent diversity and quality.

    At the same time, compared to other benchmarks such as BFCL and T-EVAL, our test data covers all possible action spaces, and in the second through fourth rounds of true multi-turn missions the coverage rate also reaches 100%. This makes our data distribution very balanced and able to expose the model's weaknesses without blind spots.

    Ultimately, the high-quality dataset we constructed laid the foundation for our subsequent experiments and lends strong credibility to our conclusions.

    Additionally, we provide bilingual support for the test data, with both English and Chinese versions, all of which underwent the manual inspection process described above. Subsequent leaderboard results will primarily report the English version.
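    The automated check mentioned in the final correction stage (parameter types and call dependencies) could, in spirit, look like the following sketch. This is not the authors' script; the tool schemas, record fields, and error messages are illustrative assumptions.

    # Sketch only: flag parameter-type mismatches and forward references in a
    # hypothetical tool-call trajectory format.
    from typing import Any

    TOOL_SCHEMAS = {  # hypothetical schemas: parameter name -> expected Python type
        "search_flight": {"origin": str, "destination": str, "passengers": int},
        "book_flight": {"flight_id": str, "passengers": int},
    }

    def validate_trajectory(calls: list[dict[str, Any]]) -> list[str]:
        """Return human-readable problems found in a sequence of tool calls."""
        problems: list[str] = []
        seen_steps: set[Any] = set()
        for i, call in enumerate(calls):
            schema = TOOL_SCHEMAS.get(call["tool"])
            if schema is None:
                problems.append(f"step {i}: unknown tool {call['tool']!r}")
                continue
            for name, value in call.get("args", {}).items():
                expected = schema.get(name)
                if expected is None:
                    problems.append(f"step {i}: unexpected parameter {name!r}")
                elif not isinstance(value, expected):
                    problems.append(f"step {i}: parameter {name!r} should be {expected.__name__}")
            for dep in call.get("depends_on", []):
                if dep not in seen_steps:
                    problems.append(f"step {i}: depends on step {dep!r} that has not run yet")
            seen_steps.add(call.get("id", i))
        return problems

    # Example: the string "2" for an int parameter is reported as a problem.
    print(validate_trajectory([
        {"id": 0, "tool": "search_flight",
         "args": {"origin": "SFO", "destination": "JFK", "passengers": "2"}},
        {"id": 1, "tool": "book_flight",
         "args": {"flight_id": "UA100", "passengers": 2}, "depends_on": [0]},
    ]))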

  5. WorFBench_test

    • huggingface.co
    Updated Jun 17, 2025
    Cite
    ZJUNLP (2025). WorFBench_test [Dataset]. https://huggingface.co/datasets/zjunlp/WorFBench_test
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 17, 2025
    Dataset authored and provided by
    ZJUNLP
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    WorFBench: Benchmarking Agentic Workflow Generation

    Our training code is adapted from LLaMA-Factory, and the dataset is collected from ToolBench, ToolAlpaca, Lumos, WikiHow, Seal-Tools, Alfworld, Webshop, and IntercodeSql. Our end-to-end evaluation module is based on IPR… See the full description on the dataset page: https://huggingface.co/datasets/zjunlp/WorFBench_test.
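    A minimal sketch for pulling the benchmark with the Hugging Face datasets library; the available configurations, splits, and example fields are not shown in the excerpt above, so they are inspected rather than assumed.

    # Sketch only: load the dataset from the Hub and inspect its splits and fields.
    from datasets import load_dataset

    ds = load_dataset("zjunlp/WorFBench_test")
    print(ds)                          # lists the splits and their sizes
    first_split = next(iter(ds))       # take whichever split comes first
    print(ds[first_split][0])          # print one example to see its fields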

  6. Average coverage of cpDNA evaluated from selected conifers with CLC Genomics...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Leila do Nascimento Vieira; Helisson Faoro; Hugo Pacheco de Freitas Fraga; Marcelo Rogalski; Emanuel Maltempi de Souza; Fábio de Oliveira Pedrosa; Rubens Onofre Nodari; Miguel Pedro Guerra (2023). Average coverage of cpDNA evaluated from selected conifers with CLC Genomics Workbench 5.5 software. [Dataset]. http://doi.org/10.1371/journal.pone.0084792.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Leila do Nascimento Vieira; Helisson Faoro; Hugo Pacheco de Freitas Fraga; Marcelo Rogalski; Emanuel Maltempi de Souza; Fábio de Oliveira Pedrosa; Rubens Onofre Nodari; Miguel Pedro Guerra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    cpDNA reads were mapped to reference genomes.

  7. Additional file 2: of SWIFT-Review: a text-mining workbench for systematic...

    • springernature.figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Brian Howard; Jason Phillips; Kyle Miller; Arpit Tandon; Deepak Mav; Mihir Shah; Stephanie Holmgren; Katherine Pelch; Vickie Walker; Andrew Rooney; Malcolm Macleod; Ruchir Shah; Kristina Thayer (2023). Additional file 2: of SWIFT-Review: a text-mining workbench for systematic review [Dataset]. http://doi.org/10.6084/m9.figshare.c.3613058_D5.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Brian Howard; Jason Phillips; Kyle Miller; Arpit Tandon; Deepak Mav; Mihir Shah; Stephanie Holmgren; Katherine Pelch; Vickie Walker; Andrew Rooney; Malcolm Macleod; Ruchir Shah; Kristina Thayer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CAMARADES dataset. (XLSX 17704 kb)

  8. Additional file 1: of SWIFT-Review: a text-mining workbench for systematic...

    • springernature.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Brian Howard; Jason Phillips; Kyle Miller; Arpit Tandon; Deepak Mav; Mihir Shah; Stephanie Holmgren; Katherine Pelch; Vickie Walker; Andrew Rooney; Malcolm Macleod; Ruchir Shah; Kristina Thayer (2023). Additional file 1: of SWIFT-Review: a text-mining workbench for systematic review [Dataset]. http://doi.org/10.6084/m9.figshare.c.3613058_D2.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    figshare
    Authors
    Brian Howard; Jason Phillips; Kyle Miller; Arpit Tandon; Deepak Mav; Mihir Shah; Stephanie Holmgren; Katherine Pelch; Vickie Walker; Andrew Rooney; Malcolm Macleod; Ruchir Shah; Kristina Thayer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OHAT datasets. (XLSX 4022 kb)

  9. Additional file 6: of SWIFT-Review: a text-mining workbench for systematic...

    • springernature.figshare.com
    xlsx
    Updated Jun 4, 2023
    Cite
    Brian Howard; Jason Phillips; Kyle Miller; Arpit Tandon; Deepak Mav; Mihir Shah; Stephanie Holmgren; Katherine Pelch; Vickie Walker; Andrew Rooney; Malcolm Macleod; Ruchir Shah; Kristina Thayer (2023). Additional file 6: of SWIFT-Review: a text-mining workbench for systematic review [Dataset]. http://doi.org/10.6084/m9.figshare.c.3613058_D7.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    figshare
    Authors
    Brian Howard; Jason Phillips; Kyle Miller; Arpit Tandon; Deepak Mav; Mihir Shah; Stephanie Holmgren; Katherine Pelch; Vickie Walker; Andrew Rooney; Malcolm Macleod; Ruchir Shah; Kristina Thayer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    UNEP EDCs. (XLSX 1432 kb)
