ToolBench is an instruction-tuning dataset for tool use, created automatically with ChatGPT. Specifically, the authors collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub, then prompt ChatGPT to generate diverse human instructions involving these APIs, covering both single-tool and multi-tool scenarios.
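The instruction-generation step described above can be pictured with a minimal sketch, assuming the openai Python package; the prompt wording, the sampling strategy, and the `apis` structure are illustrative placeholders, not ToolBench's actual pipeline code.

```python
import json
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_instructions(apis, n_apis=3, n_instructions=5):
    """Sample a few API descriptions and ask ChatGPT for user instructions
    that would require calling them (single-tool when n_apis == 1)."""
    sampled = random.sample(apis, n_apis)
    prompt = (
        "You are given the following RESTful APIs:\n"
        + json.dumps(sampled, indent=2)
        + f"\n\nWrite {n_instructions} diverse, natural user instructions "
        "that can only be solved by calling one or more of these APIs. "
        "Return them as a JSON list of strings."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```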
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction-tuning (SFT) data to facilitate building powerful LLMs with general tool-use capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs. We achieve this by collecting a high-quality instruction-tuning dataset, constructed automatically using the latest ChatGPT (gpt-3.5-turbo-16k), which has been upgraded with enhanced function-calling capabilities. We provide the dataset, the corresponding training and evaluation scripts, and a capable model, ToolLLaMA, fine-tuned on ToolBench.
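As a minimal sketch of the function-call capability mentioned above, the snippet below passes a single made-up tool schema to the model and reads back the call it proposes. It uses the openai Python package's legacy `functions`/`function_call` parameters (the interface of the gpt-3.5-turbo-16k era) and is not the project's actual construction code.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A single made-up tool schema; real ToolBench tools come from RapidAPI Hub.
functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather for a given city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}]

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user", "content": "What's the weather like in Berlin right now?"}],
    functions=functions,
    function_call="auto",
)

call = resp.choices[0].message.function_call
if call is not None:
    print(call.name, json.loads(call.arguments))  # chosen API and its arguments
```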
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive contains the Jupyter notebook and associated (image) files used in the June 2021 evaluation of the FAIR Workbench.
Our test data has undergone five rounds of manual inspection and correction by five senior algorithm researchers with years of experience in NLP, CV, and LLMs, taking about one month in total. It boasts extremely high quality and accuracy, with tightly connected multi-round missions of increasing difficulty, no unusable or invalid data, and a distribution fully consistent with real human usage. Its evaluation results and conclusions therefore provide a valuable reference for subsequent optimization in the Agent direction.
Specifically, the data quality optimization work went through the following stages:
The initial data was generated using our proposed Multi Agent Data Generation framework, covering all possible action spaces.
The test data was then divided according to the four different types of actions we defined and manually inspected and corrected by four different algorithm researchers. Specifically, since missions generated by the LLM tend to be overly formal and not colloquial enough, and since, especially after the second mission, it is difficult for the model to generate true multi-turn missions, we conducted the first round of corrections based on the criteria of colloquialism and true multi-turn missions. Notably, when designing the third- and fourth-round missions, we added missions requiring long-term memory, a true multi-turn type, to increase the difficulty of the test set.
Note: In the actual construction process, the four algorithm researchers adopted a layer-by-layer approach: first generating one layer of data with the model, then manually inspecting and correcting it, before generating and correcting the next layer (see the sketch below). This avoids the difficulty that arises when all layers are generated at once, where a problem found in one layer requires corrections that often affect both the previous and subsequent layers, making overall correctness and data coherence hard to guarantee. Our layer-by-layer construction therefore ensures strong logical consistency and close relationships between layers, without any unreasonable trajectories.
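A hypothetical sketch of this layer-by-layer loop is given below; `generate_layer` and `human_review` are placeholder names for the multi-agent generation step and the manual inspection/correction step, not functions from our codebase.

```python
def build_test_set(seed_tasks, n_layers=4):
    """Layer-by-layer construction: each layer is generated conditioned on
    already-verified earlier layers and is frozen only after human review."""
    verified_layers = []
    for layer_idx in range(n_layers):
        # Multi-agent generation step (placeholder): propose missions for this
        # layer using only layers that have already been inspected and corrected.
        candidates = generate_layer(seed_tasks, verified_layers, layer_idx)

        # Manual inspection and correction (placeholder): fixes made here can
        # never invalidate later layers, because those are not generated yet.
        corrected = human_review(candidates)

        verified_layers.append(corrected)
    return verified_layers
```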
After the first round of corrections by the four algorithm researchers, one senior expert in the Agent field commented on each piece of data, indicating whether it met the requirements and what problems remained, followed by a second round of corrections by the four algorithm researchers.
After the second round of corrections, we introduced cross-validation, in which the four algorithm researchers inspected and commented on each other's data. The four algorithm researchers and the senior expert in the Agent field then discussed any doubtful data and made a third round of corrections.
After the third round of corrections, the senior expert in the Agent field separately conducted a fourth round of inspection and correction on all data to ensure absolute accuracy.
Finally, since human corrections can themselves introduce errors, we used code to check for possible parameter type errors and unreasonable dependencies caused by manual edits (a sketch of such a check is shown below), with the senior expert making the final, fifth round of corrections.
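The automated final check can be pictured with the minimal sketch below; the record layout (`calls`, `args`, `depends_on`) and the schema format are assumptions for illustration, not our actual data format.

```python
# Map JSON-schema style type names to Python types (assumed schema format).
PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def check_record(record, tool_schemas):
    """Return a list of problems: argument type mismatches and calls that
    depend on steps which do not exist (or occur later)."""
    errors = []
    seen_ids = set()
    for call in record["calls"]:
        params = tool_schemas[call["tool"]]["parameters"]
        # 1) Parameter type errors possibly introduced by manual edits.
        for name, value in call["args"].items():
            if name not in params:
                errors.append(f"{call['id']}: unknown parameter '{name}'")
                continue
            expected = PY_TYPES.get(params[name]["type"], object)
            if not isinstance(value, expected):
                errors.append(f"{call['id']}: '{name}' should be {params[name]['type']}")
        # 2) Unreasonable dependencies: a call may only depend on earlier calls.
        for dep in call.get("depends_on", []):
            if dep not in seen_ids:
                errors.append(f"{call['id']}: depends on missing/later call '{dep}'")
        seen_ids.add(call["id"])
    return errors
```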
Through these five stages of data quality optimization, each piece of data was manually corrected and constructed by multiple algorithm experts, improving our test data's accuracy from less than 60% initially to 100% correctness. The combination of model generation and multiple human corrections also endowed our data with excellent diversity and quality.
At the same time, compared to other benchmarks such as BFCL and T-EVAL, our test data covers all possible action spaces, and in the second through fourth rounds of true multi-turn missions the coverage rate reaches 100%. This makes our data distribution very balanced and able to expose a model's weaknesses without any blind spots.
Ultimately, this high-quality data set we constructed laid the foundation for our subsequent experiments, lending absolute credibility to our conclusions.
Additionally, we provide bilingual support for the test data, with both English and Chinese versions, all of which have undergone the aforementioned manual inspection process. Subsequent leaderboard results will primarily report the English version.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
WorFBench: Benchmarking Agentic Workflow Generation
📄arXiv • 🤗HFPaper • 🌐Web • 🖥️Code • 📊Dataset
🌻Acknowledgement 🌟Overview 🔧Installation ✏️Model-Inference 📝Workflow-Generation 🤔Workflow-Evaluation
🌻Acknowledgement
The code of our training module is referenced and adapted from LLaMA-Factory, and the dataset is collected from ToolBench, ToolAlpaca, Lumos, WikiHow, Seal-Tools, Alfworld, Webshop, and IntercodeSql. Our end-to-end evaluation module is based on IPR… See the full description on the dataset page: https://huggingface.co/datasets/zjunlp/WorFBench_test.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
cpDNA reads were mapped to reference genomes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CAMARADES dataset. (XLSX 17704 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
OHAT datasets. (XLSX 4022 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
UNEP EDCs. (XLSX 1432 kb)