Final Data - LLaMA Fine-Tuning Dataset
This dataset is prepared for fine-tuning the meta-llama/Llama-2-7b-hf model using the TRL SFTTrainer.
Structure
train.json: Training examples in JSON format
validation.json: Validation examples
test.json: Optional test examples
Format
Each file contains a list of items with this format:

{ "text": "Your training sample here..." }

The dataset can be loaded with:

from datasets import load_dataset
dataset = load_dataset("csenaeem/final_data")
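A minimal fine-tuning sketch follows, assuming TRL 0.9 or later (where SFTConfig exists), access to the gated Llama-2 weights, and that the JSON files load as train/validation splits; keyword names such as max_seq_length have shifted between TRL releases, so treat this as an outline rather than a fixed recipe:

# Sketch: fine-tune Llama-2-7b on this dataset with the TRL SFTTrainer.
# Assumptions: TRL >= 0.9, gated Llama-2 access granted, and a "train"
# split produced by loading train.json.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("csenaeem/final_data")

config = SFTConfig(
    output_dir="./llama2-sft",        # hypothetical output path
    dataset_text_field="text",        # matches the {"text": ...} format above
    max_seq_length=1024,              # renamed in some TRL releases; check your version
    num_train_epochs=3,
    per_device_train_batch_size=2,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # SFTTrainer accepts a model id string
    args=config,
    train_dataset=dataset["train"],
)
trainer.train()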
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenAssistant TOP-1 English Conversations
This is a twice-filtered dataset derived from oasst2, a set of conversation trees collected by the OpenAssistant project. It was first filtered to the top-ranked branch in each conversation tree, forming blancsw/oasst2_top1_chat_format. It was then filtered down to English only and to a single 'messages' data column. This allows the dataset to be input directly to the HuggingFace SFTTrainer (provided your tokenizer has a chat template)… See the full description on the dataset page: https://huggingface.co/datasets/Trelis/oasst2_top1_chat_format_en.
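As a sketch of what that means in practice, the 'messages' column can be rendered into a training string with apply_chat_template before (or inside) SFTTrainer. The tokenizer below is an assumption for illustration; any tokenizer that ships a chat template will do:

# Sketch: render one conversation from the "messages" column.
# Assumptions: a "train" split exists, and the chosen tokenizer ships a
# chat template (the Llama-2 -chat variant does; it is a gated model).
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("Trelis/oasst2_top1_chat_format_en")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Turn one conversation (a list of role/content dicts) into a single string.
text = tok.apply_chat_template(ds["train"][0]["messages"], tokenize=False)
print(text)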
Dataset Description
Abstract
The Abstract Paper Reviews Dataset is designed for training machine learning models to generate reviews of academic papers from a paper's title and abstract. It is formatted in a conversational style, allowing direct use with trainers like the SFTTrainer without additional parsing or conversion into a chat template. This dataset enables the development of models that can assist in peer review processes by providing… See the full description on the dataset page: https://huggingface.co/datasets/travis0103/abstract_paper_review.
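A minimal loading sketch; the split name and record layout below are assumptions drawn from the description above, not a confirmed schema:

# Sketch: load and inspect the dataset. Assumptions: a "train" split exists
# and records are already in a conversational format, per the description.
from datasets import load_dataset

ds = load_dataset("travis0103/abstract_paper_review")
print(ds)              # show available splits and columns
print(ds["train"][0])  # inspect one conversational example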
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Reasoning-1 1K
About
This dataset supports SFT training of LLMs in the Alpaca format. The goal of the dataset is to teach an LLM to reason and to analyze its own mistakes through SFT training. At 1.15K examples it is quite small, so for effective training with SFTTrainer set 4-6 epochs instead of 1-3. Made by Fluently Team (@ehristoforu) using distilabel with love 🥰
Dataset structure
This subset can be loaded as:

from datasets import load_dataset
ds =…

See the full description on the dataset page: https://huggingface.co/datasets/fluently-sets/reasoning-1-1k.
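Since the repository id appears in the URL above, a hedged completion of that snippet, with the card's 4-6 epoch recommendation wired into an SFTConfig, might look like this (TRL keyword names vary across versions):

# Sketch: load the dataset and apply the recommended epoch count.
# The repo id comes from the dataset URL above; the output path is
# hypothetical, and TRL config fields are version-dependent.
from datasets import load_dataset
from trl import SFTConfig

ds = load_dataset("fluently-sets/reasoning-1-1k")

config = SFTConfig(
    output_dir="./reasoning-sft",  # hypothetical output path
    num_train_epochs=5,            # middle of the recommended 4-6 range
)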