Dataset Card for "llama2-sst2-finetuning"
Dataset Description
The Llama2-sst2-fine-tuning dataset is designed for supervised fine-tuning of LLaMA V2 on the GLUE SST-2 sentiment-classification task. We provide two subsets: training and validation. To ensure effective fine-tuning, we convert the data into the prompt template for LLaMA V2 supervised fine-tuning, so each example follows this format:
```
[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```
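For illustration, here is a hypothetical sketch of wrapping a GLUE SST-2 example into this template with the Hugging Face datasets library (the system prompt and label wording are illustrative assumptions, not taken from the dataset itself):

```python
from datasets import load_dataset

# Illustrative system prompt; the dataset's actual wording may differ
SYSTEM_PROMPT = "Classify the sentiment of the sentence as positive or negative."

def to_llama2_prompt(example):
    # GLUE SST-2 rows carry a "sentence" field and a 0/1 "label"
    label = "positive" if example["label"] == 1 else "negative"
    text = (
        f"[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n"
        f"{example['sentence']} [/INST] {label}"
    )
    return {"text": text}

sst2 = load_dataset("glue", "sst2", split="train")
sst2 = sst2.map(to_llama2_prompt)
print(sst2[0]["text"])
```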
Hmoumad/Prepared-Dataset-Fine-Tune-Llama-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Container dataset for demonstration of Hugging Face models on Redivis. Currently just contains a single BERT model, but may expand in the future.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
BAAI_bge-small-en-v1_5-02082024-vrdv-webapp Dataset
Dataset Description
The dataset "general domain" is a generated dataset designed to support the development of domain specific embedding models for retrieval tasks.
Associated Model
This dataset was used to train the BAAI_bge-small-en-v1_5-02082024-vrdv-webapp model.
How to Use
To use this dataset for model training or evaluation, you can load it using the Hugging Face datasets library as follows:… See the full description on the dataset page: https://huggingface.co/datasets/fine-tuned/BAAI_bge-small-en-v1_5-02082024-vrdv-webapp.
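A minimal sketch of the elided loading snippet, assuming the standard datasets API:

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("fine-tuned/BAAI_bge-small-en-v1_5-02082024-vrdv-webapp")
print(dataset)
```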
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Details
The dataset has about 1 million tokens for training and about 1,500 question-answer pairs.
Dataset Description
This dataset is a comprehensive compilation of questions related to dermatology, spanning inquiries about various skin diseases, their symptoms, recommended medications, and available treatment modalities. Each question is paired with a concise and informative response, making it an ideal resource for training and fine-tuning language models in the… See the full description on the dataset page: https://huggingface.co/datasets/Mreeb/Dermatology-Question-Answer-Dataset-For-Fine-Tuning.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Teaser visualization: https://raw.githubusercontent.com/tingxueronghua/ChartLlama-code/refs/heads/main/static/teaser_visualization_final_v3.png
A link to the original dataset on Hugging Face: https://huggingface.co/datasets/listen2you002/ChartLlama-Dataset
This dataset can be used to fine-tune Visual Language Models (VLMs) for the visual question answering (VQA) task (answering questions about charts and diagrams); a sketch of reading one record follows the table below.
Table with examples of content
| conversations | id | image |
|---|---|---|
| [ { "from": "human", "value": "What is the title of the chart?" }, { "from": "gpt", "value": "Analysis of smartphone usage patterns" } ] | ours_simplified_qa_37_0 | ours/box_chart/png/box_chart_100examples_37.png |
| [ { "from": "human", "value": "What are the outliers in the Microwave usage data?" }, … ] | ours_simplified_qa_56_2 | ours/box_chart/png/box_chart_100examples_56.png |
| [ { "from": "human", "value": "What's the food consumption of USA in Year 2?" }, … ] | ours_simplified_qa_69_0 | ours/box_chart/png/box_chart_100examples_69.png |
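A hypothetical sketch of reading one record in the conversations format shown above, splitting the dialogue into (question, answer) pairs for VQA fine-tuning (the record is copied from the first table row; the alternating human/gpt turn structure is an assumption based on that example):

```python
# One record in the format shown in the table above
record = {
    "conversations": [
        {"from": "human", "value": "What is the title of the chart?"},
        {"from": "gpt", "value": "Analysis of smartphone usage patterns"},
    ],
    "id": "ours_simplified_qa_37_0",
    "image": "ours/box_chart/png/box_chart_100examples_37.png",
}

# Pair each human question with the gpt answer that follows it
turns = record["conversations"]
pairs = [
    (turns[i]["value"], turns[i + 1]["value"])
    for i in range(0, len(turns) - 1, 2)
    if turns[i]["from"] == "human" and turns[i + 1]["from"] == "gpt"
]
print(pairs)
```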
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
More details will be added
Est-RoBERTa is a monolingual Estonian RoBERTa-like language representation model. It was trained on Estonian corpora, containing mostly news articles, with 2.51 billion tokens in total.
The model can be used for various NLP classification tasks by fine-tuning it end-to-end, or alternatively by extracting the word-embedding vectors for each word occurrence and using those vectors as input. The model vocabulary consists of 40,000 (subword) tokens. Any word not present in the vocabulary is split into subword tokens, e.g. "identification" might be split as "▁identif ic ation". Tokens that form the beginning of a word (or the whole word) have a special character (▁) prepended (this is not the underscore character). Tokens that form a non-initial part of a word have no characters prepended or appended.
The model is provided in PyTorch format for use with the Hugging Face Transformers toolkit (https://huggingface.co/transformers/), where it is also hosted (https://huggingface.co/EMBEDDIA/est-roberta).
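As a minimal sketch of the embedding-extraction workflow (assuming the standard Transformers auto classes; the Estonian sentence is an illustrative placeholder):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/est-roberta")
model = AutoModel.from_pretrained("EMBEDDIA/est-roberta")

# Tokenize an Estonian sentence; out-of-vocabulary words are split into
# subword tokens, with "▁" marking word-initial pieces
inputs = tokenizer("See on näidislause.", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

# Extract a contextual embedding vector for each token
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # shape: (1, seq_len, hidden_size)
```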
Incorrect12321/fine-tuning-dataset-mental-models-Llama3.1-8B dataset hosted on Hugging Face and contributed by the HF Datasets community
Overview
The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them in a variety of downstream tasks.
Dataset Generation
Base Model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
Seed Instructions: Sourced from the databricks/databricks-dolly-15k dataset
Generation Approach: Example-guided and topic-guided strategies
Total Instructions: 1,504 unique instruction examples
Dataset Sources
Repository: Bitbucket Project
Paper: Pre-Print
Structure
Each entry in the dataset contains:
- Instruction
- Response
Usage
The LaMini Dataset can be used to fine-tune language models to improve their ability to follow instructions and generate relevant responses.
Access
The dataset is available on Hugging Face at the following link: https://huggingface.co/datasets/SurgeGlobal/LaMini
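A minimal loading sketch, assuming the standard datasets API:

```python
from datasets import load_dataset

# Load the LaMini instruction dataset from the Hugging Face Hub
lamini = load_dataset("SurgeGlobal/LaMini")
print(lamini)  # each entry contains an Instruction and a Response
```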
Citation
If you find our work useful, please cite our paper as follows:

@misc{surge2024openbezoar,
  title={OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data},
  author={Chandeepa Dissanayake and Lahiru Lowe and Sachith Gunasekara and Yasiru Ratnayake},
  year={2024},
  eprint={2404.12195},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Dataset Authors
Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, and Yasiru Ratnayake
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
FEVER-256-24-gpt-4o-2024-05-13-989429 Dataset
Dataset Description
The dataset "dataset search for fact verification" is a generated dataset designed to support the development of domain specific embedding models for retrieval tasks.
Associated Model
This dataset was used to train the FEVER-256-24-gpt-4o-2024-05-13-989429 model.
How to Use
To use this dataset for model training or evaluation, you can load it using the Hugging Face datasets library as… See the full description on the dataset page: https://huggingface.co/datasets/fine-tuned/FEVER-256-24-gpt-4o-2024-05-13-989429.
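As with the other generated datasets above, a minimal sketch of the elided loading snippet, assuming the standard datasets API:

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("fine-tuned/FEVER-256-24-gpt-4o-2024-05-13-989429")
print(dataset)
```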
Allen1222/Test-fine-tune dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for SaferDecoding Fine Tuning Dataset
This dataset is intended for fine-tuning models to defend against jailbreak attacks. It is an extension of SafeDecoding.
Dataset Details
Dataset Description
The dataset generation process was adapted from SafeDecoding. The dataset includes 252 original human-generated adversarial seed prompts covering 18 harmful categories, along with responses generated by Llama2, Vicuna, Dolphin, Falcon… See the full description on the dataset page: https://huggingface.co/datasets/aspear/saferdecoding-fine-tuning.
TOOLVERIFIER: Generalization to New Tools via Self-Verification
This repository contains the ToolSelect dataset which was used to fine-tune Llama-2 70B for tool selection.
Data
ToolSelect is synthetic training data generated for the tool selection task using Llama-2 70B and Llama-2-Chat-70B. It consists of 555 samples corresponding to 173 tools. Each training sample is composed of a user instruction, a candidate set of tools that includes the ground truth tool, and a… See the full description on the dataset page: https://huggingface.co/datasets/facebook/toolverifier.
---
license: apache-2.0
task_categories:
  - feature-extraction
  - sentence-similarity
language:
  - en
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - mteb
  - Events
  - Meetups
  - Networking
  - Community
  - Social
pretty_name: event search for local meetups
size_categories:
  - n<1K
---
jina-embeddings-v2-base-en-03052024-21on-webapp Dataset
Dataset Description
The dataset is a generated dataset designed to support the development of domain… See the full description on the dataset page: https://huggingface.co/datasets/fine-tuned/jina-embeddings-v2-base-en-03052024-21on-webapp.
Anthony3456347095/llama2-fine-tune-v2-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
E5-finetune Dataset
E5-finetune Dataset is a curated collection of query-passage pairs, encompassing a total of 870k examples. This dataset is specifically designed for fine-tuning models to extend their input length capabilities from 512 tokens to 1024 tokens. The primary focus is on accumulating long-context passages.
Dataset in English
The dataset samples long-context passage examples from various sources, ensuring a rich and diverse collection. The sources include:… See the full description on the dataset page: https://huggingface.co/datasets/ProfessorBob/E5-finetune-dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
garystafford/fine-tune-nvidia-blackwell dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
withmuse/fine-tune-test dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "llama2-sst2-finetuning"
Dataset Description
The Llama2-sst2-fine-tuning dataset is designed for supervised fine-tuning of the LLaMA V2 based on the GLUE SST2 for sentiment analysis classification task.We provide two subsets: training and validation.To ensure the effectiveness of fine-tuning, we convert the data into the prompt template for LLaMA V2 supervised fine-tuning, where the data will follow this format:
[INST] <