Dataset for fine-tuning an embedding model for AI job search. Data sourced from datastax/linkedin_job_listings. Data used to fine-tune shawhin/distilroberta-ai-job-embeddings for AI job search. Links
GitHub Repo Video link Blog link
This dataset was created by takaito
Dataset for fine-tuning gemma-3-1b-it for function calling. The code and other resources for this project are linked below. Resources:
YouTube Video Blog Post GitHub Repo Fine-tuned Model | Original Model
Citation
If you find this dataset helpful, please cite: @dataset{talebi2025, author = {Shaw Talebi}, title = {tool-use-finetuning}, year = {2025}, publisher = {Hugging Face}, howpublished =… See the full description on the dataset page: https://huggingface.co/datasets/shawhin/tool-use-finetuning.
https://choosealicense.com/licenses/other/
Tool Finetuning Dataset
Dataset Description
Dataset Summary
This dataset is designed for fine-tuning language models to use tools (function calling) appropriately based on user queries. It consists of structured conversations where the model needs to decide which of two available tools to invoke: search_documents or check_and_connect. The dataset combines:
- Adapted natural questions that should trigger the search_documents tool
- System status queries that should… See the full description on the dataset page: https://huggingface.co/datasets/asanchez75/tool_finetuning_dataset.
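To make the decision task concrete, here is a minimal sketch of what one training sample could look like (the JSON-style schema, field names, and example questions below are illustrative assumptions, not the dataset's confirmed format):

# Hypothetical tool-use training sample (schema assumed for illustration)
sample = {
    "messages": [
        {"role": "user", "content": "What does the onboarding guide say about security training?"},
        {
            "role": "assistant",
            "tool_call": {
                "name": "search_documents",
                "arguments": {"query": "onboarding guide security training"},
            },
        },
    ]
}
# A system status query such as "Is the service reachable right now?"
# would instead be answered with a check_and_connect tool call.
print(sample["messages"][1]["tool_call"]["name"])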
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
Alpaca is the perfect dataset for fine-tuning your language models to better understand and follow instructions, taking you beyond standard Natural Language Processing (NLP) abilities. This curated, cleaned dataset provides over 52,000 expertly crafted instructions and demonstrations generated by OpenAI's text-davinci-003 engine, all in English (BCP-47 en). The instruction, output, and input fields are designed to enhance every aspect of a model's comprehension. The data has gone through rigorous cleaning to remove errors and biases, so you can trust that it will improve the performance of any language model trained on it. Get ready to see what Alpaca can do for your NLP needs.
This dataset provides a unique and valuable resource for anyone who wishes to create, develop and train language models. Alpaca provides users with 52,000 instruction-demonstration pairs generated by OpenAI's text-davinci-003 engine.
The data included in this dataset is formatted into 3 columns: “instruction”, “output” and “input.” All the data is written in English (BCP-47 en).
To make the most out of this dataset it is recommended to:
Familiarize yourself with the instructions in the instruction column, as these provide guidance on how to use the other two columns: input and output.
Once comfortable with the instruction column, move on to exploring the sets of triplets – instruction, output and input – included in this clean version of Alpaca.
Read through many examples, paying attention to any areas you feel could be clarified or improved for a better understanding of language models; bear in mind that these examples have already been cleaned of the errors and biases found in the original dataset.
Get inspired! As mentioned earlier, more than 52k sets are provided, giving you plenty of flexibility for varying training strategies or unique approaches when creating your own language model.
Finally, while not essential, it may be helpful to be familiar with OpenAI's text-davinci engine and to experiment with different parameters/options depending on the outcomes you wish to achieve.
- Developing natural language processing (NLP) tasks that aim to better automate and interpret instructions given by humans.
- Training machine learning models of robotic agents to be able to understand natural language commands, as well as understand the correct action that needs to be taken in response.
- Creating a system that can generate personalized instructions and feedback in real time based on language models, catering specifically to each individual user's preferences or needs
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------------|
| instruction | This column contains the instructions for the language model. (Text)       |
| output      | This column contains the expected output from the language model. (Text)   |
| input       | This column contains the input given to the language model. (Text)         |
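As a quick start, here is a minimal sketch that loads train.csv and assembles a single training text from the three columns (the prompt wording is illustrative, not the official Alpaca template):

import pandas as pd

df = pd.read_csv("train.csv")  # columns: instruction, input, output

def build_text(row):
    # The input column is optional; include it only when present
    if isinstance(row["input"], str) and row["input"].strip():
        return (f"Instruction: {row['instruction']}\n"
                f"Input: {row['input']}\n"
                f"Response: {row['output']}")
    return f"Instruction: {row['instruction']}\nResponse: {row['output']}"

df["text"] = df.apply(build_text, axis=1)
print(df["text"].iloc[0])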
If you use this dataset in your research, please credit the original authors and the Huggingface Hub.
Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains prompts and their corresponding responses across several domains, including healthcare, telecom, and banking. It can be used to fine-tune models such as Llama, Phi, and Gemma; after fine-tuning, the model should be able to answer questions from these domains well.
Code is attached with this dataset that uses it to train the Phi-3.5-mini-instruct model; it can be used as a reference for training your own model.
If you find any errors or scope of possible improvements, do let us know in the Discussions.
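For readers who want to adapt the idea without the attached notebook, here is a minimal LoRA fine-tuning sketch using transformers and peft. The checkpoint name, CSV filename, column names, and hyperparameters are assumptions to be adjusted to the actual files shipped with this dataset:

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "microsoft/Phi-3.5-mini-instruct"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Train only a small LoRA adapter instead of the full model
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))

# Assumed CSV with "prompt" and "response" columns
data = load_dataset("csv", data_files="domain_prompts.csv")["train"]

def tokenize(example):
    text = f"{example['prompt']}\n{example['response']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phi35-domain-qa",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()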
MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for SaferDecoding Fine Tuning Dataset
This dataset is intended for fine-tuning models to defend against jailbreak attacks. It is an extension of SafeDecoding.
Dataset Details
Dataset Description
The dataset generation process was adapted from SafeDecoding. This dataset includes 252 original human-generated adversarial seed prompts, covering 18 harmful categories. This dataset includes responses generated by Llama2, Vicuna, Dolphin, Falcon… See the full description on the dataset page: https://huggingface.co/datasets/aspear/saferdecoding-fine-tuning.
Dataset Card for "llama2-sst2-finetuning"
Dataset Description
The Llama2-sst2-fine-tuning dataset is designed for supervised fine-tuning of LLaMA V2 on the GLUE SST2 sentiment analysis classification task. We provide two subsets: training and validation. To ensure the effectiveness of fine-tuning, we convert the data into the prompt template for LLaMA V2 supervised fine-tuning, where the data follows this format:
[INST] <
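The template above is truncated by the page. For reference, the widely used LLaMA V2 chat format looks like the sketch below; the system prompt and example sentence are placeholders, and whether this dataset wraps SST2 in exactly this way should be checked on the dataset page:

# Standard LLaMA V2 chat wrapping (contents are illustrative placeholders)
llama2_prompt = (
    "<s>[INST] <<SYS>>\n"
    "You are a sentiment classifier. Reply with positive or negative.\n"
    "<</SYS>>\n\n"
    "a gripping, well-acted film [/INST] positive </s>"
)
print(llama2_prompt)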
Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is used to pre-train and fine-tune a transformer network. For details on how the data was collected and used, visit the GitHub repo. All of the data was collected by me except 'original_hate_speech_data'. See the GitHub repo for more detail!
This is an install package for LLM RAG and fine-tuning, bundling essential libraries such as huggingface_hub, transformers, langchain, evaluate, sentence-transformers, etc. It is suitable for Kaggle competitions with an offline requirement; the packages were downloaded from the Kaggle development environment.
Supported packages are listed below:
transformers
datasets
accelerate
bitsandbytes
langchain
langchain-community
sentence-transformers
chromadb
faiss-cpu
huggingface_hub
langchain-text-splitters
peft
trl
umap-learn
evaluate
deepeval
weave
Suggested install commands in Kaggle:
!pip install transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/tranformers
!pip install -U datasets --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/datasets
!pip install -U accelerate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/accelerate
!pip install build --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/build-1.2.1-py3-none-any.whl
!pip install -U bitsandbytes --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl
!pip install langchain --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain-0.2.5-py3-none-any.whl
!pip install langchain-core --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_core-0.2.9-py3-none-any.whl
!pip install langsmith --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langsmith-0.1.81-py3-none-any.whl
!pip install langchain-community --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_community-0.2.5-py3-none-any.whl
!pip install sentence-transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/sentence_transformers-3.0.1-py3-none-any.whl
!pip install chromadb --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/chromadb-0.5.3-py3-none-any.whl
!pip install faiss-cpu --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!pip install -U huggingface_hub --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/huggingface_hub
!pip install -qU langchain-text-splitters --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_text_splitters-0.2.1-py3-none-any.whl
!pip install -U peft --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/peft-0.11.1-py3-none-any.whl
!pip install -U trl --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/trl-0.9.4-py3-none-any.whl
!pip install umap-learn --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/umap-learn
!pip install evaluate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/evaluate-0.4.2-py3-none-any.whl
!pip install deepeval --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/deepeval-0.21.59-py3-none-any.whl
!pip install weave --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/weave-0.50.2-py3-none-any.whl
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) - https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for investopedia-instruction-tuning dataset
We curate a substantial dataset pertaining to finance from Investopedia, using a new technique that leverages unstructured scraped data and an LLM to generate structured data suitable for fine-tuning embedding models. The dataset generation uses a new self-verification method that ensures, with high probability, that the generated question-answer pairs are not hallucinated by the LLM.
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/FinLang/investopedia-instruction-tuning-dataset.
MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by DesolationOfSmaug
Released under MIT
MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
The pretraining dataset is available at this link: HIT-TMG/KaLM-embedding-pretrain-data.
Languages
English, Chinese, Multilingual
Dataset Structure
Each dataset is in the following format:
- query: string, one query per sample
- pos: list[string], usually containing one positive example
- neg: list[string], usually containing seven negative examples
Dataset Summary
All these datasets have been preprocessed and can be used for finetuning your embedding models.… See the full description on the dataset page: https://huggingface.co/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data.
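To illustrate how the query / pos / neg format maps onto embedding fine-tuning, here is a minimal sentence-transformers sketch. The base model is an assumption, and the load_dataset call may need a specific configuration or data_files argument depending on how the repository's files are organized:

from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Assumed base model; swap in the embedding model you want to fine-tune
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# May require a config name or data_files= depending on the repo layout
ds = load_dataset("KaLM-Embedding/KaLM-embedding-finetuning-data", split="train")

# Build (query, positive, negative) triplets from the fields described above
examples = [InputExample(texts=[row["query"], row["pos"][0], row["neg"][0]])
            for row in ds.select(range(1000))]

loader = DataLoader(examples, shuffle=True, batch_size=32)
loss = losses.TripletLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)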
Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This model was fine-tuned as part of an artificial intelligence course at Gazi University in Ankara using a custom dataset created by the students and instructors. The model is optimized for a specific task, such as sentiment analysis or text classification, in the Turkish language.
The base model is bert-base-turkish-cased (example). The model can be directly used for tasks such as text classification, sentiment analysis, or other natural language processing tasks in Turkish.
The model can be integrated into larger ecosystems or more complex projects.
The model should not be used for unethical or malicious purposes. Additionally, it may have limited performance for multilingual tasks.
This model may inherit biases present in the training dataset. It is designed for Turkish, and performance may degrade for other languages or domains outside its training data.
Users are advised to be aware of the model's limitations due to its training dataset and validate its results for their specific use case.
You can use the following code snippet to load and test the model:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load the model
model_name = "gazi-university/fine-tuned-turkish-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Example input
text = "This AI model works perfectly!"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
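To turn the raw outputs into a predicted class, you can take the argmax over the logits (a minimal continuation of the snippet above; the label names come from the model's own config and may be generic LABEL_0/LABEL_1 if the author did not set them):

# Pick the highest-scoring class and map it to its label
predicted_id = outputs.logits.argmax(dim=-1).item()
print(model.config.id2label.get(predicted_id, predicted_id))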
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) - https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset contains over 4 million logs written in 32 languages and is tailored for LLM training. It includes prompt and response pairs from 3 models and is designed for instruction fine-tuning of language models to achieve improved performance on various NLP tasks.
Ukrainian, Turkish, Thai, Swedish, Slovak, Portuguese (Brazil), Portuguese, Polish, Persian, Dutch, Marathi, Malayalam, Korean, Japanese, Italian, Indonesian, Hungarian, Hindi, Irish, Greek, German, French, Finnish, Esperanto, English, Danish, Czech, Chinese, Catalan, Azerbaijani, Arabic
The dataset features a comprehensive training corpus with prompts and answers, suitable for text generation, question answering, and text classification. It is valuable for adapting pre-trained LLMs to specific tasks and needs across a range of generation tasks in language processing.
The dataset has the following columns:
- language: language the prompt is written in
- model: type of model (GPT-3.5, GPT-4, or an uncensored GPT version)
- time: time when the answer was generated
- text: the user's prompt
- response: the response generated by the model
The text corpus supports instruction tuning and supervised fine-tuning for larger language models, enhancing text generation and human language understanding. With a focus on generating human-like content, it is useful for evaluating LLMs, improving generation capabilities, and performing well in classification tasks. This dataset also assists in mitigating biases, supporting longer texts, and optimizing LLM architectures for more effective language processing and language understanding.
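Given the columns listed above, a minimal sketch for slicing the corpus by language and model could look like this (the CSV filename is a placeholder for whichever file you download):

import pandas as pd

df = pd.read_csv("llm_logs.csv")  # placeholder filename

# Keep only English prompts answered by GPT-4
subset = df[(df["language"] == "English") & (df["model"] == "GPT-4")]
print(subset[["text", "response"]].head())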
Dataset Card for llama-2-banking-fine-tune
This dataset has been created with Argilla. As shown in the sections below, this dataset can be loaded into Argilla as explained in Load with Argilla, or used directly with the datasets library in Load with datasets.
Dataset Summary
This dataset contains:
A dataset configuration file conforming to the Argilla dataset format named argilla.yaml. This configuration file will be used to configure the dataset when using the… See the full description on the dataset page: https://huggingface.co/datasets/argilla/llama-2-banking-fine-tune.
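For a quick look at the records without an Argilla server, the dataset can also be pulled directly with the datasets library (a minimal sketch; the "train" split name is an assumption):

from datasets import load_dataset

records = load_dataset("argilla/llama-2-banking-fine-tune", split="train")
print(records[0])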
Here are almost all the packages you may need for LLM fine-tuning. If you find this helpful, PLEASE UPVOTE!
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.
The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should start from a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of STEM (Science, Technology, Engineering and Math).
To be completed
python
from datasets import load_dataset
dataset = load_dataset("patrickfleith/AstroChat")

There are 901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):
- id: a unique identifier for this specific conversation. Useful for traceability purposes, especially for further processing tasks or for merging with other datasets.
- topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
- subtopic: a subtopic of the topic. For instance in the topic of Propulsion, there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
- persona: description of the persona used to simulate a user
- opening_question: the first question asked by the user to start a conversation with the AI-assistant
- messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library (see the sketch after this list). A list of messages where each message is a dictionary with the following fields:
- role: the role of the speaker, either user or assistant
- content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
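Because each message already follows the role/content convention, the messages field can be passed straight to a chat template. A minimal sketch (the tokenizer checkpoint and the "train" split name are assumptions; use whichever chat model you plan to fine-tune):

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("patrickfleith/AstroChat", split="train")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # assumed chat model

# Render one conversation into the model's chat format
text = tokenizer.apply_chat_template(dataset[0]["messages"], tokenize=False)
print(text[:500])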
Important: See the full list of topics and subtopics covered below.
Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main
We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from Sector I: Questions about the World of their paper:
Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
The gpt-4-turbo model was used to generate the answers to the opening questions. All instances in the dataset are in English.
901 synthetically-generated dialogues
AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International
No restriction. Please provide the correct attribution following the license terms.
Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579
Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)
Use the ...
Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Alpaca-Cleaned
Dataset Description
This is an isiZulu-translated version of the original Alpaca Dataset released by Stanford, Cosmopedia by HuggingFace, and WikiHow (Mahnaz et al., 2018).
Original Alpaca Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow… See the full description on the dataset page: https://huggingface.co/datasets/ChallengerSpaceShuttle/zulu-finetuning-dataset.
MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
Solution writeup: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/470395
For training only: train_neg_list.pickle and train_pos_list.pickle contain around 500,000 pairs for pretraining the classifiers. train_df.csv is for the final fine-tuning step. train_v4_drcat_01.csv can be downloaded from https://www.kaggle.com/datasets/thedrcat/daigt-v4-train-dataset
17/, 19/, and 20/ in the code/ folder are for classifier pretraining; you need to run them first. The _ft1/ folders are for fine-tuning on train_v4_drcat_01.csv, and the _ft103/ folders are for fine-tuning on train_df.csv. Please correct the input dirs in all the folders.
Inference kernel: https://www.kaggle.com/code/wowfattie/daigt-2nd-place
In case you are interested in how to generate train_neg_list.pickle and train_pos_list.pickle, everything is in the gaigtdatagenerationforpretrain/ folder. Perform the following steps:
1) Download the SlimPajama dataset.
2) Run preprocess_external_chunk1-10.py for file selection and random chunking. Only the C4 subset was used, and only files with a word length > 2048 were used, because I wanted to make sure the LLMs have 1024 tokens as the prompt and generate the next 1024 tokens.
3) Run the python files in every folder named after an LLM. Note that some files may error because I forgot to add padding; I was only able to run roughly 90% of those files.
4) Run split1.py for assembling.
If you are interested in how to generate train_df.csv, go to daigtdatagenerationforfinetune/ and perform the following steps:
1) Install h2o-llmstudio.
2) Run prepare_data_5_promts.py to generate the input file for fine-tuning. Only essays for the 5 prompts in the test set were included.
3) Perform fine-tuning. The config files are in the folders inside llmstudio_configs/; those folders are named after the LLMs used.
4) Run all the python files with LLM names.
5) Run prepare_train_data.py for assembling.