Dataset for fine-tuning an embedding model for AI job search. Data sourced from datastax/linkedin_job_listings. Data used to fine-tune shawhin/distilroberta-ai-job-embeddings for AI job search. Links
GitHub Repo Video link Blog link
This dataset was created by takaito
Dataset for fine-tuning gemma-3-1b-it for function calling. The code and other resources for this project are linked below. Resources:
YouTube Video Blog Post GitHub Repo Fine-tuned Model | Original Model
Citation
If you find this dataset helpful, please cite: @dataset{talebi2025, author = {Shaw Talebi}, title = {tool-use-finetuning}, year = {2025}, publisher = {Hugging Face}, howpublished =… See the full description on the dataset page: https://huggingface.co/datasets/shawhin/tool-use-finetuning.
https://choosealicense.com/licenses/other/
Tool Finetuning Dataset
Dataset Description
Dataset Summary
This dataset is designed for fine-tuning language models to use tools (function calling) appropriately based on user queries. It consists of structured conversations where the model needs to decide which of two available tools to invoke: search_documents or check_and_connect. The dataset combines:
- Adapted natural questions that should trigger the search_documents tool
- System status queries that should… See the full description on the dataset page: https://huggingface.co/datasets/asanchez75/tool_finetuning_dataset.
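To make the decision task concrete, here is a minimal sketch of what one training sample could look like (the JSON-style schema, field names, and example questions below are illustrative assumptions, not the dataset's confirmed format):

# Hypothetical tool-use training sample (schema assumed for illustration)
sample = {
    "messages": [
        {"role": "user", "content": "What does the onboarding guide say about security training?"},
        {
            "role": "assistant",
            "tool_call": {
                "name": "search_documents",
                "arguments": {"query": "onboarding guide security training"},
            },
        },
    ]
}
# A system status query such as "Is the service reachable right now?"
# would instead be answered with a check_and_connect tool call.
print(sample["messages"][1]["tool_call"]["name"])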
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
Alpaca is the perfect dataset for fine-tuning your language models to better understand and follow instructions, taking you beyond standard Natural Language Processing (NLP) abilities. This curated, cleaned dataset provides over 52,000 expertly crafted instructions and demonstrations generated by OpenAI's text-davinci-003 engine, all in English (BCP-47 en). The instruction, output, and input fields are designed to enhance every aspect of a model's comprehension. The data has gone through rigorous cleaning to remove errors and biases, so you can trust that it will improve the performance of any language model trained on it. Get ready to see what Alpaca can do for your NLP needs.
This dataset provides a unique and valuable resource for anyone who wishes to create, develop and train language models. Alpaca provides users with 52,000 instruction-demonstration pairs generated by OpenAI's text-davinci-003 engine.
The data included in this dataset is formatted into 3 columns: “instruction”, “output” and “input.” All the data is written in English (BCP-47 en).
To make the most out of this dataset it is recommended to:
Familiarize yourself with the instructions in the instruction column, as these provide guidance on how to use the other two columns: input and output.
Once comfortable with the instruction column, move on to exploring the sets of triplets – instruction, output and input – included in this clean version of Alpaca.
Read through many examples, paying attention to any areas you feel could be clarified or improved for a better understanding of language models; bear in mind that these examples have already been cleaned of the errors and biases found in the original dataset.
Get inspired! As mentioned earlier, more than 52k sets are provided, giving you plenty of flexibility for varying training strategies or unique approaches when creating your own language model.
Finally, while not essential, it may be helpful to be familiar with OpenAI's text-davinci engine and to experiment with different parameters/options depending on the outcomes you wish to achieve.
- Developing natural language processing (NLP) tasks that aim to better automate and interpret instructions given by humans.
- Training machine learning models of robotic agents to be able to understand natural language commands, as well as understand the correct action that needs to be taken in response.
- Creating a system that can generate personalized instructions and feedback in real time based on language models, catering specifically to each individual user's preferences or needs
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------------|
| instruction | This column contains the instructions for the language model. (Text)       |
| output      | This column contains the expected output from the language model. (Text)   |
| input       | This column contains the input given to the language model. (Text)         |
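As a quick start, here is a minimal sketch that loads train.csv and assembles a single training text from the three columns (the prompt wording is illustrative, not the official Alpaca template):

import pandas as pd

df = pd.read_csv("train.csv")  # columns: instruction, input, output

def build_text(row):
    # The input column is optional; include it only when present
    if isinstance(row["input"], str) and row["input"].strip():
        return (f"Instruction: {row['instruction']}\n"
                f"Input: {row['input']}\n"
                f"Response: {row['output']}")
    return f"Instruction: {row['instruction']}\nResponse: {row['output']}"

df["text"] = df.apply(build_text, axis=1)
print(df["text"].iloc[0])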
If you use this dataset in your research, please credit the original authors and the Huggingface Hub.
Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains prompts and their corresponding responses across several domains, including healthcare, telecom, and banking. It can be used to fine-tune models such as Llama, Phi, and Gemma; after fine-tuning, the model should be able to answer questions from these domains well.
Code is attached with this dataset that uses it to train the Phi-3.5-mini-instruct model; it can be used as a reference for training your own model.
If you find any errors or scope of possible improvements, do let us know in the Discussions.
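For readers who want to adapt the idea without the attached notebook, here is a minimal LoRA fine-tuning sketch using transformers and peft. The checkpoint name, CSV filename, column names, and hyperparameters are assumptions to be adjusted to the actual files shipped with this dataset:

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "microsoft/Phi-3.5-mini-instruct"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Train only a small LoRA adapter instead of the full model
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))

# Assumed CSV with "prompt" and "response" columns
data = load_dataset("csv", data_files="domain_prompts.csv")["train"]

def tokenize(example):
    text = f"{example['prompt']}\n{example['response']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phi35-domain-qa",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()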
MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for SaferDecoding Fine Tuning Dataset
This dataset is intended for fine-tuning models to defend against jailbreak attacks. It is an extension of SafeDecoding.
Dataset Details
Dataset Description
The dataset generation process was adapted from SafeDecoding. This dataset includes 252 original human-generated adversarial seed prompts, covering 18 harmful categories. This dataset includes responses generated by Llama2, Vicuna, Dolphin, Falcon… See the full description on the dataset page: https://huggingface.co/datasets/aspear/saferdecoding-fine-tuning.
Dataset Card for "llama2-sst2-finetuning"
Dataset Description
The Llama2-sst2-fine-tuning dataset is designed for supervised fine-tuning of LLaMA V2 on the GLUE SST2 sentiment analysis classification task. We provide two subsets: training and validation. To ensure the effectiveness of fine-tuning, we convert the data into the prompt template for LLaMA V2 supervised fine-tuning, where the data follows this format:
[INST] <
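The template above is truncated by the page. For reference, the widely used LLaMA V2 chat format looks like the sketch below; the system prompt and example sentence are placeholders, and whether this dataset wraps SST2 in exactly this way should be checked on the dataset page:

# Standard LLaMA V2 chat wrapping (contents are illustrative placeholders)
llama2_prompt = (
    "<s>[INST] <<SYS>>\n"
    "You are a sentiment classifier. Reply with positive or negative.\n"
    "<</SYS>>\n\n"
    "a gripping, well-acted film [/INST] positive </s>"
)
print(llama2_prompt)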
Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is used to pre-train and fine-tune a transformer network. For details on how the data was collected and used, visit the GitHub repo. All of the data was collected by me except 'original_hate_speech_data'. See the GitHub repo for more detail!
This is an install package for LLM RAG and fine-tuning, bundling essential libraries such as huggingface_hub, transformers, langchain, evaluate, sentence-transformers, etc. It is suitable for Kaggle competitions with an offline requirement; the packages were downloaded from the Kaggle development environment.
Supported packages are listed below:
transformers
datasets
accelerate
bitsandbytes
langchain
langchain-community
sentence-transformers
chromadb
faiss-cpu
huggingface_hub
langchain-text-splitters
peft
trl
umap-learn
evaluate
deepeval
weave
Suggested install commands in Kaggle:
!pip install transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/tranformers
!pip install -U datasets --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/datasets
!pip install -U accelerate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/accelerate
!pip install build --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/build-1.2.1-py3-none-any.whl
!pip install -U bitsandbytes --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl
!pip install langchain --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain-0.2.5-py3-none-any.whl
!pip install langchain-core --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_core-0.2.9-py3-none-any.whl
!pip install langsmith --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langsmith-0.1.81-py3-none-any.whl
!pip install langchain-community --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_community-0.2.5-py3-none-any.whl
!pip install sentence-transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/sentence_transformers-3.0.1-py3-none-any.whl
!pip install chromadb --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/chromadb-0.5.3-py3-none-any.whl
!pip install faiss-cpu --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!pip install -U huggingface_hub --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/huggingface_hub
!pip install -qU langchain-text-splitters --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_text_splitters-0.2.1-py3-none-any.whl
!pip install -U peft --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/peft-0.11.1-py3-none-any.whl
!pip install -U trl --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/trl-0.9.4-py3-none-any.whl
!pip install umap-learn --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/umap-learn
!pip install evaluate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/evaluate-0.4.2-py3-none-any.whl
!pip install deepeval --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/deepeval-0.21.59-py3-none-any.whl
!pip install weave --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/weave-0.50.2-py3-none-any.whl
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) - https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for investopedia-instruction-tuning dataset
We curate a substantial dataset pertaining to finance from Investopedia, using a new technique that leverages unstructured scraped data and an LLM to generate structured data suitable for fine-tuning embedding models. The dataset generation uses a new self-verification method that ensures, with high probability, that the generated question-answer pairs are not hallucinated by the LLM.
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/FinLang/investopedia-instruction-tuning-dataset.
MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by DesolationOfSmaug
Released under MIT
MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
The pretraining dataset is available at this link: HIT-TMG/KaLM-embedding-pretrain-data.
Languages
English, Chinese, Multilingual
Dataset Structure
Each dataset is in the following format:
- query: string, one query per sample
- pos: list[string], usually containing one positive example
- neg: list[string], usually containing seven negative examples
Dataset Summary
All these datasets have been preprocessed and can be used for finetuning your embedding models.… See the full description on the dataset page: https://huggingface.co/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data.
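To illustrate how the query / pos / neg format maps onto embedding fine-tuning, here is a minimal sentence-transformers sketch. The base model is an assumption, and the load_dataset call may need a specific configuration or data_files argument depending on how the repository's files are organized:

from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

# Assumed base model; swap in the embedding model you want to fine-tune
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# May require a config name or data_files= depending on the repo layout
ds = load_dataset("KaLM-Embedding/KaLM-embedding-finetuning-data", split="train")

# Build (query, positive, negative) triplets from the fields described above
examples = [InputExample(texts=[row["query"], row["pos"][0], row["neg"][0]])
            for row in ds.select(range(1000))]

loader = DataLoader(examples, shuffle=True, batch_size=32)
loss = losses.TripletLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)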
Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This model was fine-tuned as part of an artificial intelligence course at Gazi University in Ankara using a custom dataset created by the students and instructors. The model is optimized for a specific task, such as sentiment analysis or text classification, in the Turkish language.
The base model is bert-base-turkish-cased (example). The model can be directly used for tasks such as text classification, sentiment analysis, or other natural language processing tasks in Turkish.
The model can be integrated into larger ecosystems or more complex projects.
The model should not be used for unethical or malicious purposes. Additionally, it may have limited performance for multilingual tasks.
This model may inherit biases present in the training dataset. It is designed for Turkish, and performance may degrade for other languages or domains outside its training data.
Users are advised to be aware of the model's limitations due to its training dataset and validate its results for their specific use case.
You can use the following code snippet to load and test the model:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load the model
model_name = "gazi-university/fine-tuned-turkish-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Example input
text = "This AI model works perfectly!"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
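To turn the raw outputs into a predicted class, you can take the argmax over the logits (a minimal continuation of the snippet above; the label names come from the model's own config and may be generic LABEL_0/LABEL_1 if the author did not set them):

# Pick the highest-scoring class and map it to its label
predicted_id = outputs.logits.argmax(dim=-1).item()
print(model.config.id2label.get(predicted_id, predicted_id))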
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) - https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset contains over 4 million logs written in 32 languages and is tailored for LLM training. It includes prompt and response pairs from 3 models and is designed for instruction fine-tuning of language models to achieve improved performance on various NLP tasks.
Ukrainian, Turkish, Thai, Swedish, Slovak, Portuguese (Brazil), Portuguese, Polish, Persian, Dutch, Marathi, Malayalam, Korean, Japanese, Italian, Indonesian, Hungarian, Hindi, Irish, Greek, German, French, Finnish, Esperanto, English, Danish, Czech, Chinese, Catalan, Azerbaijani, Arabic
The dataset features a comprehensive training corpus with prompts and answers, suitable for text generation, question answering, and text classification. It is valuable for adapting pre-trained LLMs to specific tasks and needs across a range of generation tasks in language processing.
The dataset has the following columns:
- language: language the prompt is written in
- model: type of model (GPT-3.5, GPT-4, or an uncensored GPT version)
- time: time when the answer was generated
- text: the user's prompt
- response: the response generated by the model
The text corpus supports instruction tuning and supervised fine-tuning for larger language models, enhancing text generation and human language understanding. With a focus on generating human-like content, it is useful for evaluating LLMs, improving generation capabilities, and performing well in classification tasks. This dataset also assists in mitigating biases, supporting longer texts, and optimizing LLM architectures for more effective language processing and language understanding.
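Given the columns listed above, a minimal sketch for slicing the corpus by language and model could look like this (the CSV filename is a placeholder for whichever file you download):

import pandas as pd

df = pd.read_csv("llm_logs.csv")  # placeholder filename

# Keep only English prompts answered by GPT-4
subset = df[(df["language"] == "English") & (df["model"] == "GPT-4")]
print(subset[["text", "response"]].head())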
Dataset Card for llama-2-banking-fine-tune
This dataset has been created with Argilla. As shown in the sections below, this dataset can be loaded into Argilla as explained in Load with Argilla, or used directly with the datasets library in Load with datasets.
Dataset Summary
This dataset contains:
A dataset configuration file conforming to the Argilla dataset format named argilla.yaml. This configuration file will be used to configure the dataset when using the… See the full description on the dataset page: https://huggingface.co/datasets/argilla/llama-2-banking-fine-tune.
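For a quick look at the records without an Argilla server, the dataset can also be pulled directly with the datasets library (a minimal sketch; the "train" split name is an assumption):

from datasets import load_dataset

records = load_dataset("argilla/llama-2-banking-fine-tune", split="train")
print(records[0])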
Here are almost all the packages you may need for LLM fine-tuning. If you find this helpful, PLEASE UPVOTE!
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.
The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should start from a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of STEM (Science, Technology, Engineering and Math).
To be completed
python
from datasets import load_dataset
dataset = load_dataset("patrickfleith/AstroChat")

There are 901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):
- id: a unique identifier for this specific conversation. Useful for traceability purposes, especially for further processing tasks or for merging with other datasets.
- topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
- subtopic: a subtopic of the topic. For instance in the topic of Propulsion, there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
- persona: description of the persona used to simulate a user
- opening_question: the first question asked by the user to start a conversation with the AI-assistant
- messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library (see the sketch after this list). A list of messages where each message is a dictionary with the following fields:
- role: the role of the speaker, either user or assistant
- content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
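Because each message already follows the role/content convention, the messages field can be passed straight to a chat template. A minimal sketch (the tokenizer checkpoint and the "train" split name are assumptions; use whichever chat model you plan to fine-tune):

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("patrickfleith/AstroChat", split="train")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # assumed chat model

# Render one conversation into the model's chat format
text = tokenizer.apply_chat_template(dataset[0]["messages"], tokenize=False)
print(text[:500])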
Important: See the full list of topics and subtopics covered below.
Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main
We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from Sector I: Questions about the World of their paper:
Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
The gpt-4-turbo model was used to generate the answers to the opening questions. All instances in the dataset are in English.
901 synthetically-generated dialogues
AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International
No restriction. Please provide the correct attribution following the license terms.
Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579
Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)
Use the ...
Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Alpaca-Cleaned
Dataset Description
This is an isiZulu-translated version of the original Alpaca Dataset released by Stanford, Cosmopedia by HuggingFace, and WikiHow (Mahnaz et al., 2018).
Original Alpaca Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow… See the full description on the dataset page: https://huggingface.co/datasets/ChallengerSpaceShuttle/zulu-finetuning-dataset.
MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
Solution writeup: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/470395
For training only: train_neg_list.pickle and train_pos_list.pickle contain around 500,000 pairs for pretraining the classifiers. train_df.csv is for the final fine-tuning step. train_v4_drcat_01.csv can be downloaded from https://www.kaggle.com/datasets/thedrcat/daigt-v4-train-dataset
17/, 19/, and 20/ in the code/ folder are for classifier pretraining; you need to run them first. The _ft1/ folders are for fine-tuning on train_v4_drcat_01.csv, and the _ft103/ folders are for fine-tuning on train_df.csv. Please correct the input dirs in all the folders.
Inference kernel: https://www.kaggle.com/code/wowfattie/daigt-2nd-place
In case you are interested in how to generate train_neg_list.pickle and train_pos_list.pickle, everything is in the gaigtdatagenerationforpretrain/ folder. Perform the following steps:
1) Download the SlimPajama dataset.
2) Run preprocess_external_chunk1-10.py for file selection and random chunking. Only the C4 subset was used, and only files with a word length > 2048 were used, because I wanted to make sure the LLMs have 1024 tokens as the prompt and generate the next 1024 tokens.
3) Run the python files in every folder named after an LLM. Note that some files may error because I forgot to add padding; I was only able to run roughly 90% of those files.
4) Run split1.py for assembling.
If you are interested in how to generate train_df.csv, go to daigtdatagenerationforfinetune/ and perform the following steps:
1) Install h2o-llmstudio.
2) Run prepare_data_5_promts.py to generate the input file for fine-tuning. Only essays for the 5 prompts in the test set were included.
3) Perform fine-tuning. The config files are in the folders inside llmstudio_configs/; those folders are named after the LLMs used.
4) Run all the python files with LLM names.
5) Run prepare_train_data.py for assembling.