100+ datasets found
  1. ai-job-embedding-finetuning

    • huggingface.co
    Cite
    Shawhin Talebi, ai-job-embedding-finetuning [Dataset]. https://huggingface.co/datasets/shawhin/ai-job-embedding-finetuning
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Shawhin Talebi
    Description

    Dataset for fine-tuning an embedding model for AI job search. Data sourced from datastax/linkedin_job_listings and used to fine-tune shawhin/distilroberta-ai-job-embeddings. Links:

    GitHub Repo Video link Blog link
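
    As a rough sketch of how this data might be pulled down for embedding fine-tuning (the split name and pair-style columns are assumptions, not confirmed by the card, and the base checkpoint below is only illustrative):

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer

    # Load the job-embedding fine-tuning pairs from the Hugging Face Hub
    ds = load_dataset("shawhin/ai-job-embedding-finetuning", split="train")
    print(ds.column_names)  # inspect the actual pair columns before wiring up a loss

    # Illustrative starting checkpoint; the card only says a DistilRoBERTa model was fine-tuned
    model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")
    print(model.encode(["machine learning engineer, computer vision"]).shape)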

  2. Detect AI fine tuning models

    • kaggle.com
    zip
    Updated Jan 21, 2024
    Cite
    takaito (2024). Detect AI fine tuning models [Dataset]. https://www.kaggle.com/datasets/takaito/detect-ai-fine-tuning-models
    Explore at:
    zip (5276930873 bytes). Available download formats
    Dataset updated
    Jan 21, 2024
    Authors
    takaito
    Description

    Dataset

    This dataset was created by takaito

    Contents

  3. tool-use-finetuning

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    Shawhin Talebi (2025). tool-use-finetuning [Dataset]. https://huggingface.co/datasets/shawhin/tool-use-finetuning
    Explore at:
    Dataset updated
    Jul 27, 2025
    Authors
    Shawhin Talebi
    Description

    Dataset for fine-tuning gemma-3-1b-it for function calling. The code and other resources for this project are linked below. Resources:

    YouTube Video Blog Post GitHub Repo Fine-tuned Model | Original Model

      Citation
    

    If you find this dataset helpful, please cite: @dataset{talebi2025, author = {Shaw Talebi}, title = {tool-use-finetuning}, year = {2025}, publisher = {Hugging Face}, howpublished =… See the full description on the dataset page: https://huggingface.co/datasets/shawhin/tool-use-finetuning.

  4. tool_finetuning_dataset

    • huggingface.co
    Updated May 18, 2025
    Cite
    Adam (2025). tool_finetuning_dataset [Dataset]. https://huggingface.co/datasets/asanchez75/tool_finetuning_dataset
    Explore at:
    Dataset updated
    May 18, 2025
    Authors
    Adam
    License

    https://choosealicense.com/licenses/other/

    Description

    Tool Finetuning Dataset

      Dataset Description

      Dataset Summary

    This dataset is designed for fine-tuning language models to use tools (function calling) appropriately based on user queries. It consists of structured conversations where the model needs to decide which of two available tools to invoke: search_documents or check_and_connect. The dataset combines:

    • Adapted natural questions that should trigger the search_documents tool
    • System status queries that should… See the full description on the dataset page: https://huggingface.co/datasets/asanchez75/tool_finetuning_dataset.
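
    A minimal loading sketch, assuming the datasets library is installed and the dataset exposes a standard train split (nothing beyond the dataset ID above is confirmed by the card):

    from datasets import load_dataset

    ds = load_dataset("asanchez75/tool_finetuning_dataset", split="train")
    print(ds)     # columns and number of rows
    print(ds[0])  # one structured conversation that should resolve to either
                  # search_documents or check_and_connect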

  5. Alpaca Cleaned

    • kaggle.com
    zip
    Updated Nov 26, 2023
    Cite
    The Devastator (2023). Alpaca Cleaned [Dataset]. https://www.kaggle.com/datasets/thedevastator/alpaca-language-instruction-training/code
    Explore at:
    zip (14548320 bytes). Available download formats
    Dataset updated
    Nov 26, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Alpaca Cleaned

    Improving Pretrained Language Model Understanding

    By Huggingface Hub [source]

    About this dataset

    Alpaca is a dataset for fine-tuning language models to better understand and follow instructions. This curated, cleaned dataset provides over 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine, all in English (BCP-47 en). Each record is split into instruction, input, and output fields. The data has been cleaned to remove errors and noise found in the original release, so it can be used for instruction tuning with more confidence.


    How to use the dataset

    This dataset provides a unique and valuable resource for anyone who wishes to create, develop and train language models. Alpaca provides users with 52,000 instruction-demonstration pairs generated by OpenAI's text-davinci-003 engine.

    The data included in this dataset is formatted into 3 columns: “instruction”, “output” and “input.” All the data is written in English (BCP-47 en).

    To make the most out of this dataset it is recommended to:

    • Familiarize yourself with the instructions in the instruction column, as these provide guidance on how to use the other two columns, input and output.

    • Once comfortable with the instruction column, explore the triplets – instruction, output and input – included in this cleaned version of Alpaca.

    • Read through many examples, noting any areas that could be clarified or improved, and bear in mind that these examples have already been cleaned of errors found in the original dataset.

    • Get inspired! There are more than 52k examples, which gives plenty of flexibility for varying training strategies or unique approaches when creating your own language model.

    • Finally, while not essential, it may be helpful to be familiar with OpenAI's text-davinci engine and to experiment with different parameters/options depending on what type of outcome you wish to achieve.

    Research Ideas

    • Developing natural language processing (NLP) tasks that aim to better automate and interpret instructions given by humans.
    • Training machine learning models of robotic agents to be able to understand natural language commands, as well as understand the correct action that needs to be taken in response.
    • Creating a system that can generate personalized instructions and feedback in real time based on language models, catering specifically to each individual user's preferences or needs

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | instruction | This column contains the instructions for the language model. (Text) |
    | output      | This column contains the expected output from the language model. (Text) |
    | input       | This column contains the input given to the language model. (Text) |
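
    A small usage sketch, assuming train.csv has been downloaded from the Kaggle page and pandas is installed; the prompt template in to_prompt is illustrative, not part of the dataset:

    import pandas as pd

    df = pd.read_csv("train.csv")   # columns: instruction, output, input
    print(df[["instruction", "input", "output"]].head())

    def to_prompt(row):
        """Join an Alpaca-style triplet into a single training prompt."""
        if isinstance(row["input"], str) and row["input"].strip():
            return f"Instruction: {row['instruction']}\nInput: {row['input']}\nResponse: {row['output']}"
        return f"Instruction: {row['instruction']}\nResponse: {row['output']}"

    prompts = df.apply(to_prompt, axis=1)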

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  6. Training Data for Multiability Chabot

    • kaggle.com
    Updated Mar 2, 2025
    Cite
    Ankit Dutta (2025). Training Data for Multiability Chabot [Dataset]. https://www.kaggle.com/datasets/ankitd7752/training-data-for-multiability-chabot
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ankit Dutta
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset contains prompts and corresponding responses across four domains, including healthcare, telecom, and banking. It can be used to fine-tune models such as Llama, Phi, and Gemma; after fine-tuning, a model should be able to answer questions from these domains well.

    A notebook is attached to this dataset in which it is used to train the Phi-3.5-mini-instruct model; it can be used as a reference for training your own model.

    If you find any errors or see room for improvement, let us know in the Discussions.

  7. saferdecoding-fine-tuning

    • huggingface.co
    Updated Dec 19, 2024
    Cite
    Anders Spear (2024). saferdecoding-fine-tuning [Dataset]. https://huggingface.co/datasets/aspear/saferdecoding-fine-tuning
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 19, 2024
    Authors
    Anders Spear
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for SaferDecoding Fine Tuning Dataset

    This dataset is intended for fine-tuning models to defend against jailbreak attacks. It is an extension of SafeDecoding.

      Dataset Details

      Dataset Description

    The dataset generation process was adapted from SafeDecoding. This dataset includes 252 original human-generated adversarial seed prompts, covering 18 harmful categories. This dataset includes responses generated by Llama2, Vicuna, Dolphin, Falcon… See the full description on the dataset page: https://huggingface.co/datasets/aspear/saferdecoding-fine-tuning.

  8. llama2-sst2-fine-tuning

    • huggingface.co
    Updated Aug 2, 2023
    Cite
    Yifei (2023). llama2-sst2-fine-tuning [Dataset]. https://huggingface.co/datasets/OneFly7/llama2-sst2-fine-tuning
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2023
    Authors
    Yifei
    Description

    Dataset Card for "llama2-sst2-finetuning"

      Dataset Description
    

    The llama2-sst2-fine-tuning dataset is designed for supervised fine-tuning of LLaMA 2 on the GLUE SST-2 sentiment classification task. We provide two subsets: training and validation. To ensure the effectiveness of fine-tuning, we convert the data into the prompt template for LLaMA 2 supervised fine-tuning, where the data follows this format:
    [INST] <
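
    The template is truncated on the dataset page. As an assumption only, the commonly used LLaMA 2 [INST]/<<SYS>> format would look roughly like the sketch below; the system text and label wording are illustrative and not taken from this dataset:

    def build_prompt(sentence, label=None):
        # Hedged sketch of a LLaMA-2-style instruction prompt for SST-2; the exact
        # system prompt and label strings of this dataset are not shown on the card.
        system = "Classify the sentiment of the sentence as positive or negative."
        prompt = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{sentence} [/INST]"
        if label is not None:            # append the gold label for supervised fine-tuning
            prompt += f" {label}"
        return prompt

    print(build_prompt("a charming and often affecting journey", "positive"))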

  9. Amharic Transformer pre-train and Fine-tuning data

    • kaggle.com
    zip
    Updated Apr 2, 2025
    Cite
    Mahder Tesfaye Abebe (2025). Amharic Transformer pre-train and Fine-tuning data [Dataset]. https://www.kaggle.com/datasets/mahdertesfayeabebe/amharic-transformer-pre-train-and-fine-tuning-data
    Explore at:
    zip (117690980 bytes). Available download formats
    Dataset updated
    Apr 2, 2025
    Authors
    Mahder Tesfaye Abebe
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset is used to pre-train and fine-tune a transformer network. For details about how the data was collected and used, visit the GitHub repo. All of the data was collected by me except 'original_hate_speech_data'. See the GitHub repo for more detail!

  10. AI-MATH-LLM-Package

    • kaggle.com
    zip
    Updated Jun 20, 2024
    Cite
    Johnson chong (2024). AI-MATH-LLM-Package [Dataset]. https://www.kaggle.com/datasets/johnsonhk88/ai-math-llm-package
    Explore at:
    zip (3330554065 bytes). Available download formats
    Dataset updated
    Jun 20, 2024
    Authors
    Johnson chong
    Description

    This is an install package for LLM RAG and fine-tuning, bundling essential libraries (such as huggingface_hub, transformers, langchain, evaluate, sentence-transformers, etc.). It is suitable for offline Kaggle competition requirements; the wheels were downloaded from the Kaggle development environment.

    Supported packages: transformers, datasets, accelerate, bitsandbytes, langchain, langchain-community, sentence-transformers, chromadb, faiss-cpu, huggingface_hub, langchain-text-splitters, peft, trl, umap-learn, evaluate, deepeval, weave

    Suggested install commands in Kaggle:

    !pip install transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/tranformers
    !pip install -U datasets --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/datasets
    !pip install -U accelerate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/accelerate
    !pip install build --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/build-1.2.1-py3-none-any.whl
    !pip install -U bitsandbytes --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl
    !pip install langchain --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain-0.2.5-py3-none-any.whl
    !pip install langchain-core --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_core-0.2.9-py3-none-any.whl
    !pip install langsmith --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langsmith-0.1.81-py3-none-any.whl
    !pip install langchain-community --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_community-0.2.5-py3-none-any.whl
    !pip install sentence-transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/sentence_transformers-3.0.1-py3-none-any.whl
    !pip install chromadb --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/chromadb-0.5.3-py3-none-any.whl
    !pip install faiss-cpu --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
    !pip install -U huggingface_hub --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/huggingface_hub
    !pip install -qU langchain-text-splitters --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_text_splitters-0.2.1-py3-none-any.whl
    !pip install -U peft --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/peft-0.11.1-py3-none-any.whl
    !pip install -U trl --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/trl-0.9.4-py3-none-any.whl
    !pip install umap-learn --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/umap-learn
    !pip install evaluate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/evaluate-0.4.2-py3-none-any.whl
    !pip install deepeval --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/deepeval-0.21.59-py3-none-any.whl
    !pip install weave --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/weave-0.50.2-py3-none-any.whl

  11. investopedia-instruction-tuning-dataset

    • huggingface.co
    Updated Jul 26, 2023
    + more versions
    Cite
    FinLang (2023). investopedia-instruction-tuning-dataset [Dataset]. https://huggingface.co/datasets/FinLang/investopedia-instruction-tuning-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 26, 2023
    Dataset authored and provided by
    FinLang
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for investopedia-instruction-tuning dataset

    We curate a substantial finance dataset from Investopedia using a new technique that leverages unstructured scraped data and an LLM to generate structured data suitable for fine-tuning embedding models. The dataset generation uses a new self-verification method that ensures, with high probability, that the generated question-answer pairs are not hallucinated by the LLM.

      Dataset… See the full description on the dataset page: https://huggingface.co/datasets/FinLang/investopedia-instruction-tuning-dataset.
    
  12. Webpage Element Detection - COCO JSON

    • kaggle.com
    zip
    Updated Apr 20, 2024
    Cite
    DesolationOfSmaug (2024). Webpage Element Detection - COCO JSON [Dataset]. https://www.kaggle.com/datasets/desolationofsmaug/webpage-element-detection
    Explore at:
    zip (55308148 bytes). Available download formats
    Dataset updated
    Apr 20, 2024
    Authors
    DesolationOfSmaug
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by DesolationOfSmaug

    Released under MIT

    Contents

  13. KaLM-embedding-finetuning-data

    • huggingface.co
    Updated Oct 8, 2025
    Cite
    KaLM-Embedding (2025). KaLM-embedding-finetuning-data [Dataset]. https://huggingface.co/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data
    Explore at:
    Dataset updated
    Oct 8, 2025
    Dataset authored and provided by
    KaLM-Embedding
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The pretraining dataset is available at this link: HIT-TMG/KaLM-embedding-pretrain-data.

      Languages
    

    English, Chinese, Multilingual

      Dataset Structure
    

    Each sample in these datasets is in the following format:

    • query: string, one query per sample
    • pos: list[string], usually containing one positive example
    • neg: list[string], usually containing seven negative examples
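
    For illustration, one sample in this format might look like the following (the strings are made up; only the field names follow the card):

    sample = {
        "query": "how to renew a passport online",
        "pos": ["You can renew a passport online through the official portal ..."],
        "neg": [
            "Passports were first introduced in ...",       # unrelated or hard-negative passages
            "The renewal fee for a driving licence is ...",
            # ... usually seven negatives in total
        ],
    }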

      Dataset Summary
    

    All these datasets have been preprocessed and can be used for finetuning your embedding models.… See the full description on the dataset page: https://huggingface.co/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data.

  14. Grammar Correction Dataset for Fine-Tuning

    • kaggle.com
    zip
    Updated Jan 28, 2025
    Cite
    Nezahat Korkmaz (2025). Grammar Correction Dataset for Fine-Tuning [Dataset]. https://www.kaggle.com/datasets/nezahatkk/grammar-correction-dataset-for-fine-tuning
    Explore at:
    zip (180270 bytes). Available download formats
    Dataset updated
    Jan 28, 2025
    Authors
    Nezahat Korkmaz
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    library_name: transformers

    tags: [fine-tuning, custom-dataset, educational-use, NLP, transformers]

    Model Card for Fine-Tuned Transformers Model

    Model Details

    Model Description

    This model was fine-tuned as part of an artificial intelligence course at Gazi University in Ankara using a custom dataset created by the students and instructors. The model is optimized for a specific task, such as sentiment analysis or text classification, in the Turkish language.

    • Developed by: Gazi University AI Course Team
    • Funded by [optional]: Gazi University
    • Shared by [optional]: Faculty Members and Students
    • Model type: Transformers-based language model (e.g., BERT or GPT)
    • Language(s) (NLP): Turkish
    • License: [CC BY-SA 4.0 or other appropriate license]
    • Finetuned from model [optional]: bert-base-turkish-cased (example)

    Model Sources [optional]

    Uses

    Direct Use

    The model can be directly used for tasks such as text classification, sentiment analysis, or other natural language processing tasks in Turkish.

    Downstream Use [optional]

    The model can be integrated into larger ecosystems or more complex projects.

    Out-of-Scope Use

    The model should not be used for unethical or malicious purposes. Additionally, it may have limited performance for multilingual tasks.

    Bias, Risks, and Limitations

    This model may inherit biases present in the training dataset. It is designed for Turkish, and performance may degrade for other languages or domains outside its training data.

    Recommendations

    Users are advised to be aware of the model's limitations due to its training dataset and validate its results for their specific use case.

    How to Get Started with the Model

    You can use the following code snippet to load and test the model:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    # Load the fine-tuned model and its tokenizer from the Hugging Face Hub
    model_name = "gazi-university/fine-tuned-turkish-model"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Example input (replace with Turkish text, since the model targets Turkish)
    text = "This AI model works perfectly!"
    inputs = tokenizer(text, return_tensors="pt")

    # Run inference without gradient tracking and read off the predicted class
    with torch.no_grad():
        outputs = model(**inputs)
    predicted_class = outputs.logits.argmax(dim=-1).item()
    print(predicted_class)
    
  15. LLM Text Generation Dataset

    • kaggle.com
    zip
    Updated Jun 10, 2025
    Cite
    Unidata (2025). LLM Text Generation Dataset [Dataset]. https://www.kaggle.com/datasets/unidpro/llm-training-dataset/discussion
    Explore at:
    zip (543652 bytes). Available download formats
    Dataset updated
    Jun 10, 2025
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    LLM Fine-Tuning Dataset - Question Answering

    The dataset contains more than 4 million logs written in 32 languages and is tailored for LLM training. It includes log and response pairs from 3 models, and is designed for instruction fine-tuning of language models to achieve improved performance in various NLP tasks.

    Models used for text generation:

    • GPT-3.5
    • GPT-4
    • Uncensored GPT Version (not included in the sample)

    Languages in the dataset:

    Ukrainian, Turkish, Thai, Swedish, Slovak, Portuguese (Brazil), Portuguese, Polish, Persian, Dutch, Marathi, Malayalam, Korean, Japanese, Italian, Indonesian, Hungarian, Hindi, Irish, Greek, German, French, Finnish, Esperanto, English, Danish, Czech, Chinese, Catalan, Azerbaijani, Arabic


    The dataset features a comprehensive training corpus with prompts and answers, suitable for generating text, question answering, and text classification. It enhances pre-trained LLMs, making it valuable for specific tasks, specific needs, and various generation tasks in the realm of language processing

    💵 Buy the Dataset: This is a limited preview of the data. To access the full dataset, please contact us at https://unidata.pro to discuss your requirements and pricing options.

    Content

    The dataset has the following columns:
    • language: language the prompt is made in
    • model: type of the model (GPT-3.5, GPT-4 and Uncensored GPT Version)
    • time: time when the answer was generated
    • text: user's prompt
    • response: response generated by the model
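
    A quick inspection sketch, assuming the preview file from this Kaggle dataset has been downloaded and pandas is installed (the filename is hypothetical; only the column names follow the description above):

    import pandas as pd

    df = pd.read_csv("llm_text_generation_sample.csv")
    print(df.columns.tolist())                                 # language, model, time, text, response
    print(df.groupby(["language", "model"]).size().head(10))   # rough coverage per language and model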

    The text corpus supports instruction tuning and supervised fine-tuning for larger language models, enhancing text generation and human language understanding. With a focus on generating human-like content, it is useful for evaluating LLMs, improving generation capabilities, and performing well in classification tasks. This dataset also assists in mitigating biases, supporting longer texts, and optimizing LLM architectures for more effective language processing and language understanding.

    🌐 UniData provides high-quality datasets, content moderation, data collection and annotation for your AI/ML projects

  16. llama-2-banking-fine-tune

    • huggingface.co
    Updated Jul 28, 2023
    + more versions
    Cite
    Argilla (2023). llama-2-banking-fine-tune [Dataset]. https://huggingface.co/datasets/argilla/llama-2-banking-fine-tune
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 28, 2023
    Dataset authored and provided by
    Argilla
    Description

    Dataset Card for llama-2-banking-fine-tune

    This dataset has been created with Argilla. As shown in the sections below, this dataset can be loaded into Argilla as explained in Load with Argilla, or used directly with the datasets library in Load with datasets.

      Dataset Summary
    

    This dataset contains:

    A dataset configuration file conforming to the Argilla dataset format named argilla.yaml. This configuration file will be used to configure the dataset when using the… See the full description on the dataset page: https://huggingface.co/datasets/argilla/llama-2-banking-fine-tune.
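
    A minimal sketch of the plain datasets route mentioned on the card (the split name is an assumption; Argilla-exported datasets usually expose a train split):

    from datasets import load_dataset

    ds = load_dataset("argilla/llama-2-banking-fine-tune", split="train")
    print(ds)
    print(ds[0])   # one banking prompt with its annotation/suggestion fields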

  17. LLM-whls-fine-tuning

    • kaggle.com
    zip
    Updated May 23, 2024
    Cite
    Haofeng Xu (2024). LLM-whls-fine-tuning [Dataset]. https://www.kaggle.com/datasets/haofengxace/llm-whls-fine-tuning
    Explore at:
    zip (113058238 bytes). Available download formats
    Dataset updated
    May 23, 2024
    Authors
    Haofeng Xu
    Description

    Here are almost all the packages you may need for LLM fine-tuning. If you find this helpful, PLEASE UPVOTE!

  18. Data from: AstroChat

    • kaggle.com
    • huggingface.co
    zip
    Updated Jun 9, 2024
    Cite
    astro_pat (2024). AstroChat [Dataset]. https://www.kaggle.com/datasets/patrickfleith/astrochat
    Explore at:
    zip (1214166 bytes). Available download formats
    Dataset updated
    Jun 9, 2024
    Authors
    astro_pat
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose and Scope

    The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.

    Intended Use

    The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of Science, Technology, Engineering and Math (STEM).

    Quickstart

    To be completed

    DATASET DESCRIPTION

    Access

    Structure

    901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):
    • id: a unique identifier for this specific conversation. Useful for traceability, especially for further processing or merging with other datasets.
    • topic: a topic within the domain of Astronautics / Space Mission Engineering. Useful for filtering the dataset by topic or creating a topic-based split.
    • subtopic: a subtopic of the topic. For instance, within Propulsion there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
    • persona: description of the persona used to simulate a user.
    • opening_question: the first question asked by the user to start a conversation with the AI assistant.
    • messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library. A list of messages where each message is a dictionary with the following fields:
      - role: the role of the speaker, either user or assistant
      - content: the message content. For the assistant, it is the answer to the user's question; for the user, it is the question asked to the assistant.

    Important: See the full list of topics and subtopics covered below. A hedged loading sketch follows.
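
    A short usage sketch, assuming the Hugging Face copy of the dataset (ID taken from the version-control link below) and a tokenizer that ships a chat template; the model name is purely illustrative:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    ds = load_dataset("patrickfleith/AstroChat", split="train")
    tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

    # Render one conversation into a single training string via the tokenizer's chat template
    text = tokenizer.apply_chat_template(ds[0]["messages"], tokenize=False)
    print(text[:500])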

    Metadata

    Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main

    Generation Method

    We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from Sector I: Questions about the World of their paper:

    Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.

    Step-by-step description

    • Defined a set of user personas
    • Defined a set of topics/disciplines within the domain of Astronautics / Space Mission Engineering
    • For each topic, we defined a set of subtopics to narrow the conversation down to more specific and niche exchanges (see below the full list)
    • For each subtopic, we generated a set of opening questions that the user could ask to start a conversation (see below the full list)
    • We then distilled the knowledge of a strong chat model (in our case, ChatGPT via the API with the gpt-4-turbo model) to generate answers to the opening questions
    • We simulated follow-up questions from the user to the assistant, and the assistant's answers to these questions, which builds up the messages.

    Future work and contributions appreciated

    • Distil knowledge from more models (Anthropic, Mixtral, GPT-4o, etc.)
    • Implement more creativity in the opening questions and follow-up questions
    • Filter out questions and conversations that are too similar
    • Ask topic and subtopic experts to validate the generated conversations to get a sense of how reliable the overall dataset is

    Languages

    All instances in the dataset are in English.

    Size

    901 synthetically generated dialogues

    USAGE AND GUIDELINES

    License

    AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International

    Restrictions

    No restriction. Please provide the correct attribution following the license terms.

    Citation

    Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579

    Update Frequency

    Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)

    Have a feedback or spot an error?

    Use the ...

  19. zulu-finetuning-dataset

    • huggingface.co
    Updated Aug 29, 2024
    Cite
    Hector Motsepe (2024). zulu-finetuning-dataset [Dataset]. https://huggingface.co/datasets/ChallengerSpaceShuttle/zulu-finetuning-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 29, 2024
    Authors
    Hector Motsepe
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for Alpaca-Cleaned

      Dataset Description
    

    This is an IsiZulu-translated version of the original Alpaca Dataset released by Stanford, Cosmopedia by HuggingFace, and WikiHow (Mahnaz et al., 2018).

      Original Alpaca Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow… See the full description on the dataset page: https://huggingface.co/datasets/ChallengerSpaceShuttle/zulu-finetuning-dataset.

  20. daigtdataandcode

    • kaggle.com
    zip
    Updated Feb 5, 2024
    Cite
    Guanshuo Xu (2024). daigtdataandcode [Dataset]. https://www.kaggle.com/datasets/wowfattie/daigtpretraindata
    Explore at:
    zip (2253692051 bytes). Available download formats
    Dataset updated
    Feb 5, 2024
    Authors
    Guanshuo Xu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Solution writeup: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/470395

    For training only: train_neg_list.pickle and train_pos_list.pickle are around 500,000 pairs for pretraining classifiers. train_df.csv is for last step finetuning. train_v4_drcat_01.csv can be downloaded from https://www.kaggle.com/datasets/thedrcat/daigt-v4-train-dataset

    17/, 19/, and 20/ in the code/ folder are for classifier pretraining; you need to run them first. The _ft1/ folders are for finetuning on train_v4_drcat_01.csv, and the _ft103/ folders are for finetuning on train_df.csv. Please correct the input dirs in all the folders.

    Inference kernel: https://www.kaggle.com/code/wowfattie/daigt-2nd-place

    In case you are interested in how train_neg_list.pickle and train_pos_list.pickle were generated, everything is in the gaigtdatagenerationforpretrain/ folder. Perform the following steps:
    1) Download the SlimPajama dataset.
    2) Run preprocess_external_chunk1-10.py for file selection and random chunking. Only the C4 subset was used, and only files with word length > 2048, to make sure the LLMs have 1024 tokens as prompt and generate the next 1024 tokens.
    3) Run the python files in every folder named after an LLM. Note that some files may error because I forgot to add padding; I was able to run only roughly 90% of those files.
    4) Run split1.py for assembling.

    If you are interested in how train_df.csv was generated, go to daigtdatagenerationforfinetune/ and perform the following steps:
    1) Install h2o-llmstudio.
    2) Run prepare_data_5_promts.py to generate the input file for finetuning. Only essays for the 5 prompts in the test set were included.
    3) Perform finetuning. The config files are in the folders inside llmstudio_configs/; those folders are named after the LLMs used.
    4) Run all the python files with LLM names.
    5) Run prepare_train_data.py for assembling.
