100+ datasets found
  1. 📊 Best Open Source LLM Starter Pack 🧙🚀

    • kaggle.com
    Updated Aug 17, 2023
    Cite
    Radek Osmulski (2023). 📊 Best Open Source LLM Starter Pack 🧙🚀 [Dataset]. https://www.kaggle.com/datasets/radek1/best-llm-starter-pack
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 17, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Radek Osmulski
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains a couple of great open source models!

    • version 2 -- the best open source LLM at the time of writing (NousResearch/Nous-Hermes-Llama2-13b) that we can load on Kaggle! We didn't manage to load anything larger than 13B.
    • version 14 -- loading models using a new library, curated-transformers, which should allow for easier modifications of the underlying architectures.

    This dataset also includes all the dependencies we need to load the model in 8-bit, if that is what you would like to do (updated versions of transformers, accelerate, etc.).

    I show how to load and run Nous-Hermes-Llama2-13b in the following notebook:

    👉 💡 Best Open Source LLM Starter Pack 🧪🚀
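    For reference, here is a minimal sketch of loading the model in 8-bit with the transformers and bitsandbytes libraries (the prompt and generation settings are illustrative, not part of this dataset):

      from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

      model_id = "NousResearch/Nous-Hermes-Llama2-13b"

      # Load the tokenizer and the model quantized to 8-bit (requires bitsandbytes).
      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          quantization_config=BitsAndBytesConfig(load_in_8bit=True),
          device_map="auto",  # spread layers across the available GPU(s)
      )

      # Nous-Hermes models follow an Alpaca-style prompt format.
      prompt = "### Instruction:\nExplain what a dataset is in one sentence.\n\n### Response:\n"
      inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
      output = model.generate(**inputs, max_new_tokens=64)
      print(tokenizer.decode(output[0], skip_special_tokens=True))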

    If you find this dataset helpful, please leave an upvote! 🙂 Thank you! 🙏

  2. Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...

    • data.niaid.nih.gov
    Updated Dec 21, 2023
    Cite
    Cizinsky, Ludek (2023). Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10413067
    Explore at:
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Nutter, Peter
    Cizinsky, Ludek
    Senghaas, Mika
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

    This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

    Key Features:

    LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.

    Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43%, evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.

    Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.

    Dataset Composition:

    curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot

    curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot
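    As an illustration of the zero-shot labeling setup behind curlie-gpt4-10k, a call might look like the sketch below (the prompt wording and label subset here are assumptions; the actual templates live in the project repository):

      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      # Illustrative subset of the 14 Curlie top-level topics used by Homepage2Vec.
      CATEGORIES = ["Arts", "Business", "Computers", "Health", "News"]

      def label_website(url: str, homepage_text: str) -> str:
          """Ask the model for multi-label topic annotations for one homepage."""
          response = client.chat.completions.create(
              model="gpt-4",
              messages=[{
                  "role": "user",
                  "content": (
                      f"Classify the website {url} into one or more of these topics: "
                      f"{', '.join(CATEGORIES)}.\n\nHomepage text:\n{homepage_text}\n\n"
                      "Answer with a comma-separated list of topics."
                  ),
              }],
          )
          return response.choices[0].message.content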

    Intended Use:

    Fine-tuning and advancing Homepage2Vec or similar website classification models

    Research on LLM-generated datasets for text classification tasks

    Exploration of multilingual website classification

    Additional Information:

    Project and report repository: https://github.com/CS-433/ml-project-2-mlp

    Acknowledgments:

    This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.

  3. Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

    • datarade.ai
    .json, .csv
    Cite
    Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
    Explore at:
    Available download formats: .json, .csv
    Dataset provided by
    Xverum LLC
    Authors
    Xverum
    Area covered
    Norway, Dominican Republic, Sint Maarten (Dutch part), Cook Islands, Barbados, Western Sahara, Oman, Jordan, United Kingdom, India
    Description

    Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

    What Makes Our Data Unique?

    Scale and Coverage:
    • A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies.
    • Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

    Rich Attributes for Training Models:
    • Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights.
    • Tailored for training models in NLP, recommendation systems, and predictive algorithms.

    Compliance and Quality:
    • Fully GDPR and CCPA compliant, providing secure and ethically sourced data.
    • Extensive data cleaning and validation processes ensure reliability and accuracy.

    Annotation-Ready:
    • Pre-structured and formatted datasets that are easily ingestible into AI workflows (see the sketch below).
    • Ideal for supervised learning with tagging options such as entities, sentiment, or categories.
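    As a sketch of that ingestion step (the file and field names below are hypothetical, not Xverum's actual schema):

      import pandas as pd

      # Load a delivered JSON Lines export; names here are illustrative only.
      profiles = pd.read_json("xverum_b2b_profiles.json", lines=True)

      # Assemble free-text fields into documents for an NLP training corpus.
      profiles["text"] = (
          profiles["company_description"].fillna("") + " " + profiles["job_title"].fillna("")
      )
      profiles[["text"]].to_csv("training_corpus.csv", index=False)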

    How Is the Data Sourced?
    • Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques.
    • Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets.

    This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

    Primary Use Cases and Verticals

    Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

    Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

    B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

    HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

    How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

    Why Choose Xverum?
    • Experience and Expertise: A trusted name in structured web data with a proven track record.
    • Flexibility: Datasets can be tailored for any AI/ML application.
    • Scalability: With 800M profiles and more being added, you'll always have access to fresh, up-to-date data.
    • Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

    Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

    Contact us for sample datasets or to discuss your specific needs.

  4. LLM: 7 prompt training dataset

    • kaggle.com
    Updated Nov 15, 2023
    Cite
    Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    https://cdla.io/sharing-1-0/

    Description
    • Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
      File: train_essays_RDizzl3_seven_v2.csv
      Human texts: 14,247; LLM texts: 3,004

      See also: a new dataset of an additional 4,900 LLM-generated texts: LLM: Mistral-7B Instruct texts



    • Version 3: "The RDizzl3 Seven"
      File: train_essays_RDizzl3_seven_v1.csv

    • "Car-free cities"

    • "Does the electoral college work?"

    • "Exploring Venus"

    • "The Face on Mars"

    • "Facial action coding system"

    • "A Cowboy Who Rode the Waves"

    • "Driverless cars"

    How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

    • Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1,638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

    Namely:

    • "Car-free cities"
    • "Does the electoral college work?"
    • "Exploring Venus"
    • "The Face on Mars"
    • "Facial action coding system"
    • "Seeking multiple opinions"
    • "Phones and driving"

    This dataset is a derivative of the datasets

    as well as the original competition training dataset

    • Version 1: This dataset is composed of 13,712 human texts and 1,165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
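    A minimal sketch of inspecting the label balance in one of these files (the column names are assumptions; check the CSV header first):

      import pandas as pd

      df = pd.read_csv("train_essays_RDizzl3_seven_v2.csv")

      # Hypothetical schema: a "text" column and a binary label column
      # distinguishing human from LLM texts. Verify against the real header.
      print(df.columns.tolist())
      print(df["generated"].value_counts())  # expect roughly 14,247 human vs 3,004 LLM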
  5. llm_robot

    • huggingface.co
    Updated May 5, 2024
    Cite
    Aryaduta (2024). llm_robot [Dataset]. https://huggingface.co/datasets/Aryaduta/llm_robot
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 5, 2024
    Authors
    Aryaduta
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Robotic Plan Generation

    This dataset is for training LLMs for robotic plan generation.

      Dataset Details

      Dataset Description
    The aim is to provide a dataset that contains context (in this case, an arm robot is used as the example, and two objects are manipulated) and a user goal. The output should be a JSON string containing the high-level functions that will be executed by the robot.
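    A hypothetical example of such a record (the schema below is an assumption for illustration; see the dataset page for the documented structure):

      import json

      # Context + user goal in, JSON plan of high-level robot functions out.
      example = {
          "context": "Arm robot with a gripper; objects on the table: red_cube, blue_ball.",
          "goal": "Stack the red cube on top of the blue ball.",
          "output": json.dumps([
              {"function": "pick", "args": {"object": "red_cube"}},
              {"function": "place_on", "args": {"target": "blue_ball"}},
          ]),
      }
      print(example["output"])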

      Dataset Structure

      Data Instances

    A JSON-formatted example… See the full description on the dataset page: https://huggingface.co/datasets/Aryaduta/llm_robot.

  6. Data from: TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability...

    • datasetcatalog.nlm.nih.gov
    • borealisdata.ca
    Updated Jul 30, 2024
    Cite
    Khatun, Aisha; Brown, Dan (2024). TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability [Dataset]. http://doi.org/10.5683/SP3/5MZWBV
    Explore at:
    Dataset updated
    Jul 30, 2024
    Authors
    Khatun, Aisha; Brown, Dan
    Description

    Large Language Model (LLM) evaluation is currently one of the most important areas of research, with existing benchmarks proving to be insufficient and not completely representative of LLMs' various capabilities. We present a curated collection of challenging statements on sensitive topics for LLM benchmarking called TruthEval. These statements were curated by hand and contain known truth values. The categories were chosen to distinguish LLMs' abilities from their stochastic nature. Details of collection method and use cases can be found in this paper: TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability

  7. FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM)...

    • datarade.ai
    Updated Jun 28, 2024
    Cite
    FileMarket (2024). FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM) Data | Machine Learning (ML) Data | Deep Learning (DL) Data | [Dataset]. https://datarade.ai/data-products/filemarket-ai-training-data-large-language-model-llm-data-filemarket
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jun 28, 2024
    Dataset authored and provided by
    FileMarket
    Area covered
    Papua New Guinea, Antigua and Barbuda, Benin, Saudi Arabia, Saint Kitts and Nevis, French Southern Territories, Brazil, Central African Republic, Colombia, China
    Description

    FileMarket provides premium Large Language Model (LLM) Data designed to support and enhance a wide range of AI applications. Our globally sourced LLM Data sets are meticulously curated to ensure high quality, diversity, and accuracy, making them ideal for training robust and reliable language models. In addition to LLM Data, we also offer comprehensive datasets across Object Detection Data, Machine Learning (ML) Data, Deep Learning (DL) Data, and Biometric Data. Each dataset is carefully crafted to meet the specific needs of cutting-edge AI and machine learning projects.

    Key use cases of our Large Language Model (LLM) Data:

    • Text generation
    • Chatbots and virtual assistants
    • Machine translation
    • Sentiment analysis
    • Speech recognition
    • Content summarization

    Why choose FileMarket's data:

    • Object Detection Data: Essential for training AI in image and video analysis.
    • Machine Learning (ML) Data: Ideal for a broad spectrum of applications, from predictive analysis to NLP.
    • Deep Learning (DL) Data: Designed to support complex neural networks and deep learning models.
    • Biometric Data: Specialized for facial recognition, fingerprint analysis, and other biometric applications.

    FileMarket's premier sources for top-tier Large Language Model (LLM) Data and other specialized datasets ensure your AI projects drive innovation and achieve success across various applications.

  8. EverythingLM-data-V2

    • huggingface.co
    Updated Aug 19, 2023
    Cite
    Kai Howard (2023). EverythingLM-data-V2 [Dataset]. https://huggingface.co/datasets/totally-not-an-llm/EverythingLM-data-V2
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 19, 2023
    Authors
    Kai Howard
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    EverythingLM V2 Dataset

    EverythingLM V2 is a diverse instruct dataset consisting of 1k human-assistant conversations. These conversations were generated using principles from both evol-instruct and Orca. The dataset encompasses a wide array of topics and interactions.

      Differences from V1:

    • All data in V2 is generated by GPT-4
    • Higher-quality dataset generation pipeline:
      • More humanlike seed prompts
      • Fixed some bugs in the script
      • More diverse creative writing
      • More diverse seed prompts… See the full description on the dataset page: https://huggingface.co/datasets/totally-not-an-llm/EverythingLM-data-V2.

  9. NeurIPS-LLM-data

    • huggingface.co
    Updated Mar 4, 2024
    Cite
    Upaya (2024). NeurIPS-LLM-data [Dataset]. https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 4, 2024
    Dataset authored and provided by
    Upaya
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🤖 We curated this dataset for NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1GPU + 1Day. 🚀 Our Birbal-7B-V1 fine-tuned on this dataset achieved 🏆 first rank 🏆 in the competition.

    Here is a high-level diagram of our data preparation strategy:

      Natural Instructions Dataset Preparation

    The Natural Instructions dataset is a community effort to create a large collection of tasks and their natural language definitions/instructions. As shown in the above diagram, we sample from… See the full description on the dataset page: https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data.

  10. Dataset for fine-tuning LLM to generate MiniZinc

    • kaggle.com
    Updated Mar 29, 2025
    Cite
    Roberto Penco (2025). Dataset for fine-tuning LLM to generate MiniZinc [Dataset]. http://doi.org/10.34740/kaggle/dsv/11207997
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 29, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Roberto Penco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is used to fine-tune an LLM so that it generates valid and correct MiniZinc code from a natural language description with higher probability. The dataset has 50 entries and 8 columns. It is one large dataset that can be split into smaller datasets used to fine-tune multiple LLMs that each have their own role.
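    An invented example of the kind of entry such a dataset might contain, pairing a natural language description with MiniZinc code (the field names are assumptions, not the actual 8-column layout):

      # Hypothetical fine-tuning pair; the dataset's real columns may differ.
      pair = {
          "description": "Find two integers x and y between 1 and 10 whose sum is 12.",
          "minizinc": (
              "var 1..10: x;\n"
              "var 1..10: y;\n"
              "constraint x + y = 12;\n"
              "solve satisfy;"
          ),
      }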

  11. Article Dataset (Mini)

    • kaggle.com
    Updated Oct 18, 2024
    Cite
    Sani Kamal (2024). Article Dataset (Mini) [Dataset]. https://www.kaggle.com/datasets/sanikamal/article-50
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sani Kamal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    This dataset contains 50 articles sourced from Medium, focusing on AI-related content. It is designed for business owners, content creators, and AI developers looking to analyze successful articles, improve engagement, and fine-tune AI language models (LLMs). The data can be used to explore what makes articles perform well, including sentiment analysis, follower counts, and headline effectiveness.

    Dataset Contents

    • articles_50.db - Sample database with 50 articles (free version)

    The database includes pre-analyzed data such as sentiment scores, follower counts, and headline metadata, helping users gain insights into high-performing content.
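    Since the data ships as a SQLite database, a quick way to start exploring it (the table and column names are not documented here, so list them first):

      import sqlite3

      conn = sqlite3.connect("articles_50.db")
      cur = conn.cursor()

      # Discover the actual schema before querying.
      cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
      print(cur.fetchall())

      # Then query using the real table/column names, e.g. (hypothetical):
      # cur.execute("SELECT headline, sentiment FROM articles LIMIT 5")
      conn.close()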

    Use Cases

    • Content Strategy Optimization: Identify trends in successful AI-related articles to enhance your content approach.
    • Headline Crafting: Study patterns in top-performing headlines to create more compelling article titles.
    • LLM Fine-Tuning: Utilize the dataset to fine-tune AI models with real-world data on content performance.
    • Sentiment-Driven Content: Create content that resonates with readers by aligning with sentiment insights.

    This dataset is a valuable tool for anyone aiming to harness the power of data-driven insights to enhance their content or AI models.

  12. Japanese Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Japanese Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/japanese-closed-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The Japanese Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Japanese language, advancing the field of artificial intelligence.

    Dataset Content:

    This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Japanese. Each question includes a context paragraph from which the answer is derived. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Japanese people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Japanese Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
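    Based on the annotation fields listed above, a single JSON record might look like this (field names and values are invented for illustration):

      record = {
          "id": "jp-qa-00001",
          "context_paragraph": "...",
          "context_reference_link": "https://example.com/source",
          "question": "...",
          "question_type": "multiple-choice",
          "question_complexity": "medium",
          "question_category": "science",
          "domain": "general",
          "prompt_type": "instruction",
          "answer": "...",
          "answer_type": "short phrase",
          "rich_text_presence": False,
      }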

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    The Japanese version is grammatically accurate without any spelling or grammatical errors. No toxic or harmful content was used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Japanese Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  13. 300M Image-Caption Pairs – Large-Scale Vision-Language Dataset for AI...

    • m.nexdata.ai
    • nexdata.ai
    Updated Jan 30, 2025
    Cite
    Nexdata (2025). 300M Image-Caption Pairs – Large-Scale Vision-Language Dataset for AI Training [Dataset]. https://m.nexdata.ai/datasets/llm/1451?source=Github
    Explore at:
    Dataset updated
    Jan 30, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Data size, Data types, Data content, Data formats, Data resolution, Description languages
    Description

    300 Million Pairs of High-Quality Image-Caption Dataset includes a large-scale collection of photographic and vector images paired with English textual descriptions. The complete image library comprises nearly 300 million images, with a curated subset of 100 million high-quality image-caption pairs available for generative AI and vision-language model training. All images are authentic and legally licensed works created by professional photographers. The dataset primarily features English captions with minimal Chinese, offering diverse scenes, objects, and compositions suitable for tasks such as image captioning, visual question answering (VQA), image-text retrieval, and multimodal foundation model pretraining. The dataset supports large-scale LLM and VLM applications and complies with global data privacy and copyright regulations, including GDPR, CCPA, and PIPL.

  14. Finnish Open Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Finnish Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/finnish-open-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The Finnish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Finnish language, advancing the field of artificial intelligence.

    Dataset Content:

    This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Finnish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Finnish people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Finnish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
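    A short sketch of slicing the CSV release by these annotation fields (the file name and the exact value strings are placeholders):

      import pandas as pd

      qa = pd.read_csv("finnish_open_ended_qa.csv")  # placeholder file name

      # Build a focused fine-tuning split; the value strings in the complexity
      # and question_type columns are assumptions, so inspect them first.
      print(qa["complexity"].unique())
      hard_subset = qa[(qa["complexity"] == "hard") & (qa["question_type"] == "fact-based")]
      print(len(hard_subset))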

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    Both the question and answers in Finnish are grammatically accurate without any word or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Finnish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  15. 100K English Instruction Tuning Dataset – General Domain SFT for LLM...

    • nexdata.ai
    Updated Jul 11, 2025
    Cite
    Nexdata (2025). 100K English Instruction Tuning Dataset – General Domain SFT for LLM Fine-Tuning [Dataset]. https://www.nexdata.ai/datasets/llm/1814
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Language, Data volume, Data content
    Description

    100,000 Fine-Tuning Text Dataset for English LLM General Domain SFT is a high-quality supervised fine-tuning corpus designed to optimize instruction-following capabilities in large language models. Each data point is double-verified by experienced linguistic professionals and AI engineers to ensure relevance, clarity, and effectiveness in improving model alignment and response precision. The dataset supports instruction tuning tasks across a wide range of general knowledge domains and is compatible with leading open-source LLMs such as LLaMA, Falcon, GPT-NeoX, and Mistral. Ideal for use in alignment, safety tuning, and instruction-based generation enhancement, this dataset offers a robust foundation for model adaptation and performance improvement. All data complies with global data usage and privacy standards.

  16. Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global /...

    • datarade.ai
    .json, .csv
    Cite
    Coresignal, Coresignal | Clean Data | Company Data | AI-Enriched Datasets | Global / 35M+ Records / Updated Weekly [Dataset]. https://datarade.ai/data-products/coresignal-clean-data-company-data-ai-enriched-datasets-coresignal
    Explore at:
    Available download formats: .json, .csv
    Dataset authored and provided by
    Coresignal
    Area covered
    Guinea-Bissau, Hungary, Guatemala, Niue, Panama, Andorra, Guadeloupe, Chile, Namibia, Saint Barthélemy
    Description

    This clean dataset is a refined version of our company datasets, consisting of 35M+ data records.

    It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B data. After cleaning, this data is also enriched by leveraging a carefully instructed large language model (LLM).

    AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.

    For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select suitable delivery frequency (quarterly, monthly, or weekly).

    Coresignal is a leading public business data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.

  17. The big model fine-tuning data set of five key elements of tourism resources...

    • scidb.cn
    Updated Oct 17, 2024
    Cite
    lu bao qing; Wan Fucheng; Yu Hongzhi; Chen Min (2024). The big model fine-tuning data set of five key elements of tourism resources in the five northwestern provinces in 2024 [Dataset]. http://doi.org/10.57760/sciencedb.j00001.01088
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    Science Data Bank
    Authors
    lu bao qing; Wan Fucheng; Yu Hongzhi; Chen Min
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    With the wide application of large models in various fields, the demand for high-quality datasets in the tourism industry is increasing to support the improvement of the model's ability to understand and generate tourism information. This dataset focuses on textual data in the tourism domain and is designed to support fine-tuning tasks for tourism-oriented large models, aiming to enhance the model's ability to understand and generate tourism-related information. The diversity and quality of the dataset are critical to the model's performance. Therefore, this study combines web scraping and manual annotation techniques, along with data cleaning, denoising, and stopword removal, to ensure high data quality and accuracy. Additionally, automated annotation tools are used to generate instructions and perform consistency checks on the texts. The LLM-Tourism dataset primarily relies on data from Ctrip and Baidu Baike, covering five Northwestern Chinese provinces (Gansu, Ningxia, Qinghai, Shaanxi, and Xinjiang) and containing 53,280 pairs of structured data in JSON format. The creation of this dataset will not only improve the generation accuracy of tourism large models but also contribute to the sharing and application of tourism-related datasets in the field of large models.

  18. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Nandana Mihindukulasooriya; Sanju Tiwari; Sanju Tiwari; Carlos F. Enguix; Carlos F. Enguix; Kusum Lata; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nandana Mihindukulasooriya; Nandana Mihindukulasooriya; Sanju Tiwari; Sanju Tiwari; Carlos F. Enguix; Carlos F. Enguix; Kusum Lata; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets: (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences, and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by 
    American songwriters Gerry Goffin and Carole King."}
    

    An example of ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n", 
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", 
     "triples": [
     {
      "sub": "The Loco-Motion", 
      "rel": "publication date",
      "obj": "01 January 1962"
     },{
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
     },{
      "sub": "The Loco-Motion", 
      "rel": "lyrics by", 
      "obj": "Carole King"
     }]
    }
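
    A rough sketch of how a language model could be prompted for this task (the prompt format is illustrative; see the benchmark repository for the templates actually used):

      ontology_relations = ["publication date", "lyrics by"]  # subset of the Music ontology
      sentence = ('"The Loco-Motion" is a 1962 pop song written by American '
                  "songwriters Gerry Goffin and Carole King.")

      prompt = (
          "Extract (subject, relation, object) triples from the sentence.\n"
          f"Allowed relations: {', '.join(ontology_relations)}.\n"
          f"Sentence: {sentence}\n"
          "Triples:"
      )
      print(prompt)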
    

    The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    The structure of the repo is as follows.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages

  19. Large Language Model (LLM) Data | 10 Million POI Average Noise Levels | 35 B...

    • datarade.ai
    Updated Apr 9, 2025
    Cite
    Silencio Network (2025). Large Language Model (LLM) Data | 10 Million POI Average Noise Levels | 35 B + Data Points | 100% Traceable Consent [Dataset]. https://datarade.ai/data-products/ai-training-data-global-hyper-local-average-noise-levels-silencio-network
    Explore at:
    Available download formats: .json, .xml, .csv, .xls
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    Quickkonnect UG
    Authors
    Silencio Network
    Area covered
    Argentina, Slovenia, New Zealand, Kyrgyzstan, Congo (Democratic Republic of the), Falkland Islands (Malvinas), Denmark, American Samoa, Burundi, Nigeria
    Description

    Connect with our experts for Street and Venue Noise-Level Data. Unlock unique insights into the real-world acoustic environment of cities and venues across 180+ countries. Silencio has built the world’s largest database on noise levels, statistically interpolated using over 35 billion datapoints, developed in collaboration with leading acoustics professionals. Unlike traditional models that rely solely on computed estimations, our dataset uniquely combines real-world measurements with AI-driven predictions to deliver the most accurate and reliable noise-level data available today.

    Maximize AI Performance with the World’s Largest Real-World Noise-Level Dataset

    What sets our dataset apart? Silencio’s Street and Venue Noise-Level Data is the world’s largest and most accurate collection of real-world acoustic data, combining over 35 billion datapoints with AI-driven interpolation, developed together with professional acousticians. Unlike synthetic models, our dataset integrates real measurements and AI predictions to provide unparalleled ground truth for AI training.

    Designed for AI Applications: Empower your AI models with high-quality, diverse, and realistic acoustic data. Ideal for training AI in sound recognition, noise mapping, autonomous systems, smart cities, mobility intelligence, and beyond.

    Reliable & Compliant: Collected through our mobile app with explicit user consent, fully anonymized, and fully GDPR-compliant, ensuring ethical sourcing and regulatory alignment.

    Historical & Real-Time: Train models using both historical and continuously updated data to improve accuracy and robustness over time and across regions.

    Granular & Customizable: Globally available, highly granular, and adaptable to your AI pipeline needs — from raw acoustic datapoints to aggregated sound profiles.

    Simple Integration: Delivered via CSV exports or S3 bucket delivery (APIs coming soon), allowing smooth integration into your existing AI training workflows.

  20. Russian Brainstorming Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Russian Brainstorming Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/russian-brainstorming-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Russian Brainstorming Prompt-Response Dataset, a meticulously curated collection of 2000 prompt and response pairs. This dataset is a valuable resource for enhancing the creative and generative abilities of Language Models (LMs), a critical aspect in advancing generative AI.

    Dataset Content:

    This brainstorming dataset comprises a diverse set of prompts and responses, where each prompt contains an instruction, context, constraints, and restrictions, while the completion contains the most accurate list of responses for the given prompt. Both the prompts and completions are available in the Russian language.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Russian people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This dataset encompasses various prompt types, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. Additionally, you'll find prompts and responses containing rich text elements, such as tables, code, JSON, etc., all in proper markdown format.

    Prompt Diversity:

    To ensure diversity, our brainstorming dataset features prompts of varying complexity levels, ranging from easy to medium and hard. The prompts also vary in length, including short, medium, and long prompts, providing a comprehensive range. Furthermore, the dataset includes prompts with constraints and persona restrictions, making it exceptionally valuable for LLM training.

    Response Formats:

    Our dataset accommodates diverse learning experiences, offering responses across different domains depending on the prompt. For these brainstorming prompts, responses are generally provided in list format. These responses encompass text strings, numerical values, and dates, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Russian Brainstorming Prompt Completion Dataset is available in both JSON and CSV formats. It includes comprehensive annotation details, including a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, and the presence of rich text.

    Quality and Accuracy:

    Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Russian version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. We continuously work to expand this dataset, ensuring its ongoing growth and relevance. Additionally, FutureBeeAI offers the flexibility to curate custom brainstorming prompt and completion datasets tailored to specific requirements, providing you with customization options.

    License:

    This dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Russian Brainstorming Prompt-Completion Dataset to enhance the creative and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
