5 datasets found
  1. Databricks Dolly (15K)

    • kaggle.com
    • huggingface.co
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). Databricks Dolly (15K) [Dataset]. https://www.kaggle.com/datasets/thedevastator/databricks-chatgpt-dataset/code
    Available download formats: zip (4621394 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Databricks Dolly (15K)

    Over 15,000 Language Models and Dialogues for Interactive Chat Applications

    By Huggingface Hub [source]

    About this dataset

    This dataset, created by Databricks employees, provides more than 15,000 prompt-response pairs for building interactive chat applications. The pairs span 8 instruction categories, and contributors avoided drawing on any web source other than Wikipedia, which serves as reference material for particular instruction sets. Use this open-source dataset to explore the boundaries of text-based conversations and uncover new insights about natural language processing.


    How to use the dataset

    First, let's look at the columns in this dataset: Instruction (string), Context (string), Response (string), and Category (string). Each record represents a prompt-response pair: the Instruction and Context fields hold the prompt from one side, and the Response holds the reply. Each pair is classified into one of 8 categories based on its content, and knowing this structure helps you put the corpus to your intended use.

    For example, if you are training a dialogue system, you could build pipelines over this dataset to enrich your model with realistic conversations or to drive chatbot interactions. For Q&A systems, some instruction categories include excerpts from Wikipedia as context, which can be combined with the prompt-response pairs in those categories. Furthermore, since each record carries one of 8 category labels (such as open QA or summarization), the labels lend themselves to supervised learning techniques such as multi-class neural networks or logistic regression classifiers.

    In short, this resource offers many creative ways to explore dialogue-related applications without depending on external web data; all that's needed is your own imagination.
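    As a minimal sketch of working with the four columns, the snippet below counts records per category, a natural first step before training a multi-class classifier on the labels. The example rows (and the category names inside them) are illustrative stand-ins, not values taken from the actual train.csv file.

```python
from collections import Counter

# Illustrative records mirroring the dataset's four columns
# (Instruction, Context, Response, Category). These rows are made up
# for the sketch; the real data lives in train.csv.
records = [
    {"Instruction": "What is the capital of France?", "Context": "",
     "Response": "Paris.", "Category": "open_qa"},
    {"Instruction": "Summarize the passage.", "Context": "Long text ...",
     "Response": "A short summary.", "Category": "summarization"},
    {"Instruction": "Name three primary colors.", "Context": "",
     "Response": "Red, yellow, blue.", "Category": "brainstorming"},
]

# Count how many prompt-response pairs fall into each category --
# useful for checking class balance before supervised training.
category_counts = Counter(r["Category"] for r in records)
print(category_counts.most_common())
```

    The same one-liner scales to the full file once it is loaded, and the resulting counts tell you whether stratified sampling is needed when splitting into train and test sets.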

    Research Ideas

    • Generating deep learning models to detect and respond to conversational intent.
    • Training language models to use natural language processing (NLP) for customer service queries.
    • Creating custom dialogue agents that are better able to handle more complex conversational interactions, such as those powered by machine learning techniques like supervised or unsupervised learning methods

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | Instruction | Text prompt that should generate an appropriate response from a machine learning model/chatbot using natural language processing techniques. (Text) |
    | Context | Provides context to improve accuracy by giving the model more information about what's happening in a conversation or request execution. (Text) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  2. Job Dataset

    • kaggle.com
    zip
    Updated Sep 17, 2023
    Cite
    Ravender Singh Rana (2023). Job Dataset [Dataset]. https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset
    Available download formats: zip (479575920 bytes)
    Dataset updated
    Sep 17, 2023
    Authors
    Ravender Singh Rana
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Job Dataset

    This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.

    Descriptions for each of the columns in the dataset:

    1. Job Id: A unique identifier for each job posting.
    2. Experience: The required or preferred years of experience for the job.
    3. Qualifications: The educational qualifications needed for the job.
    4. Salary Range: The range of salaries or compensation offered for the position.
    5. Location: The city or area where the job is located.
    6. Country: The country where the job is located.
    7. Latitude: The latitude coordinate of the job location.
    8. Longitude: The longitude coordinate of the job location.
    9. Work Type: The type of employment (e.g., full-time, part-time, contract).
    10. Company Size: The approximate size or scale of the hiring company.
    11. Job Posting Date: The date when the job posting was made public.
    12. Preference: Special preferences or requirements for applicants (e.g., Male only, Female only, or Both).
    13. Contact Person: The name of the contact person or recruiter for the job.
    14. Contact: Contact information for job inquiries.
    15. Job Title: The job title or position being advertised.
    16. Role: The role or category of the job (e.g., software developer, marketing manager).
    17. Job Portal: The platform or website where the job was posted.
    18. Job Description: A detailed description of the job responsibilities and requirements.
    19. Benefits: Information about benefits offered with the job (e.g., health insurance, retirement plans).
    20. Skills: The skills or qualifications required for the job.
    21. Responsibilities: Specific responsibilities and duties associated with the job.
    22. Company Name: The name of the hiring company.
    23. Company Profile: A brief overview of the company's background and mission.

    Potential Use Cases:

    • Building predictive models to forecast job market trends.
    • Enhancing job recommendation systems for job seekers.
    • Developing NLP models for resume parsing and job matching.
    • Analyzing regional job market disparities and opportunities.
    • Exploring salary prediction models for various job roles.
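    As one concrete sketch for the salary-prediction idea above, the Salary Range column must first be converted to numbers. The parser below assumes a "$56K-$116K"-style encoding; that format is a guess for illustration, not something stated in this listing.

```python
import re

def parse_salary_range(raw):
    """Parse a salary-range string such as '$56K-$116K' into a
    (min, max) tuple of dollar amounts. The '$NNK-$NNK' format is an
    assumption about how the Salary Range column is encoded."""
    m = re.match(r"\$(\d+)K\s*-\s*\$(\d+)K", raw)
    if not m:
        return None  # unparseable values are left for manual review
    lo, hi = (int(g) * 1000 for g in m.groups())
    return lo, hi

print(parse_salary_range("$56K-$116K"))  # -> (56000, 116000)
```

    The numeric bounds (or their midpoint) can then serve as the regression target, with the text columns such as Job Description and Skills as features.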

    Acknowledgements:

    We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.

    Note:

    Please note that the examples provided are fictional and for illustrative purposes only. The dataset is synthetic and not suitable for real-world applications; it should be used only within the scope of research and experimentation. You can reach me via email at: rrana157@gmail.com

  3. AI Hallucination Cases Database

    • damiencharlotin.com
    Updated Nov 17, 2025
    Cite
    Damien Charlotin (2025). AI Hallucination Cases Database [Dataset]. https://www.damiencharlotin.com/hallucinations/
    Dataset updated
    Nov 17, 2025
    Authors
    Damien Charlotin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A curated database of legal cases where generative AI produced hallucinated citations submitted in court filings.

  4. Customer Shopping Trends Dataset

    • kaggle.com
    zip
    Updated Oct 5, 2023
    Cite
    Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset
    Available download formats: zip (149846 bytes)
    Dataset updated
    Oct 5, 2023
    Authors
    Sourav Banerjee
    Description

    Context

    The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. Note that this is a synthetic dataset created for beginners to learn data analysis and machine learning.

    Content

    This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

    Dataset Glossary (Column-wise)

    • Customer ID - Unique identifier for each customer
    • Age - Age of the customer
    • Gender - Gender of the customer (Male/Female)
    • Item Purchased - The item purchased by the customer
    • Category - Category of the item purchased
    • Purchase Amount (USD) - The amount of the purchase in USD
    • Location - Location where the purchase was made
    • Size - Size of the purchased item
    • Color - Color of the purchased item
    • Season - Season during which the purchase was made
    • Review Rating - Rating given by the customer for the purchased item
    • Subscription Status - Indicates if the customer has a subscription (Yes/No)
    • Shipping Type - Type of shipping chosen by the customer
    • Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)
    • Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)
    • Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction
    • Payment Method - Customer's most preferred payment method
    • Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)
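    A small sketch of the kind of analysis this glossary enables: averaging Purchase Amount (USD) per Category. The three rows below are invented for illustration; the real file contains 3900 records with all the columns listed above.

```python
from collections import defaultdict

# Illustrative rows using two of the glossary columns; the values are
# made up, not drawn from the actual dataset.
rows = [
    {"Category": "Clothing", "Purchase Amount (USD)": 53.0},
    {"Category": "Clothing", "Purchase Amount (USD)": 31.0},
    {"Category": "Footwear", "Purchase Amount (USD)": 80.0},
]

# Accumulate (sum, count) per category, then divide.
totals = defaultdict(lambda: [0.0, 0])
for row in rows:
    t = totals[row["Category"]]
    t[0] += row["Purchase Amount (USD)"]
    t[1] += 1

avg_by_category = {cat: s / n for cat, (s, n) in totals.items()}
print(avg_by_category)  # -> {'Clothing': 42.0, 'Footwear': 80.0}
```

    The same grouping pattern extends to any of the categorical columns, e.g. average spend by Season or by Subscription Status.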

    Structure of the Dataset

    https://i.imgur.com/6UEqejq.png

    Acknowledgement

    This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

    Cover Photo by: Freepik

    Thumbnail by: Clothing icons created by Flat Icons - Flaticon

  5. Hotel Reservations Data

    • kaggle.com
    zip
    Updated Mar 4, 2024
    Cite
    Dimitris Angelides (2024). Hotel Reservations Data [Dataset]. https://www.kaggle.com/datasets/dimitrisangelide/hotel-reservations-data
    Available download formats: zip (2567615 bytes)
    Dataset updated
    Mar 4, 2024
    Authors
    Dimitris Angelides
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Tourism and travel account for more than 10% of worldwide GDP, and the industry's share of the global economy continues to grow. At the same time, it generates huge volumes of data, and taking advantage of that data can help businesses stand out from the crowd.

    Content

    The dataset provides reservations data for two consecutive seasons (2021 - 2023) of a luxury hotel.

    Source

    ChatGPT 3.5 (OpenAI) generated most of the dataset; I made minor adjustments to ensure it contains the desired fields and values.

    Inspiration

    • How effectively is the hotel performing across key metrics?
    • How are bookings distributed across different channels (e.g., Booking Platform, Phone, Walk-in, and Website)?
    • What is the current occupancy rate, and how does it compare to the same period last year?
    • What are the demographics of the current guests (e.g., nationality)?
    • What is the average daily rate (ADR) per room?

    These are examples of interesting questions that could be answered by analyzing this dataset.
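    One of the questions above, the average daily rate (ADR), has a simple definition: room revenue divided by room-nights sold. The sketch below uses hypothetical revenue and nights fields on invented bookings; this listing does not name the dataset's actual column names.

```python
# Average daily rate (ADR) = room revenue / number of room-nights sold.
# The bookings and their field names are illustrative assumptions,
# not confirmed by the dataset description.
bookings = [
    {"revenue": 900.0, "nights": 3},  # 3 room-nights at 300/night
    {"revenue": 400.0, "nights": 2},  # 2 room-nights at 200/night
]

room_nights = sum(b["nights"] for b in bookings)
revenue = sum(b["revenue"] for b in bookings)
adr = revenue / room_nights
print(f"ADR: {adr:.2f}")  # -> ADR: 260.00
```

    Computed per month or per season, the same ratio supports the year-over-year comparisons the questions above ask about.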

    If you are interested, please have a look at the Tableau dashboard that I have created to help answer the above questions. Tableau dashboard: https://public.tableau.com/app/profile/dimitris.angelides/viz/HotelExecutiveDashboards/HotelExecutiveSummaryReport?publish=yes


Databricks Dolly (15K): 247 scholarly articles cite this dataset (view in Google Scholar).