5 datasets found

Databricks Dolly (15K)
kaggle.com
huggingface.co
zip
Updated Nov 24, 2023
Cite
The Devastator (2023). Databricks Dolly (15K) [Dataset]. https://www.kaggle.com/datasets/thedevastator/databricks-chatgpt-dataset/code
Available download formats: zip (4621394 bytes)
Dataset updated
Nov 24, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Databricks Dolly (15K)

Over 15,000 Language Models and Dialogues for Interactive Chat Applications

By Huggingface Hub [source]

About this dataset

This dataset, created by Databricks employees, provides more than 15,000 prompt-response records for training large language models to handle ChatGPT-style interactive dialogue. The prompt-response pairs span 8 instruction categories, and contributors avoided information taken from any web source except Wikipedia, which was allowed for particular instruction sets. Use this open-source dataset to explore text-based conversation and natural language processing.

How to use the dataset

First, let's take a look at the columns in this dataset: Instruction (string), Context (string), Response (string), and Category (string). Each record is a single prompt-response pair. The Instruction and Context fields together hold the prompt, and the Response field holds the reply to it. Each pair is classified into one of 8 categories based on its content. Knowing this layout helps you adapt the corpus to your purposes.
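
As a minimal sketch of the column layout described above, the following reads records and joins Instruction and Context into one prompt string. The sample rows, the category values, the prompt template, and the helper names `build_prompt` and `load_pairs` are all illustrative assumptions. The header spelling in the real train.csv may also differ, so adjust the keys to match the file.

```python
import csv
import io

# Made-up sample rows in the four-column layout described above; they are
# illustrative only and are not taken from train.csv.
SAMPLE_CSV = """Instruction,Context,Response,Category
Summarize the passage.,The river floods every spring.,The river floods each spring.,summarization
Name three primary colors.,,"Red, yellow, and blue.",brainstorming
"""


def build_prompt(row):
    """Join Instruction and optional Context into a single prompt string."""
    instruction = row["Instruction"].strip()
    context = (row.get("Context") or "").strip()
    if context:
        return f"Instruction: {instruction}\nContext: {context}\nResponse:"
    return f"Instruction: {instruction}\nResponse:"


def load_pairs(handle):
    """Return (prompt, response) pairs from a CSV file-like object."""
    return [(build_prompt(row), row["Response"].strip())
            for row in csv.DictReader(handle)]


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
for prompt, response in pairs:
    print(prompt, response)
```

To run this against the real file, pass an open file handle (for example `open("train.csv", newline="", encoding="utf-8")`) to `load_pairs` in place of the in-memory sample.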

For example, if you are training a dialogue system, you could use this dataset to fine-tune your model on human-written prompt-response pairs or to build chatbot interactions. If you want to generate natural language answers as part of a Q&A system, you could use the Wikipedia excerpts supplied for particular instruction categories together with the prompt-response pairs in those categories, all from within the Databricks set. Furthermore, since each record is independently labeled with one of 8 defined categories, the labels can serve as targets for supervised learning techniques such as multi-class neural network classifiers or logistic regression classifiers.
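
Before training any classifier on the Category labels, it usually helps to hold out a test set that keeps every label represented. The sketch below shows one way to do a stratified split with the standard library only. The function name `stratified_split`, the 20% test fraction, and the placeholder labels are assumptions for illustration, not part of the dataset.

```python
import random
from collections import Counter, defaultdict


def stratified_split(records, label_key="Category", test_fraction=0.2, seed=0):
    """Split records into train/test lists, sampling each label separately."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Hold out at least one record per label when the group has more than one.
        n_test = max(1, round(len(group) * test_fraction)) if len(group) > 1 else 0
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical records: 10 per made-up label.
records = [{"Instruction": f"question {i}", "Category": label}
           for label in ("label_a", "label_b", "label_c")
           for i in range(10)]

train, test = stratified_split(records)
print(Counter(r["Category"] for r in train))
print(Counter(r["Category"] for r in test))
```

Splitting per label rather than over the whole list keeps rare categories from disappearing from the test set by chance.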

In short, this resource supports a range of dialogue-related applications without requiring data from external web sources.

Research Ideas

Generating deep learning models to detect and respond to conversational intent.

Training language models to use natural language processing (NLP) for customer service queries.

Creating custom dialogue agents that handle more complex conversational interactions, trained with supervised or unsupervised learning methods.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

Columns

File: train.csv

| Column name | Description |
|:------------|:------------|
| Instruction | Text prompt that should generate an appropriate response from a machine learning model/chatbot using natural language processing techniques. (Text) |
| Context | Provides context to improve accuracy by giving the model more information about what’s happening in a conversation or request execution. (Text) |

Acknowledgements

If you use this dataset in your research, please credit the original authors and Huggingface Hub.
Job Dataset
kaggle.com
zip
Updated Sep 17, 2023
Cite
Ravender Singh Rana (2023). Job Dataset [Dataset]. https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset
Available download formats: zip (479575920 bytes)
Dataset updated
Sep 17, 2023
Authors
Ravender Singh Rana
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Job Dataset

This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.

Descriptions for each of the columns in the dataset:

Job Id: A unique identifier for each job posting.

Experience: The required or preferred years of experience for the job.

Qualifications: The educational qualifications needed for the job.

Salary Range: The range of salaries or compensation offered for the position.

Location: The city or area where the job is located.

Country: The country where the job is located.

Latitude: The latitude coordinate of the job location.

Longitude: The longitude coordinate of the job location.

Work Type: The type of employment (e.g., full-time, part-time, contract).

Company Size: The approximate size or scale of the hiring company.

Job Posting Date: The date when the job posting was made public.

Preference: Special preferences or requirements for applicants (e.g., Only Male, Only Female, or Both).

Contact Person: The name of the contact person or recruiter for the job.

Contact: Contact information for job inquiries.

Job Title: The job title or position being advertised.

Role: The role or category of the job (e.g., software developer, marketing manager).

Job Portal: The platform or website where the job was posted.

Job Description: A detailed description of the job responsibilities and requirements.

Benefits: Information about benefits offered with the job (e.g., health insurance, retirement plans).

Skills: The skills or qualifications required for the job.

Responsibilities: Specific responsibilities and duties associated with the job.

Company Name: The name of the hiring company.

Company Profile: A brief overview of the company's background and mission.

Potential Use Cases:

Building predictive models to forecast job market trends.

Enhancing job recommendation systems for job seekers.

Developing NLP models for resume parsing and job matching.

Analyzing regional job market disparities and opportunities.

Exploring salary prediction models for various job roles.
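
As a rough sketch of the regional-analysis and salary use cases above, the following turns Salary Range strings into numeric midpoints and takes the median per country. The description does not specify how Salary Range is formatted, so the `$59K-$99K` shape assumed by the regular expression is a guess to verify against the file. The sample rows and the helper names are likewise made up for illustration.

```python
import re
from collections import defaultdict
from statistics import median

# Assumed shape of a Salary Range value, e.g. "$59K-$99K".
SALARY_RE = re.compile(r"\$(\d+)K\s*-\s*\$(\d+)K")


def salary_midpoint(text):
    """Return the midpoint of a salary range in USD, or None if unparseable."""
    match = SALARY_RE.fullmatch(text.strip())
    if match is None:
        return None
    low, high = (int(g) * 1000 for g in match.groups())
    return (low + high) / 2


def median_midpoint_by(rows, key="Country"):
    """Group rows by `key` and return the median salary midpoint per group."""
    groups = defaultdict(list)
    for row in rows:
        mid = salary_midpoint(row["Salary Range"])
        if mid is not None:
            groups[row[key]].append(mid)
    return {name: median(values) for name, values in groups.items()}


# Made-up rows for illustration.
rows = [
    {"Country": "A", "Salary Range": "$50K-$70K"},
    {"Country": "A", "Salary Range": "$60K-$100K"},
    {"Country": "B", "Salary Range": "$40K-$60K"},
    {"Country": "B", "Salary Range": "negotiable"},
]
result = median_midpoint_by(rows)
print(result)
```

Rows whose salary text does not match the assumed pattern are skipped rather than guessed at, which keeps the summary from being skewed by parsing errors.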

Acknowledgements:

We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.

Note:

Please note that the job postings in this dataset are fictional and for illustrative purposes. The dataset is not suitable for real-world applications and should only be used within the scope of research and experimentation. You can also reach me via email at: rrana157@gmail.com
AI Hallucination Cases Database
damiencharlotin.com
Updated Nov 17, 2025
Cite
Damien Charlotin (2025). AI Hallucination Cases Database [Dataset]. https://www.damiencharlotin.com/hallucinations/
Dataset updated
Nov 17, 2025
Authors
Damien Charlotin
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
A curated database of legal cases where generative AI produced hallucinated citations submitted in court filings.
Customer Shopping Trends Dataset
kaggle.com
zip
Updated Oct 5, 2023
Cite
Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset
Available download formats: zip (149846 bytes)
Dataset updated
Oct 5, 2023
Authors
Sourav Banerjee
Description
Context

The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. Note that this is a synthetic dataset created for beginners to learn more about data analysis and machine learning.

Content

This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

Dataset Glossary (Column-wise)

Customer ID - Unique identifier for each customer

Age - Age of the customer

Gender - Gender of the customer (Male/Female)

Item Purchased - The item purchased by the customer

Category - Category of the item purchased

Purchase Amount (USD) - The amount of the purchase in USD

Location - Location where the purchase was made

Size - Size of the purchased item

Color - Color of the purchased item

Season - Season during which the purchase was made

Review Rating - Rating given by the customer for the purchased item

Subscription Status - Indicates if the customer has a subscription (Yes/No)

Shipping Type - Type of shipping chosen by the customer

Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)

Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)

Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction

Payment Method - Customer's most preferred payment method

Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)
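
As one small example of working with the columns in the glossary above, the sketch below computes the mean Purchase Amount (USD) per Category using only the standard library. The sample rows are made up, and the function name `mean_purchase_by_category` is a hypothetical helper rather than anything shipped with the dataset.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using a few of the column names from the glossary.
SAMPLE_CSV = """Customer ID,Category,Purchase Amount (USD),Discount Applied
1,Clothing,40,Yes
2,Clothing,60,No
3,Footwear,90,Yes
"""


def mean_purchase_by_category(handle):
    """Return {category: mean purchase amount in USD}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(handle):
        category = row["Category"]
        totals[category] += float(row["Purchase Amount (USD)"])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


means = mean_purchase_by_category(io.StringIO(SAMPLE_CSV))
print(means)
```

The same grouping pattern applies to other glossary columns, such as Season or Payment Method, by changing the key read from each row.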

Structure of the Dataset

https://i.imgur.com/6UEqejq.png

Acknowledgement

This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

Cover Photo by: Freepik

Thumbnail by: Clothing icons created by Flat Icons - Flaticon
Hotel Reservations Data
kaggle.com
zip
Updated Mar 4, 2024
Cite
Dimitris Angelides (2024). Hotel Reservations Data [Dataset]. https://www.kaggle.com/datasets/dimitrisangelide/hotel-reservations-data
Available download formats: zip (2567615 bytes)
Dataset updated
Mar 4, 2024
Authors
Dimitris Angelides
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Tourism and travel account for more than 10% of worldwide GDP, and their share is trending upward. The industry also generates huge volumes of data, and taking advantage of that data can help businesses stand out from the crowd.

Content

The dataset provides reservations data for two consecutive seasons (2021 - 2023) of a luxury hotel.

Source

ChatGPT 3.5 (OpenAI) is the main creator of the dataset. I made minor adjustments to ensure that the dataset contains the desired fields and values.

Inspiration

• How effectively is the hotel performing across key metrics?
• How are bookings distributed across different channels (e.g., Booking Platform, Phone, Walk-in, and Website)?
• What is the current occupancy rate and how does it compare to the same period last year?
• What are the demographics of the current guests (e.g., nationality)?
• What is the average daily rate (ADR) per room?

These are examples of interesting questions that could be answered by analyzing this dataset.
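
To make a few of these questions concrete, here is a minimal sketch of the usual definitions: ADR as room revenue divided by room nights sold, occupancy as room nights sold divided by room nights available, and a simple count of bookings per channel. The listing does not show the dataset's column names, so the record fields (`channel`, `room_nights`, `room_revenue`), the sample values, and the room count are all hypothetical.

```python
from collections import Counter

# Hypothetical reservation records; the field names are illustrative.
reservations = [
    {"channel": "Website", "room_nights": 3, "room_revenue": 900.0},
    {"channel": "Phone", "room_nights": 2, "room_revenue": 500.0},
    {"channel": "Website", "room_nights": 5, "room_revenue": 1600.0},
]


def average_daily_rate(records):
    """ADR = total room revenue / total room nights sold."""
    nights = sum(r["room_nights"] for r in records)
    revenue = sum(r["room_revenue"] for r in records)
    return revenue / nights if nights else 0.0


def occupancy_rate(records, rooms_available, days):
    """Occupancy = room nights sold / room nights available in the period."""
    capacity = rooms_available * days
    sold = sum(r["room_nights"] for r in records)
    return sold / capacity if capacity else 0.0


def bookings_by_channel(records):
    """Count reservations per booking channel."""
    return Counter(r["channel"] for r in records)


adr = average_daily_rate(reservations)
occupancy = occupancy_rate(reservations, rooms_available=5, days=4)
channels = bookings_by_channel(reservations)
print(adr, occupancy, channels)
```

Comparing occupancy with the same period last year amounts to calling `occupancy_rate` on two date-filtered subsets and comparing the results.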

If you are interested, please have a look at the Tableau dashboard that I have created to help answer the above questions. Tableau dashboard: https://public.tableau.com/app/profile/dimitris.angelides/viz/HotelExecutiveDashboards/HotelExecutiveSummaryReport?publish=yes
247 scholarly articles cite the Databricks Dolly (15K) dataset (per Google Scholar).
