https://creativecommons.org/publicdomain/zero/1.0/
UPDATE: Following changes to the Twitter (X) API terms, the API is no longer free to use; the cheapest (hobby) plan costs $100/month. As a result, my automated ETL notebook stopped adding new tweets to this dataset on May 13th, 2023.
This dataset was updated every day with roughly 1,000 new tweets containing any of the words "ChatGPT", "GPT3", or "GPT4", starting from April 3rd, 2023. Each day's tweets are uploaded 24-72 hours after posting, so the counters for likes, retweets, replies and impressions have enough time to become meaningful. Tweets are in any language and are sampled randomly across all hours of the day. Basic filters are applied to try to discard sensitive tweets and spam.
This dataset can be used for many applications, from data analysis and visualization to NLP sentiment analysis techniques and more.
If you found this dataset and the scheduled ETL notebook that feeds it interesting, please consider upvoting them. Thanks!
tweet_id: Integer. Unique identifier for each tweet. Older tweets have smaller IDs.
tweet_created: Timestamp. Time of the tweet's creation.
tweet_extracted: Timestamp. The UTC time when the ETL pipeline pulled the tweet and its metadata (like count, retweet count, etc.).
text: String. The raw text of the tweet.
lang: String. Short code for the language of the tweet text.
user_id: Integer. Twitter's unique user ID.
user_name: String. The author's public name on Twitter.
user_username: String. The author's Twitter handle (@example).
user_location: String. The author's self-reported public location.
user_description: String. The author's public profile bio.
user_created: Timestamp. Time of the author's account creation.
user_followers_count: Integer. Number of followers of the author's account at the moment of extraction.
user_following_count: Integer. Number of accounts the author follows at the moment of extraction.
user_tweet_count: Integer. Number of tweets the author has published at the moment of extraction.
user_verified: Boolean. True if the user is verified (blue mark).
source: The device/app used to publish the tweet (apparently not working; all values are NaN so far).
retweet_count: Integer. Number of retweets at the moment of extraction.
like_count: Integer. Number of likes at the moment of extraction.
reply_count: Integer. Number of replies to the tweet.
impression_count: Integer. Number of times the tweet has been seen at the moment of extraction.
More info:
Tweet object definition: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet
User object definition: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user
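For illustration, here is a minimal sketch of loading one of the daily CSV files with pandas; the file name chatgpt_tweets.csv is a placeholder for whichever file you downloaded from this dataset:

```python
import pandas as pd

# Placeholder file name; substitute the CSV you downloaded from this dataset.
df = pd.read_csv(
    "chatgpt_tweets.csv",
    parse_dates=["tweet_created", "tweet_extracted", "user_created"],
)

# Engagement counters are snapshots taken at extraction time (24-72 h after
# posting), so inspect them together with tweet_extracted.
print(df[["tweet_created", "lang", "like_count", "retweet_count", "impression_count"]].head())
```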
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset captures a daily collection of tweets containing keywords such as "ChatGPT", "GPT3", or "GPT4". It was designed to provide a rich source of social media data for analysis, particularly for applications concerning Natural Language Processing (NLP) and sentiment analysis. The collection process began on 3rd April 2023, with approximately 1,000 tweets added daily. Tweets were extracted 24-72 hours after creation to allow for relevant engagement metrics like likes and retweets to accumulate. However, updates to this dataset ceased on 13th May 2023, due to changes in Twitter (X) API conditions, which introduced a cost for its use. The dataset includes tweets from various languages, selected randomly throughout the day, with basic filters applied to discard sensitive content and spam.
The dataset is provided in a CSV file format, generated from a Pandas DataFrame, with each row containing the tweet's text and its metadata, along with the author's information. The collection started on 3rd April 2023, adding approximately 1,000 tweets per day, and stopped updating on 13th May 2023. While specific total row counts are not available, various segments show substantial data, such as 43,000 tweets collected between 22nd September 2022 and 12th May 2023. Daily additions of 1,000 to 7,000 tweets are noted for the period of 8th April 2023 to 14th May 2023. The dataset includes unique values for over 25,000 tweet IDs, over 37,000 unique user IDs, and over 38,000 unique user locations.
This dataset is ideal for various data analysis and visualisation applications. It is particularly well-suited for Natural Language Processing (NLP) techniques, including sentiment analysis, to understand public opinion and trends related to ChatGPT, GPT3, and GPT4. Researchers can use it for social media listening, trend tracking, and studying the evolution of discussions around large language models.
The dataset primarily covers tweets from 3rd April 2023 to 13th May 2023, with some older tweets included, particularly from September 2022. Tweets are from any language, randomly selected globally. English (en) tweets constitute approximately 48% of the dataset, Japanese (ja) tweets make up about 23%, and other languages account for 30%. User locations vary widely, with a significant portion (41%) being null, 1% from Japan, and the remaining 59% from various other global locations.
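As a sketch, shares like these can be recomputed directly from the lang and user_location columns, assuming a DataFrame df loaded as in the snippet above:

```python
# Percentage of tweets per language (e.g. en ~48%, ja ~23%).
lang_share = df["lang"].value_counts(normalize=True).mul(100).round(1)
print(lang_share.head())

# Percentage of rows with no public user location (~41% null).
print(round(df["user_location"].isna().mean() * 100, 1), "% null locations")
```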
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset contains 14,934 instructions, contexts and responses in several natural-language task categories, such as classification, closed QA, and generation. The original English dataset was created by @databricks, who crowd-sourced the data creation among its employees. The current dataset is a translation of that dataset into Dutch via ChatGPT (gpt-3.5-turbo).
Data Instances
{ "id": 14963, "instruction": "Wat zijn de duurste steden ter wereld?", "context": "", "response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, GenĆØve, San Francisco, Parijs en Sydney.", "category": "brainstorming" }
Data Fields
id: the ID of the item. The following 77 IDs are not included because they could not be translated (or were too long): [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966]
instruction: the instruction (question)
context: additional context that the AI can use to answer the question
response: the AI's expected response
category: the category of this type of question (see Dolly for more info)
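A short sketch of loading the dataset from the Hugging Face hub with the datasets library; the repository id below is a placeholder, since the card only states that the hub hosts the canonical repository:

```python
from datasets import load_dataset

# Placeholder repo id; replace with the canonical Hugging Face repository name.
ds = load_dataset("your-namespace/dolly-15k-dutch", split="train")

example = ds[0]
print(example["id"], example["category"])
print(example["instruction"])
print(example["response"])
```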
Dataset Creation
Both the translations and the topics were generated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.

Here are the requirements that you should adhere to:
1. maintain the format: the task consists of a task instruction (marked "instruction:"), optional context to the task (marked "context:") and response for the task (marked "response:");
2. do not translate the identifiers "instruction:", "context:", and "response:" but instead copy them to your output;
3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
4. translate the instruction and context text using informal, but standard, language;
5. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is);
8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.

Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The system message was:
You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.
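The card does not include the calling code itself; the following is a minimal sketch of what the call could look like with today's openai Python SDK (the original 2023 pipeline used the older openai 0.x interface, and the helper name translate_task is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_task(task_text: str, src_lang: str = "English", tgt_lang: str = "Dutch") -> str:
    # Parameters mirror the card: gpt-3.5-turbo, max_tokens=1024, temperature=0.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=1024,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that translates English "
                           "to Dutch according to the requirements that are given to you.",
            },
            {
                "role": "user",
                "content": CONVERSATION_TRANSLATION_PROMPT.format(
                    src_lang=src_lang, tgt_lang=tgt_lang
                ) + task_text,
            },
        ],
    )
    return response.choices[0].message.content
```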
Note that 77 items (0.5%) were not successfully translated. This can either mean that the prompt was too long for the given limit (max_tokens=1024) or that the generated translation could not be parsed into instruction, context and response fields. The missing IDs are [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966].
Initial Data Collection and Normalization
Initial data collection by databricks. See their repository for more information about this dataset.
Considerations for Using the Data
Note that the translations in this new dataset have not been verified by humans! Use at your own risk, both in terms of quality and biases.
Discussion of Biases
As with any machine-generated text, users should be aware of potential biases in this dataset. Although the prompt explicitly requires the model to "make sure to avoid biases (such as gender bias, grammatical bias, social bias)", the actual effect of such an instruction is not known. It is likely that biases remain in the dataset, so use it with caution.
Other Known Limitations
The translation quality has not been verified. Use at your own risk!
Licensing Information
This repository follows the original databricks license, which is CC BY-SA 3.0 but see below for a specific restriction.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI's large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub, its canonical repository.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Background: Several chatbots that utilize large language models now exist. As a particularly well-known example, ChatGPT employs an autoregressive modeling process to generate responses, predicting the next word based on previously derived words. Consequently, instead of deducing a correct answer, it arranges the most frequently appearing words in the learned data in order. Optimized for interactivity and content generation, it presents a smooth and plausible context, regardless of whether the content it presents is true. This report aimed to examine the reliability of ChatGPT, an artificial intelligence (AI) chatbot, in diagnosing diseases and treating patients, how to interpret its responses, and directions for future development.

Current Concepts: Ten published case reports from Korea were analyzed to evaluate the efficacy of ChatGPT, which was asked to describe the correct diagnosis and treatment. ChatGPT answered 3 cases correctly after being provided with the patient's symptoms, findings, and medical history. The accuracy rate increased to 7 out of 10 after adding laboratory, pathological, and radiological results. In one case, ChatGPT did not provide appropriate information about suitable treatment, and its response contained inappropriate content in 4 cases. In contrast, ChatGPT recommended appropriate measures in 4 cases.

Discussion and Conclusion: ChatGPT's responses to the 10 case reports could have been better. To utilize ChatGPT efficiently and appropriately, users should possess sufficient knowledge and skills to determine the validity of its responses. AI chatbots based on large language models will progress significantly, but physicians must be vigilant in using these tools in practice.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In October and November 2023, researchers at the Schwartz Reisman Institute for Technology and Society and the Policy, Elections and Representation Lab at the Munk School of Global Affairs and Public Policy at the University of Toronto completed a survey on public perceptions of and attitudes toward AI. The survey was administered to over 1,000 people in each of 21 countries, for a total of 23,882 surveys conducted in 12 languages. The combined populations of the countries sampled represent a majority of the world's population.

Countries: Argentina, Australia, Brazil, Canada, Chile, China, France, Germany, India, Indonesia, Italy, Japan, Kenya, Mexico, Pakistan, Poland, Portugal, South Africa, Spain, United Kingdom, United States of America.

Languages: Chinese (Simplified), English, French, German, Indonesian, Italian, Japanese, Polish, Portuguese (Portugal), Portuguese (Brazil), Spanish (Spain), Spanish (Latin America).

The survey explored general knowledge of and attitudes toward AI. Topics included concerns about AI, safety, regulation, autonomous vehicles, and AI's effect on jobs now and in the future. Participants were asked whether they are interested in or trust applications of AI for clothes, travel, grocery shopping, dating or finance. Respondents were asked about their attitudes toward the use of emerging technologies in education, the justice system, health care and immigration. Respondents were also asked about their knowledge of and experience with ChatGPT and deepfakes.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Nanozymes are unique materials with many valuable properties for applications in biomedicine, biosensing, environmental monitoring, and beyond. In this work, we developed a machine learning (ML) approach to search for new nanozymes and deployed a web platform, DiZyme, featuring a state-of-the-art database of nanozymes containing 1210 experimental samples, catalytic activity prediction, and DiZyme Assistant interface powered by a large language model (LLM). For the first time, we enable the prediction of multiple catalytic activities of nanozymes by training an ensemble learning algorithm achieving R2 = 0.75 for the Michaelis-Menten constant and R2 = 0.77 for the maximum velocity on unseen test data. We envision an accurate prediction of multiple catalytic activities (peroxidase, oxidase, and catalase) promoting novel applications for a wide range of surface-modified inorganic nanozymes. The DiZyme Assistant based on the ChatGPT model provides users with supporting information on experimental samples, such as synthesis procedures, measurement protocols, etc. DiZyme (dizyme.aicidlab.itmo.ru) is now openly available worldwide.
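The abstract does not include training code; purely as an illustration of ensemble regression on tabular descriptors, the sketch below uses synthetic features that stand in for real nanozyme descriptors and is not DiZyme's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1210 samples (the database size) with 16 made-up
# descriptors; y mimics a log-scaled Michaelis-Menten constant.
rng = np.random.default_rng(0)
X = rng.normal(size=(1210, 16))
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1210)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)
print("Held-out R2:", round(r2_score(y_test, model.predict(X_test)), 2))
```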