6 datasets found
  1. #ChatGPT 1000 Daily 🐦 Tweets

    • kaggle.com
    Updated May 14, 2023
    Cite
    Enric Domingo (2023). #ChatGPT 1000 Daily 🐦 Tweets [Dataset]. http://doi.org/10.34740/kaggle/dsv/5685262
    Dataset updated
    May 14, 2023
    Dataset provided by
    Kaggle
    Authors
    Enric Domingo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    UPDATE: Under the new Twitter (X) API conditions introduced by Elon Musk, the API is no longer free to use (the hobby plan costs $100/month), so the automated ETL notebook stopped adding new tweets to this dataset on May 13th, 2023.

    This dataset was updated every day with 1,000 new tweets containing any of the words "ChatGPT", "GPT3", or "GPT4", starting from April 3rd, 2023. Each day's tweets are uploaded 24-72 hours later, so the counters for likes, retweets, replies and impressions have enough time to become meaningful. Tweets are in any language and are sampled randomly from all hours of the day. Basic filters are applied to try to discard sensitive tweets and spam.

    This dataset can be used for many applications in data analysis and visualization, as well as NLP techniques such as sentiment analysis.

    If you found this dataset and the scheduled ETL notebook that fed it useful, please consider upvoting them. Thanks! 🤗

    Column Descriptions:

    • tweet_id: Integer. Unique identifier for each tweet; older tweets have smaller IDs.

    • tweet_created: Timestamp. Time of the tweet's creation.

    • tweet_extracted: Timestamp. The UTC time when the ETL pipeline pulled the tweet and its metadata (like count, retweet count, etc.).

    • text: String. The raw text of the tweet.

    • lang: String. Short code for the language of the tweet's text.

    • user_id: Integer. The author's unique Twitter user ID.

    • user_name: String. The author's public name on Twitter.

    • user_username: String. The author's Twitter account username (@example).

    • user_location: String. The author's public location.

    • user_description: String. The author's public profile bio.

    • user_created: Timestamp. Creation time of the author's Twitter account.

    • user_followers_count: Integer. The number of followers of the author's account at the moment of tweet extraction.

    • user_following_count: Integer. The number of accounts the author follows at the moment of tweet extraction.

    • user_tweet_count: Integer. The number of tweets the author had published at the moment of tweet extraction.

    • user_verified: Boolean. True if the user is verified (blue mark).

    • source: String. The device/app used to publish the tweet (apparently not working; all values are NaN so far).

    • retweet_count: Integer. Number of retweets at the moment of tweet extraction.

    • like_count: Integer. Number of likes at the moment of tweet extraction.

    • reply_count: Integer. Number of reply messages to the tweet.

    • impression_count: Integer. Number of times the tweet had been seen at the moment of tweet extraction.

    More info:
    Tweet object definition: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet
    User object definition: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user
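
    As a quick-start illustration, the sketch below loads the dataset with pandas and orders it chronologically. The CSV filename is an assumption (check the file listing on the Kaggle page); everything else follows the column descriptions above.

    import pandas as pd

    # Load the daily-tweets CSV; the filename here is assumed, not confirmed.
    df = pd.read_csv(
        "chatgpt_daily_tweets.csv",
        parse_dates=["tweet_created", "tweet_extracted", "user_created"],
    )

    # tweet_id grows over time (older tweets have smaller IDs), so sorting
    # by it orders the tweets chronologically.
    df = df.sort_values("tweet_id")
    print(df[["tweet_created", "lang", "like_count"]].head())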

  2. ChatGPT Social Media Insights Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). ChatGPT Social Media Insights Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/2cf951da-3ce1-4606-a8d6-3f865c4d8a3b
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Social Media and Networking
    Description

    This dataset captures a daily collection of tweets containing keywords such as "ChatGPT", "GPT3", or "GPT4". It was designed to provide a rich source of social media data for analysis, particularly for applications concerning Natural Language Processing (NLP) and sentiment analysis. The collection process began on 3rd April 2023, with approximately 1,000 tweets added daily. Tweets were extracted 24-72 hours after creation to allow for relevant engagement metrics like likes and retweets to accumulate. However, updates to this dataset ceased on 13th May 2023, due to changes in Twitter (X) API conditions, which introduced a cost for its use. The dataset includes tweets from various languages, selected randomly throughout the day, with basic filters applied to discard sensitive content and spam.

    Columns

    • tweet_id: An integer serving as a unique identifier for each tweet. Older tweets typically have smaller IDs.
    • tweet_created: A timestamp indicating the exact time the tweet was published.
    • tweet_extracted: A UTC timestamp recording when the ETL (Extract, Transform, Load) pipeline pulled the tweet and its associated metadata (e.g., likes count, retweets count).
    • text: A string containing the raw text content of the tweet payload.
    • lang: A string providing the short name for the language of the tweet's text.
    • user_id: An integer representing the author's unique user ID on Twitter.
    • user_name: A string displaying the author's public name on Twitter.
    • user_username: A string showing the author's Twitter account username (e.g., @example).
    • user_location: A string detailing the author's publicly stated location.
    • user_description: A string containing the author's public profile biography.
    • user_created: A timestamp indicating when the user's Twitter account was created.
    • user_followers_count: An integer showing the number of followers the author's account had at the moment the tweet was extracted.
    • user_following_count: An integer indicating the number of accounts the author was following at the moment of tweet extraction.
    • user_tweet_count: An integer representing the total number of tweets the author had published at the time of tweet extraction.
    • user_verified: A boolean value (True/False) indicating if the user is verified (i.e., has a blue tick).
    • source: This column was intended to show the device or application used to publish the tweet, but currently contains only NaN values.
    • retweet_count: An integer displaying the number of times the tweet had been retweeted at the moment of extraction.
    • like_count: An integer showing the number of likes the tweet had received at the moment of extraction.
    • reply_count: An integer indicating the number of reply messages to the tweet.
    • impression_count: An integer representing the number of times the tweet had been seen at the moment of extraction.

    Distribution

    The dataset is provided in a CSV file format, generated from a Pandas DataFrame, with each row containing the tweet's text and its metadata, along with the author's information. The collection started on 3rd April 2023, adding approximately 1,000 tweets per day, and stopped updating on 13th May 2023. While specific total row counts are not available, various segments show substantial data, such as 43,000 tweets collected between 22nd September 2022 and 12th May 2023. Daily additions of 1,000 to 7,000 tweets are noted for the period of 8th April 2023 to 14th May 2023. The dataset includes unique values for over 25,000 tweet IDs, over 37,000 unique user IDs, and over 38,000 unique user locations.

    Usage

    This dataset is ideal for various data analysis and visualisation applications. It is particularly well-suited for Natural Language Processing (NLP) techniques, including sentiment analysis, to understand public opinion and trends related to ChatGPT, GPT3, and GPT4. Researchers can use it for social media listening, trend tracking, and studying the evolution of discussions around large language models.
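
    To make the sentiment-analysis use case concrete, here is a minimal sketch using NLTK's VADER analyzer on the English subset (VADER is English-only). The CSV filename is an assumption; the column names follow the descriptions above.

    import pandas as pd
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")  # one-time lexicon download

    df = pd.read_csv("chatgpt_tweets.csv")  # filename assumed
    en = df[df["lang"] == "en"].copy()      # VADER only handles English

    sia = SentimentIntensityAnalyzer()
    en["compound"] = en["text"].astype(str).map(
        lambda t: sia.polarity_scores(t)["compound"])

    # Conventional VADER thresholds: <= -0.05 negative, >= 0.05 positive.
    en["sentiment"] = pd.cut(en["compound"], bins=[-1, -0.05, 0.05, 1],
                             labels=["negative", "neutral", "positive"],
                             include_lowest=True)
    print(en["sentiment"].value_counts(normalize=True))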

    Coverage

    The dataset primarily covers tweets from 3rd April 2023 to 13th May 2023, with some older tweets included, particularly from September 2022. Tweets are from any language, randomly selected globally. English (en) tweets constitute approximately 48% of the dataset, Japanese (ja) tweets make up about 23%, and other languages account for 30%. User locations vary widely, with a significant portion (41%) being null, 1% from Japan, and the remaining 59% from various other global locations.
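
    The coverage figures above are straightforward to reproduce; the sketch below computes the language shares and the fraction of missing user locations (filename again assumed).

    import pandas as pd

    df = pd.read_csv("chatgpt_tweets.csv")  # filename assumed

    # Language shares; these should roughly match ~48% en and ~23% ja.
    print(df["lang"].value_counts(normalize=True).head(10))

    # Fraction of tweets with no public user location (~41% per the notes above).
    print(df["user_location"].isna().mean())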

    License

    CC0

    Who Can Use It

    • Data Analysts: For exploring social media trends and user engagement related to AI.
    • Researchers: Studying the public reception, discussion patterns, and sentiment around large language models.
    • Machine Learning Engineers: Developing and testing NLP models for sentiment analysis and related tasks.
  3. Dolly 15k Dutch

    • data.niaid.nih.gov
    Updated Jun 20, 2023
    Cite
    Vanroy, Bram (2023). Dolly 15k Dutch [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8054097
    Dataset updated
    Jun 20, 2023
    Dataset authored and provided by
    Vanroy, Bram
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset contains 14,934 instructions, contexts and responses across several natural-language categories such as classification, closed QA and generation. The original English dataset was created by @databricks, which crowd-sourced the data creation among its employees. The present dataset is a translation of that dataset produced with ChatGPT (gpt-3.5-turbo).

    Data Instances

    { "id": 14963, "instruction": "Wat zijn de duurste steden ter wereld?", "context": "", "response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, GenĆØve, San Francisco, Parijs en Sydney.", "category": "brainstorming" }

    Data Fields

    id: the ID of the item. The following 77 IDs are not included because they could not be translated (or were too long): [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966]

    instruction: the instruction (question)

    context: additional context that the AI can use to answer the question

    response: the AI's expected response

    category: the category of this type of question (see Dolly for more info)

    Dataset Creation

    Both the translations and the topics were generated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.

    The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.

    Here are the requirements that you should adhere to: 1. maintain the format: the task consists of a task instruction (marked instruction:), optional context to the task (marked context:) and response for the task marked with response:; 2. do not translate the identifiers instruction:, context:, and response: but instead copy them to your output; 3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias; 4. translate the instruction and context text using informal, but standard, language; 5. make sure to avoid biases (such as gender bias, grammatical bias, social bias); 6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang}; 7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is); 8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.

    Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.

    """

    The system message was:

    You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.

    Note that 77 items (0.5%) were not successfully translated. This can either mean that the prompt was too long for the given limit (max_tokens=1024) or that the generated translation could not be parsed into instruction, context and response fields. The missing IDs are [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966].
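
    The card does not include the author's client code; below is a minimal sketch of what such a translation call could look like with the current openai Python client (the 2023 run would have used the older API). The translate_item helper and the abbreviated template variable are illustrative; the full template is the one quoted above.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_MSG = ("You are a helpful assistant that translates English to Dutch "
                  "according to the requirements that are given to you.")

    # Assign the full template quoted above to this variable (abbreviated here).
    CONVERSATION_TRANSLATION_PROMPT = "You are asked to translate ... from {src_lang} to {tgt_lang}. ..."

    def translate_item(instruction: str, context: str, response: str) -> str:
        # Build the task block with the identifiers the template asks to preserve.
        task = (f"instruction: {instruction}\n"
                f"context: {context}\n"
                f"response: {response}")
        prompt = CONVERSATION_TRANSLATION_PROMPT.format(src_lang="English",
                                                        tgt_lang="Dutch")
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            max_tokens=1024,  # same limit as the original run
            temperature=0,
            messages=[{"role": "system", "content": SYSTEM_MSG},
                      {"role": "user", "content": prompt + "\n" + task}],
        )
        return completion.choices[0].message.content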

    Initial Data Collection and Normalization

    Initial data collection by databricks. See their repository for more information about this dataset.

    Considerations for Using the Data

    Note that the translations in this new dataset have not been verified by humans! Use at your own risk, both in terms of quality and biases.

    Discussion of Biases

    As with any machine-generated text, users should be aware of potential biases in this dataset. Although the prompt explicitly instructs the model to "make sure to avoid biases (such as gender bias, grammatical bias, social bias)", the effect of such an instruction is unknown. Biases likely remain in the dataset, so use it with caution.

    Other Known Limitations

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    This repository follows the original databricks license, which is CC BY-SA 3.0 but see below for a specific restriction.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub, its canonical repository.
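
    Since the dataset is on the Hugging Face hub, it can presumably be loaded with the datasets library. The repository id below is an assumption based on the author's handle; verify it against the canonical repository on the Hub.

    from datasets import load_dataset

    # Repo id is assumed, not confirmed; check the canonical repository on the Hub.
    ds = load_dataset("BramVanroy/dolly-15k-dutch", split="train")
    print(ds[0]["instruction"], ds[0]["category"])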

  4. Data from: Can we trust AI chatbots’ answers about disease diagnosis and patient care?

    • dataverse.harvard.edu
    • dataone.org
    Updated Apr 18, 2023
    Cite
    Sun Huh (2023). Can we trust AI chatbots’ answers about disease diagnosis and patient care? [Dataset]. http://doi.org/10.7910/DVN/LTKE1J
    Dataset updated
    Apr 18, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Sun Huh
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Background: Several chatbots that utilize large language models now exist. As a particularly well-known example, ChatGPT employs an autoregressive modeling process to generate responses, predicting the next word based on previously derived words. Consequently, instead of deducing a correct answer, it arranges the most frequently appearing words in the learned data in order. Optimized for interactivity and content generation, it presents a smooth and plausible context, regardless of whether the content it presents is true. This report aimed to examine the reliability of ChatGPT, an artificial intelligence (AI) chatbot, in diagnosing diseases and treating patients, how to interpret its responses, and directions for future development.

    Current Concepts: Ten published case reports from Korea were analyzed to evaluate the efficacy of ChatGPT, which was asked to describe the correct diagnosis and treatment. ChatGPT answered 3 cases correctly after being provided with the patient’s symptoms, findings, and medical history. The accuracy rate increased to 7 out of 10 after adding laboratory, pathological, and radiological results. In one case, ChatGPT did not provide appropriate information about suitable treatment, and its response contained inappropriate content in 4 cases. In contrast, ChatGPT recommended appropriate measures in 4 cases.

    Discussion and Conclusion: ChatGPT’s responses to the 10 case reports could have been better. To utilize ChatGPT efficiently and appropriately, users should possess sufficient knowledge and skills to determine the validity of its responses. AI chatbots based on large language models will progress significantly, but physicians must be vigilant in using these tools in practice.

  5. Global Public Opinion on Artificial Intelligence (GPO-AI)

    • borealisdata.ca
    Updated Mar 5, 2025
    Cite
    Blake Lee-Whiting; Peter John Loewen; Thomas Bergeron (2025). Global Public Opinion on Artificial Intelligence (GPO-AI) [Dataset]. http://doi.org/10.5683/SP3/WCUN0S
    Dataset updated
    Mar 5, 2025
    Dataset provided by
    Borealis
    Authors
    Blake Lee-Whiting; Peter John Loewen; Thomas Bergeron
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In October and November 2023, researchers at the Schwartz Reisman Institute for Technology and Society and the Policy, Elections and Representation Lab at the Munk School of Global Affairs and Public Policy at the University of Toronto completed a survey on public perceptions of and attitudes toward AI. The survey was administered to over 1,000 people in each of 21 countries, for a total of 23,882 surveys conducted in 12 languages. The combined populations of the countries sampled represent a majority of the world's population.

    Countries: Argentina, Australia, Brazil, Canada, Chile, China, France, Germany, India, Indonesia, Italy, Japan, Kenya, Mexico, Pakistan, Poland, Portugal, South Africa, Spain, United Kingdom, United States of America.

    Languages: Chinese (Simplified), English, French, German, Indonesian, Italian, Japanese, Polish, Portuguese (Portugal), Portuguese (Brazil), Spanish (Spain), Spanish (Latin America).

    The survey explored general knowledge of and attitudes toward AI. Topics included concerns about AI, safety, regulation, autonomous vehicles and AI's effect on jobs now and in the future. Participants were asked whether they are interested in or trust applications of AI for clothes, travel, grocery shopping, dating or finance. Respondents were asked about their attitudes toward the use of emerging technologies in education, the justice system, health care and immigration. Respondents were also asked about their knowledge of and experience with ChatGPT and deepfakes.

  6. Data from: AI-Powered Knowledge Base Enables Transparent Prediction of Nanozyme Multiple Catalytic Activity

    • acs.figshare.com
    Updated May 23, 2024
    Cite
    Julia Razlivina; Andrei Dmitrenko; Vladimir Vinogradov (2024). AI-Powered Knowledge Base Enables Transparent Prediction of Nanozyme Multiple Catalytic Activity [Dataset]. http://doi.org/10.1021/acs.jpclett.4c00959.s002
    Dataset updated
    May 23, 2024
    Dataset provided by
    ACS Publications
    Authors
    Julia Razlivina; Andrei Dmitrenko; Vladimir Vinogradov
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Nanozymes are unique materials with many valuable properties for applications in biomedicine, biosensing, environmental monitoring, and beyond. In this work, we developed a machine learning (ML) approach to search for new nanozymes and deployed a web platform, DiZyme, featuring a state-of-the-art database of nanozymes containing 1210 experimental samples, catalytic activity prediction, and DiZyme Assistant interface powered by a large language model (LLM). For the first time, we enable the prediction of multiple catalytic activities of nanozymes by training an ensemble learning algorithm achieving R2 = 0.75 for the Michaelis–Menten constant and R2 = 0.77 for the maximum velocity on unseen test data. We envision an accurate prediction of multiple catalytic activities (peroxidase, oxidase, and catalase) promoting novel applications for a wide range of surface-modified inorganic nanozymes. The DiZyme Assistant based on the ChatGPT model provides users with supporting information on experimental samples, such as synthesis procedures, measurement protocols, etc. DiZyme (dizyme.aicidlab.itmo.ru) is now openly available worldwide.
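
    The paper's exact ensemble algorithm and descriptors are not specified in this summary, but the evaluation it describes (fit an ensemble regressor, report R2 on unseen test data) looks roughly like the scikit-learn sketch below. The file and column names are assumptions; a random forest stands in for the unspecified ensemble learner.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    # Hypothetical file and column names; the real DiZyme schema may differ.
    df = pd.read_excel("dizyme_nanozymes.xlsx")
    X = df[["composition_feature", "size_nm", "surface_modification_code"]]
    y = df["km_michaelis_menten"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X_train, y_train)
    print("R2 on unseen test data:", r2_score(y_test, model.predict(X_test)))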

