https://creativecommons.org/publicdomain/zero/1.0/
UPDATE: Following changes to the Twitter (X) API terms, the API is no longer free to use; the cheapest (hobby) plan costs $100/month. As a result, my automated ETL notebook stopped adding new tweets to this dataset on May 13th, 2023.
This dataset was updated every day with roughly 1,000 new tweets containing any of the words "ChatGPT", "GPT3", or "GPT4", starting from April 3rd, 2023. Each day's tweets are uploaded 24-72 hours after posting, so the counters for likes, retweets, replies and impressions have enough time to become meaningful. Tweets are in any language and are sampled randomly across all hours of the day. Basic filters are applied to try to discard sensitive tweets and spam.
This dataset can be used for many applications, from data analysis and visualization to NLP sentiment analysis techniques and more.
If you found this dataset and the scheduled ETL notebook that feeds it interesting, please consider upvoting them. Thanks!
tweet_id: Integer. Unique identifier for each tweet. Older tweets have smaller IDs.
tweet_created: Timestamp. Time of the tweet's creation.
tweet_extracted: Timestamp. The UTC time when the ETL pipeline pulled the tweet and its metadata (like count, retweet count, etc.).
text: String. The raw text of the tweet.
lang: String. Short code for the language of the tweet text.
user_id: Integer. Twitter's unique user ID.
user_name: String. The author's public name on Twitter.
user_username: String. The author's Twitter handle (@example).
user_location: String. The author's self-reported public location.
user_description: String. The author's public profile bio.
user_created: Timestamp. Time of the author's account creation.
user_followers_count: Integer. Number of followers of the author's account at the moment of extraction.
user_following_count: Integer. Number of accounts the author follows at the moment of extraction.
user_tweet_count: Integer. Number of tweets the author has published at the moment of extraction.
user_verified: Boolean. True if the user is verified (blue mark).
source: The device/app used to publish the tweet (apparently not working; all values are NaN so far).
retweet_count: Integer. Number of retweets at the moment of extraction.
like_count: Integer. Number of likes at the moment of extraction.
reply_count: Integer. Number of replies to the tweet.
impression_count: Integer. Number of times the tweet has been seen at the moment of extraction.
More info:
Tweet object definition: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet
User object definition: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user
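For illustration, here is a minimal sketch of loading one of the daily CSV files with pandas; the file name chatgpt_tweets.csv is a placeholder for whichever file you downloaded from this dataset:

```python
import pandas as pd

# Placeholder file name; substitute the CSV you downloaded from this dataset.
df = pd.read_csv(
    "chatgpt_tweets.csv",
    parse_dates=["tweet_created", "tweet_extracted", "user_created"],
)

# Engagement counters are snapshots taken at extraction time (24-72 h after
# posting), so inspect them together with tweet_extracted.
print(df[["tweet_created", "lang", "like_count", "retweet_count", "impression_count"]].head())
```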
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset captures a daily collection of tweets containing keywords such as "ChatGPT", "GPT3", or "GPT4". It was designed to provide a rich source of social media data for analysis, particularly for applications concerning Natural Language Processing (NLP) and sentiment analysis. The collection process began on 3rd April 2023, with approximately 1,000 tweets added daily. Tweets were extracted 24-72 hours after creation to allow for relevant engagement metrics like likes and retweets to accumulate. However, updates to this dataset ceased on 13th May 2023, due to changes in Twitter (X) API conditions, which introduced a cost for its use. The dataset includes tweets from various languages, selected randomly throughout the day, with basic filters applied to discard sensitive content and spam.
The dataset is provided in a CSV file format, generated from a Pandas DataFrame, with each row containing the tweet's text and its metadata, along with the author's information. The collection started on 3rd April 2023, adding approximately 1,000 tweets per day, and stopped updating on 13th May 2023. While specific total row counts are not available, various segments show substantial data, such as 43,000 tweets collected between 22nd September 2022 and 12th May 2023. Daily additions of 1,000 to 7,000 tweets are noted for the period of 8th April 2023 to 14th May 2023. The dataset includes unique values for over 25,000 tweet IDs, over 37,000 unique user IDs, and over 38,000 unique user locations.
This dataset is ideal for various data analysis and visualisation applications. It is particularly well-suited for Natural Language Processing (NLP) techniques, including sentiment analysis, to understand public opinion and trends related to ChatGPT, GPT3, and GPT4. Researchers can use it for social media listening, trend tracking, and studying the evolution of discussions around large language models.
The dataset primarily covers tweets from 3rd April 2023 to 13th May 2023, with some older tweets included, particularly from September 2022. Tweets are from any language, randomly selected globally. English (en) tweets constitute approximately 48% of the dataset, Japanese (ja) tweets make up about 23%, and other languages account for 30%. User locations vary widely, with a significant portion (41%) being null, 1% from Japan, and the remaining 59% from various other global locations.
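As a sketch, shares like these can be recomputed directly from the lang and user_location columns, assuming a DataFrame df loaded as in the snippet above:

```python
# Percentage of tweets per language (e.g. en ~48%, ja ~23%).
lang_share = df["lang"].value_counts(normalize=True).mul(100).round(1)
print(lang_share.head())

# Percentage of rows with no public user location (~41% null).
print(round(df["user_location"].isna().mean() * 100, 1), "% null locations")
```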
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset contains 14,934 instructions, contexts and responses in several natural-language task categories, such as classification, closed QA, and generation. The original English dataset was created by @databricks, who crowd-sourced the data creation among its employees. The current dataset is a translation of that dataset into Dutch via ChatGPT (gpt-3.5-turbo).
Data Instances
{ "id": 14963, "instruction": "Wat zijn de duurste steden ter wereld?", "context": "", "response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, GenĆØve, San Francisco, Parijs en Sydney.", "category": "brainstorming" }
Data Fields
id: the ID of the item. The following 77 IDs are not included because they could not be translated (or were too long): [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966]
instruction: the instruction (question)
context: additional context that the AI can use to answer the question
response: the AI's expected response
category: the category of this type of question (see Dolly for more info)
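A short sketch of loading the dataset from the Hugging Face hub with the datasets library; the repository id below is a placeholder, since the card only states that the hub hosts the canonical repository:

```python
from datasets import load_dataset

# Placeholder repo id; replace with the canonical Hugging Face repository name.
ds = load_dataset("your-namespace/dolly-15k-dutch", split="train")

example = ds[0]
print(example["id"], example["category"])
print(example["instruction"])
print(example["response"])
```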
Dataset Creation
Both the translations and the topics were generated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.

Here are the requirements that you should adhere to:
1. maintain the format: the task consists of a task instruction (marked "instruction:"), optional context to the task (marked "context:") and response for the task (marked "response:");
2. do not translate the identifiers "instruction:", "context:", and "response:" but instead copy them to your output;
3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
4. translate the instruction and context text using informal, but standard, language;
5. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is);
8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.

Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The system message was:
You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.
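The card does not include the calling code itself; the following is a minimal sketch of what the call could look like with today's openai Python SDK (the original 2023 pipeline used the older openai 0.x interface, and the helper name translate_task is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_task(task_text: str, src_lang: str = "English", tgt_lang: str = "Dutch") -> str:
    # Parameters mirror the card: gpt-3.5-turbo, max_tokens=1024, temperature=0.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=1024,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that translates English "
                           "to Dutch according to the requirements that are given to you.",
            },
            {
                "role": "user",
                "content": CONVERSATION_TRANSLATION_PROMPT.format(
                    src_lang=src_lang, tgt_lang=tgt_lang
                ) + task_text,
            },
        ],
    )
    return response.choices[0].message.content
```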
Note that 77 items (0.5%) were not successfully translated. This can either mean that the prompt was too long for the given limit (max_tokens=1024) or that the generated translation could not be parsed into instruction, context and response fields. The missing IDs are [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966].
Initial Data Collection and Normalization
Initial data collection by databricks. See their repository for more information about this dataset.
Considerations for Using the Data
Note that the translations in this new dataset have not been verified by humans! Use at your own risk, both in terms of quality and biases.
Discussion of Biases
As with any machine-generated text, users should be aware of potential biases in this dataset. Although the prompt explicitly requires the model to "make sure to avoid biases (such as gender bias, grammatical bias, social bias)", the actual effect of such an instruction is not known. It is likely that biases remain in the dataset, so use it with caution.
Other Known Limitations
The translation quality has not been verified. Use at your own risk!
Licensing Information
This repository follows the original databricks license, which is CC BY-SA 3.0 but see below for a specific restriction.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI's large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub, its canonical repository.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Background: Several chatbots that utilize large language models now exist. As a particularly well-known example, ChatGPT employs an autoregressive modeling process to generate responses, predicting the next word based on previously derived words. Consequently, instead of deducing a correct answer, it arranges the most frequently appearing words in the learned data in order. Optimized for interactivity and content generation, it presents a smooth and plausible context, regardless of whether the content it presents is true. This report aimed to examine the reliability of ChatGPT, an artificial intelligence (AI) chatbot, in diagnosing diseases and treating patients, how to interpret its responses, and directions for future development.

Current Concepts: Ten published case reports from Korea were analyzed to evaluate the efficacy of ChatGPT, which was asked to describe the correct diagnosis and treatment. ChatGPT answered 3 cases correctly after being provided with the patient's symptoms, findings, and medical history. The accuracy rate increased to 7 out of 10 after adding laboratory, pathological, and radiological results. In one case, ChatGPT did not provide appropriate information about suitable treatment, and its response contained inappropriate content in 4 cases. In contrast, ChatGPT recommended appropriate measures in 4 cases.

Discussion and Conclusion: ChatGPT's responses to the 10 case reports could have been better. To utilize ChatGPT efficiently and appropriately, users should possess sufficient knowledge and skills to determine the validity of its responses. AI chatbots based on large language models will progress significantly, but physicians must be vigilant in using these tools in practice.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In October and November 2023, researchers at the Schwartz Reisman Institute for Technology and Society and the Policy, Elections and Representation Lab at the Munk School of Global Affairs and Public Policy at the University of Toronto completed a survey on public perceptions of and attitudes toward AI. The survey was administered to over 1,000 people in each of 21 countries, for a total of 23,882 surveys conducted in 12 languages. The combined populations of the countries sampled represent a majority of the world's population.

Countries: Argentina, Australia, Brazil, Canada, Chile, China, France, Germany, India, Indonesia, Italy, Japan, Kenya, Mexico, Pakistan, Poland, Portugal, South Africa, Spain, United Kingdom, United States of America.

Languages: Chinese (Simplified), English, French, German, Indonesian, Italian, Japanese, Polish, Portuguese (Portugal), Portuguese (Brazil), Spanish (Spain), Spanish (Latin America).

The survey explored general knowledge of and attitudes toward AI. Topics included concerns about AI, safety, regulation, autonomous vehicles, and AI's effect on jobs now and in the future. Participants were asked whether they are interested in or trust applications of AI for clothes, travel, grocery shopping, dating or finance. Respondents were asked about their attitudes toward the use of emerging technologies in education, the justice system, health care and immigration. Respondents were also asked about their knowledge of and experience with ChatGPT and deepfakes.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Nanozymes are unique materials with many valuable properties for applications in biomedicine, biosensing, environmental monitoring, and beyond. In this work, we developed a machine learning (ML) approach to search for new nanozymes and deployed a web platform, DiZyme, featuring a state-of-the-art database of nanozymes containing 1210 experimental samples, catalytic activity prediction, and DiZyme Assistant interface powered by a large language model (LLM). For the first time, we enable the prediction of multiple catalytic activities of nanozymes by training an ensemble learning algorithm achieving R2 = 0.75 for the Michaelis-Menten constant and R2 = 0.77 for the maximum velocity on unseen test data. We envision an accurate prediction of multiple catalytic activities (peroxidase, oxidase, and catalase) promoting novel applications for a wide range of surface-modified inorganic nanozymes. The DiZyme Assistant based on the ChatGPT model provides users with supporting information on experimental samples, such as synthesis procedures, measurement protocols, etc. DiZyme (dizyme.aicidlab.itmo.ru) is now openly available worldwide.
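The abstract does not include training code; purely as an illustration of ensemble regression on tabular descriptors, the sketch below uses synthetic features that stand in for real nanozyme descriptors and is not DiZyme's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1210 samples (the database size) with 16 made-up
# descriptors; y mimics a log-scaled Michaelis-Menten constant.
rng = np.random.default_rng(0)
X = rng.normal(size=(1210, 16))
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1210)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)
print("Held-out R2:", round(r2_score(y_test, model.predict(X_test)), 2))
```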