8 datasets found
  1. ChatGPT Statistics 2025

    • sambretzmann.com
    Updated Jun 12, 2025
    Cite
    (2025). ChatGPT Statistics 2025 [Dataset]. https://sambretzmann.com/chatgpt-statistics/
    Explore at:
    8 scholarly articles cite this dataset (view in Google Scholar)
    Dataset updated
    Jun 12, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comprehensive ChatGPT statistics covering 800 million weekly users, $300 billion valuation, market share, demographics, and technical specifications for 2025.

  2. Google Gemini Statistics By Features, Performance and AI Versions

    • enterpriseappstoday.com
    Updated Dec 20, 2023
    Cite
    EnterpriseAppsToday (2023). Google Gemini Statistics By Features, Performance and AI Versions [Dataset]. https://www.enterpriseappstoday.com/stats/google-gemini-statistics.html
    Explore at:
    Dataset updated
    Dec 20, 2023
    Dataset authored and provided by
    EnterpriseAppsToday
    License

    https://www.enterpriseappstoday.com/privacy-policy

    Time period covered
    2022 - 2032
    Area covered
    Global
    Description

    Google Gemini Statistics: In 2023, Google unveiled what it described as its most powerful AI model to date, positioning Gemini as the world's most advanced AI and claiming it surpasses GPT-4. Gemini is offered in three model sizes, each suited to tasks of different complexity. According to Google Gemini Statistics, these models can understand and solve complex problems across a wide range of domains, and Google has said it will develop the AI in a way that demonstrates how helpful it can be in daily routines.

    Editor's Choice
    • Gemini can follow natural and engaging conversations and can converse like a real person.
    • Gemini Ultra scores 90.0% on the MMLU benchmark, which tests knowledge and problem-solving across subjects including history, physics, math, law, ethics, and medicine.
    • Given a prompt about raw material, Gemini can return ideas as text or images, depending on the input.
    • Gemini outperformed GPT-4 in the majority of benchmark tests.
    • According to the report, the LLM is notable for processing multiple types of data at the same time, including video, images, computer code, and text.
    • Google refers to this development as "the Gemini Era," underscoring how significant it considers AI to be for improving daily life.
    • Gemini Ultra is the largest model and can tackle extremely complex problems.
    • Gemini models are trained on multilingual and multimodal datasets.
    • On the MMMU benchmark, Gemini Ultra outperformed GPT-4V with the following scores: Art and Design (74.2), Business (62.7), Health and Medicine (71.3), Humanities and Social Science (78.3), and Technology and Engineering (53.0).

  3. #ChatGPT 1000 Daily 🐦 Tweets

    • opendatabay.com
    .csv
    Updated Jun 18, 2025
    Cite
    Datasimple (2025). #ChatGPT 1000 Daily 🐦 Tweets [Dataset]. https://www.opendatabay.com/data/ai-ml/2cf951da-3ce1-4606-a8d6-3f865c4d8a3b
    Explore at:
    Available download formats: .csv
    Dataset updated
    Jun 18, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Social Media and Networking
    Description

    UPDATE: Due to changes in the Twitter API terms under Elon Musk, the Twitter (X) API is no longer free to use; pricing starts at $100/month for the Hobby plan. As a result, my automated ETL notebook stopped adding new tweets to this dataset on May 13th, 2023.

    This dataset was updated every day with 1,000 new tweets/day containing any of the words "ChatGPT", "GPT3", or "GPT4", starting from April 3rd, 2023. Each day's tweets are uploaded 24-72 hours later, so the counters for likes, retweets, replies, and impressions have enough time to become meaningful. Tweets are in any language and are selected randomly from all hours of the day. Some basic filters are applied to try to discard sensitive tweets and spam.

    This dataset can be used for many different applications, from data analysis and visualization to NLP sentiment analysis techniques and more.

    If you found them interesting, consider upvoting this dataset and the scheduled ETL notebook that provided new data to it every day. Thanks! 🤗

    Columns description:

    tweet_id: Integer. Unique identifier for each tweet. Older tweets have smaller IDs.

    tweet_created: Timestamp. Time of the tweet's creation.

    tweet_extracted: Timestamp. The UTC time when the ETL pipeline pulled the tweet and its metadata (likes count, retweets count, etc).

    text: String. The raw payload text from the tweet.

    lang: String. Short name for the Tweet text's language.

    user_id: Integer. Twitter's unique user id.

    user_name: String. The author's public name on Twitter.

    user_username: String. The author's Twitter account username (@example)

    user_location: String. The author's public location.

    user_description: String. The author's public profile's bio.

    user_created: Timestamp. Timestamp of user's Twitter account creation.

    user_followers_count: Integer. The number of followers of the author's account at the moment of the tweet extraction.

    user_following_count: Integer. The number of accounts the author follows at the moment of the tweet extraction.

    user_tweet_count: Integer. The number of Tweets that the author has published at the moment of the Tweet extraction.

    user_verified: Boolean. True if the user is verified (blue mark).

    source: The device/app used to publish the tweet (apparently not working; all values are NaN so far).

    retweet_count: Integer. Number of retweets of the Tweet at the moment of the Tweet extraction.

    like_count: Integer. Number of Likes to the Tweet at the moment of the Tweet extraction.

    reply_count: Integer. Number of reply messages to the Tweet.

    impression_count: Integer. Number of times the Tweet has been seen at the moment of the Tweet extraction.

    More info:
    Tweet object definition (Twitter API): https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet
    User object definition (Twitter API): https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user
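
    Given the column layout above, a minimal pandas sketch for loading the CSV and summarizing daily engagement could look like the following; the file name is a placeholder for the download obtained from the dataset page:

        # Minimal sketch: load the daily-tweets CSV and summarize engagement.
        # "chatgpt_daily_tweets.csv" is a placeholder for the downloaded file.
        import pandas as pd

        df = pd.read_csv(
            "chatgpt_daily_tweets.csv",
            parse_dates=["tweet_created", "tweet_extracted", "user_created"],
        )

        # Engagement counters described in the column glossary above.
        engagement_cols = ["retweet_count", "like_count", "reply_count", "impression_count"]

        # Daily totals of tweets and engagement.
        by_day = df.set_index("tweet_created").resample("D")
        daily = by_day[engagement_cols].sum()
        daily["tweets"] = by_day.size()
        print(daily.head())

        # Most common tweet languages.
        print(df["lang"].value_counts().head(10))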

    License

    CC0

    Original Data Source: https://www.kaggle.com/datasets/edomingo/chatgpt-1000-daily-tweets

  4. awesome-chatgpt-prompts

    • huggingface.co
    Updated Dec 15, 2023
    + more versions
    Cite
    Fatih Kadir Akın (2023). awesome-chatgpt-prompts [Dataset]. https://huggingface.co/datasets/fka/awesome-chatgpt-prompts
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 15, 2023
    Authors
    Fatih Kadir Akın
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    🧠 Awesome ChatGPT Prompts [CSV dataset]

    This is a dataset repository of Awesome ChatGPT Prompts. View all prompts on GitHub.

      License
    

    CC-0
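
    A minimal sketch for pulling these prompts with the Hugging Face datasets library; the repository ID comes from the citation above, while the column names ("act" and "prompt") are assumptions that should be checked against the dataset viewer:

        # Minimal sketch: load the prompts from the Hugging Face Hub.
        # The "act"/"prompt" column names are assumptions and may differ
        # in the current revision of the dataset.
        from datasets import load_dataset

        ds = load_dataset("fka/awesome-chatgpt-prompts", split="train")
        print(ds)  # prints the available columns and the row count

        for row in ds.select(range(3)):
            print(row.get("act"), "->", str(row.get("prompt"))[:80])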

  5. HaluEval Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Mar 17, 2024
    Cite
    Junyi Li; Xiaoxue Cheng; Wayne Xin Zhao; Jian-Yun Nie; Ji-Rong Wen (2025). HaluEval Dataset [Dataset]. https://paperswithcode.com/dataset/halueval
    Explore at:
    Dataset updated
    Mar 17, 2024
    Authors
    Junyi Li; Xiaoxue Cheng; Wayne Xin Zhao; Jian-Yun Nie; Ji-Rong Wen
    Description

    HaluEval is a large-scale hallucination evaluation benchmark designed for Large Language Models (LLMs). It provides a comprehensive collection of generated and human-annotated hallucinated samples to evaluate the performance of LLMs in recognizing hallucinations [1][2].

    Here are the key details about the HaluEval dataset:

    Purpose and overview:

    Purpose: HaluEval aims to understand the types of content and the extent to which LLMs are prone to hallucinate.

    Content: It includes both general user queries with ChatGPT responses and task-specific examples from three tasks: question answering, knowledge-grounded dialogue, and text summarization.

    Data Sources:

    For general user queries, HaluEval adopts the 52K instruction tuning dataset from Alpaca. Task-specific examples are generated based on existing task datasets (e.g., HotpotQA, OpenDialKG, CNN/Daily Mail) as seed data.

    Data Composition:

    General User Queries: 5,000 user queries paired with ChatGPT responses. Queries are selected based on low-similarity responses to identify potential hallucinations.

    Task-Specific Examples:

    30,000 examples from three tasks:
    • Question Answering: based on HotpotQA as seed data.
    • Knowledge-Grounded Dialogue: based on OpenDialKG as seed data.
    • Text Summarization: based on CNN/Daily Mail as seed data.

    Data Release:

    The dataset contains 35,000 generated and human-annotated hallucinated samples used in experiments. JSON files include:
    • qa_data.json: hallucinated QA samples.
    • dialogue_data.json: hallucinated dialogue samples.
    • summarization_data.json: hallucinated summarization samples.
    • general_data.json: human-annotated ChatGPT responses to general user queries.
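
    A minimal sketch of how the released JSON files might be inspected after cloning the GitHub repository listed in the sources below; the local path is hypothetical, and since the files may be stored as either a JSON array or JSON Lines, both cases are handled:

        # Minimal sketch: inspect hallucinated QA samples from the HaluEval release.
        # Assumes https://github.com/RUCAIBox/HaluEval has been cloned locally;
        # the path below is hypothetical.
        import json
        from pathlib import Path

        path = Path("HaluEval/data/qa_data.json")

        text = path.read_text(encoding="utf-8").strip()
        if text.startswith("["):
            samples = json.loads(text)  # plain JSON array
        else:
            samples = [json.loads(line) for line in text.splitlines() if line.strip()]  # JSON Lines

        print(f"{len(samples)} QA samples loaded")
        print(sorted(samples[0].keys()))  # inspect the actual field names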

    Sources:
    (1) HaluEval: A Hallucination Evaluation Benchmark for LLMs. https://github.com/RUCAIBox/HaluEval
    (2) jzjiao/halueval-sft · Datasets at Hugging Face. https://huggingface.co/datasets/jzjiao/halueval-sft
    (3) HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. https://aclanthology.org/2023.emnlp-main.397/
    (4) HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. https://arxiv.org/abs/2305.11747

  6. Customer Shopping Trends Dataset

    • kaggle.com
    Updated Oct 5, 2023
    + more versions
    Cite
    Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sourav Banerjee
    Description

    Context

    The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this is a synthetic dataset created for beginners to learn more about data analysis and machine learning.

    Content

    This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

    Dataset Glossary (Column-wise)

    • Customer ID - Unique identifier for each customer
    • Age - Age of the customer
    • Gender - Gender of the customer (Male/Female)
    • Item Purchased - The item purchased by the customer
    • Category - Category of the item purchased
    • Purchase Amount (USD) - The amount of the purchase in USD
    • Location - Location where the purchase was made
    • Size - Size of the purchased item
    • Color - Color of the purchased item
    • Season - Season during which the purchase was made
    • Review Rating - Rating given by the customer for the purchased item
    • Subscription Status - Indicates if the customer has a subscription (Yes/No)
    • Shipping Type - Type of shipping chosen by the customer
    • Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)
    • Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)
    • Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction
    • Payment Method - Customer's most preferred payment method
    • Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)

    Structure of the Dataset

    Dataset structure overview: https://i.imgur.com/6UEqejq.png
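
    A minimal sketch, assuming the CSV has been downloaded locally under a placeholder file name, showing how the glossary columns above can be summarized with pandas:

        # Minimal sketch: summarize the synthetic shopping dataset with pandas.
        # "shopping_trends.csv" is a placeholder for the downloaded file, and the
        # column names follow the glossary above (verify them against the CSV header).
        import pandas as pd

        df = pd.read_csv("shopping_trends.csv")
        print(df.shape)  # expected: 3900 records, 18 columns per the description

        # Average purchase amount and review rating per product category.
        summary = (
            df.groupby("Category")[["Purchase Amount (USD)", "Review Rating"]]
              .mean()
              .round(2)
              .sort_values("Purchase Amount (USD)", ascending=False)
        )
        print(summary)

        # Share of purchases with a discount applied, by subscription status.
        discount_rate = (
            df.assign(discounted=df["Discount Applied"].eq("Yes"))
              .groupby("Subscription Status")["discounted"]
              .mean()
        )
        print(discount_rate)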

    Acknowledgement

    This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

    Cover Photo by: Freepik

    Thumbnail by: Clothing icons created by Flat Icons - Flaticon

  7. WildChat

    • huggingface.co
    Updated Jul 23, 2024
    Cite
    Ai2 (2024). WildChat [Dataset]. https://huggingface.co/datasets/allenai/WildChat
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Dataset Card for WildChat

      Note: a newer version with 1 million conversations and demographic information can be found here.
    
    
    
    
    
      Dataset Description
    

    Paper: https://arxiv.org/abs/2405.01470

    Interactive Search Tool: https://wildvisualizer.com (paper)

    License: ODC-BY

    Language(s) (NLP): multi-lingual

    Point of Contact: Yuntian Deng

      Dataset Summary
    

    WildChat is a collection of 650K conversations between human users and ChatGPT. We collected WildChat… See the full description on the dataset page: https://huggingface.co/datasets/allenai/WildChat.
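
    A minimal sketch for streaming a few conversations from the Hugging Face Hub; the repository ID comes from the citation above, the record schema is not hard-coded, and accepting the ODC-BY terms on the dataset page (plus an authenticated login) may be required:

        # Minimal sketch: stream a few WildChat records from the Hugging Face Hub.
        # Accepting the license on the dataset page and running
        # `huggingface-cli login` may be required before this works.
        from datasets import load_dataset

        ds = load_dataset("allenai/WildChat", split="train", streaming=True)

        for example in ds:
            print(sorted(example.keys()))  # inspect the actual schema
            break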

  8. Data Retention Period Disclosures in Privacy Policies

    • data.mendeley.com
    Updated Jul 7, 2024
    + more versions
    Cite
    David Rodríguez (2024). Data Retention Period Disclosures in Privacy Policies [Dataset]. http://doi.org/10.17632/c4x958pzpm.2
    Explore at:
    Dataset updated
    Jul 7, 2024
    Authors
    David Rodríguez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    115 privacy policies from the OPP-115 corpus have been re-annotated with the specific data retention periods disclosed, in line with the GDPR requirements of Art. 13(2)(a). Those retention periods have been categorized into the following six distinct cases:

    • C0: No data retention period is indicated in the privacy policy/segment.
    • C1: A specific data retention period is indicated (e.g., days, weeks, months...).
    • C2: It is indicated that the data will be stored indefinitely.
    • C3: A criterion is given from which a defined period of data storage can be inferred (e.g., as long as the user has an active account).
    • C4: It is indicated that personal data will be stored for an unspecified period, for fraud prevention, legal, or security reasons.
    • C5: It is indicated that personal data will be stored for an unspecified period, for purposes other than fraud prevention, legal, or security reasons.
    Note: If the privacy policy or segment accounts for more than one case, the case with the highest value was annotated (e.g., if cases C2 and C4 both apply, C4 is annotated).

    Then, the ground truth dataset served as validation for our proposed ChatGPT-based method, the results of which have also been included in this dataset.

    Columns description:
    - policy_id: ID of the policy in the OPP-115 dataset
    - policy_name: Domain of the privacy policy
    - policy_text: Privacy policy collected at the time of OPP-115 dataset creation
    - info_type_value: Type of personal data to which the data retention refers
    - retention_period: Period of retention annotated by the OPP-115 annotators
    - actual_case: Our annotated case, ranging from C0 to C5
    - GPT_case: ChatGPT's classification of the case identified in the segment
    - actual_Comply_GDPR: Boolean; True if the policy apparently complies with GDPR (cases C1-C5), False if not (case C0)
    - GPT_Comply_GDPR: Boolean; True if, according to ChatGPT's classification, the policy apparently complies with GDPR (cases C1-C5), False if not (case C0)
    - paragraphs_retention_period: List containing the paragraphs annotated as Data Retention by the OPP-115 annotators, with the relevant information used for our annotation decision marked in red text
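
    Given the column list above, here is a minimal sketch for comparing ChatGPT's classifications against the ground-truth annotations; the file name is a placeholder for the CSV or spreadsheet exported from the Mendeley record:

        # Minimal sketch: measure agreement between ChatGPT's case labels and the
        # ground-truth annotations. "retention_periods.csv" is a placeholder name;
        # the column names follow the description above.
        import pandas as pd

        df = pd.read_csv("retention_periods.csv")

        # Exact agreement on the C0-C5 case label.
        case_accuracy = (df["GPT_case"] == df["actual_case"]).mean()
        print(f"Case-level agreement: {case_accuracy:.1%}")

        # Agreement on the binary GDPR-compliance flag (cases C1-C5 vs. C0).
        gdpr_accuracy = (df["GPT_Comply_GDPR"] == df["actual_Comply_GDPR"]).mean()
        print(f"GDPR-compliance agreement: {gdpr_accuracy:.1%}")

        # Confusion matrix of annotated vs. predicted cases.
        print(pd.crosstab(df["actual_case"], df["GPT_case"]))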

