23 datasets found
  1. AI Financial Market Data

    • kaggle.com
    Updated Aug 6, 2025
    Cite
    Data Science Lovers (2025). AI Financial Market Data [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/ai-financial-and-market-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data Science Lovers
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    📹Project Video available on YouTube - https://youtu.be/WmJYHz_qn5s

    Realistic Synthetic AI Financial & Market Data for Gemini (Google), ChatGPT (OpenAI), and Llama (Meta)

    This dataset provides a synthetic daily record of financial market activity for companies involved in Artificial Intelligence (AI). It captures key financial metrics and events that could influence a company's stock performance, such as Meta's launch of Llama, OpenAI's launch of GPT, and Google's launch of Gemini. It records how much each company spends on R&D for its AI products and services, and how much revenue those products generate. The data covers January 1, 2015 to December 31, 2024 and includes information for three companies: OpenAI, Google, and Meta.

    The data is available as a CSV file, which we analyze using pandas DataFrames.

    This analysis will be useful for anyone working in the finance or stock market domain.

    In our project, we extract the following insights from this dataset using Python:

    1) How much did each company spend on R&D?

    2) Revenue Earned by the companies

    3) Date-wise Impact on the Stock

    4) Events when Maximum Stock Impact was observed

    5) AI Revenue Growth of the companies

    6) Correlation between the columns

    7) Expenditure vs Revenue year-by-year

    8) Event Impact Analysis

    9) Change in the index with respect to year and company

    These are the main features/columns available in the dataset (a short pandas sketch follows the list):

    1) Date: This column indicates the specific calendar day for which the financial and AI-related data is recorded. It allows for time-series analysis of the trends and impacts.

    2) Company: This column specifies the name of the company to which the data in that particular row belongs. Examples include "OpenAI" and "Meta".

    3) R&D_Spending_USD_Mn: This column represents the Research and Development (R&D) spending of the company, measured in Millions of USD. It serves as an indicator of a company's investment in innovation and future growth, particularly in the AI sector.

    4) AI_Revenue_USD_Mn: This column denotes the revenue generated specifically from AI-related products or services, also measured in Millions of USD. This metric highlights the direct financial success derived from AI initiatives.

    5) AI_Revenue_Growth_%: This column shows the percentage growth of AI-related revenue for the company on a daily basis. It indicates the pace at which a company's AI business is expanding or contracting.

    6) Event: This column captures any significant events or announcements made by the company that could potentially influence its financial performance or market perception. Examples include "Cloud AI launch," "AI partnership deal," "AI ethics policy update," and "AI speech recognition release." These events are crucial for understanding sudden shifts in stock impact.

    7) Stock_Impact_%: This column quantifies the percentage change in the company's stock price on a given day, likely in response to the recorded financial metrics or events. It serves as a direct measure of market reaction.
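
    A minimal pandas sketch of a few of the insights listed above (the CSV file name is an assumption; the column names follow the feature glossary):

     import pandas as pd

     # File name is assumed; use the actual name of the downloaded CSV.
     df = pd.read_csv("ai_financial_market_data.csv", parse_dates=["Date"])

     # 1) Total R&D spending per company (USD millions).
     print(df.groupby("Company")["R&D_Spending_USD_Mn"].sum())

     # 7) Expenditure vs revenue, year by year.
     yearly = (
         df.assign(Year=df["Date"].dt.year)
           .groupby(["Year", "Company"])[["R&D_Spending_USD_Mn", "AI_Revenue_USD_Mn"]]
           .sum()
     )
     print(yearly)

     # 4) Events with the largest absolute stock impact.
     events = df[df["Event"].notna()].assign(AbsImpact=lambda d: d["Stock_Impact_%"].abs())
     print(events.nlargest(10, "AbsImpact")[["Date", "Company", "Event", "Stock_Impact_%"]])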

  2. chatgpt-paraphrases

    • huggingface.co
    Updated Mar 17, 2023
    Cite
    Humarin (2023). chatgpt-paraphrases [Dataset]. https://huggingface.co/datasets/humarin/chatgpt-paraphrases
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 17, 2023
    Dataset authored and provided by
    Humarin
    License

    OpenRAIL: https://choosealicense.com/licenses/openrail/

    Description

    This is a dataset of paraphrases created by ChatGPT. A model based on this dataset is available: model

      We used this prompt to generate paraphrases
    

    Generate 5 similar paraphrases for this question, show it like a numbered list without commentaries: {text}

    This dataset is based on the Quora paraphrase questions, texts from SQuAD 2.0, and the CNN news dataset. We generated 5 paraphrases for each sample, so in total the dataset has about 420k data rows. You can make 30 rows from a row from… See the full description on the dataset page: https://huggingface.co/datasets/humarin/chatgpt-paraphrases.
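
    The "30 rows from a row" arithmetic follows from treating the original text and its 5 paraphrases as 6 interchangeable texts: every ordered pair of distinct texts is one training pair, giving 6 × 5 = 30. A small illustrative sketch (the example texts are invented):

     from itertools import permutations

     # One dataset row: a source question plus its 5 ChatGPT paraphrases (invented values).
     row = {
         "text": "How do I learn Python quickly?",
         "paraphrases": [
             "What is the fastest way to learn Python?",
             "How can I pick up Python fast?",
             "What's a quick way to master Python?",
             "How do I get good at Python in a short time?",
             "How can I learn Python in the shortest time?",
         ],
     }

     # Original + 5 paraphrases = 6 texts; each ordered pair of distinct texts is a pair.
     texts = [row["text"]] + row["paraphrases"]
     pairs = list(permutations(texts, 2))
     print(len(pairs))  # 30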

  3. SaaS Subscription & Churn Analytics Dataset

    • kaggle.com
    Updated Jul 21, 2025
    Cite
    Rivalytics (2025). SaaS Subscription & Churn Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/rivalytics/saas-subscription-and-churn-analytics-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rivalytics
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    RavenStack is a fictional AI-powered collaboration platform used to simulate a real-world SaaS business. This simulated dataset was created using Python and ChatGPT specifically for people learning data analysis, business intelligence, or data science. It offers a realistic environment to practice SQL joins, cohort analysis, churn modeling, revenue tracking, and support analytics using a multi-table relational structure.

    The dataset spans 5 CSV files:

    • accounts.csv – customer metadata

    • subscriptions.csv – subscription lifecycles and revenue

    • feature_usage.csv – daily product interaction logs

    • support_tickets.csv – support activity and satisfaction scores

    • churn_events.csv – churn dates, reasons, and refund behaviors

    Users can explore trial-to-paid conversion, MRR trends, upgrade funnels, feature adoption, support patterns, churn drivers, and reactivation cycles. The dataset supports temporal and cohort analyses, and has built-in edge cases for testing real-world logic.
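
    As one example of the cohort analyses the dataset supports, here is a minimal pandas sketch of monthly signup cohorts and their churn rate; the join key and column names (account_id, signup_date, churn_date) are assumptions, not a documented schema:

     import pandas as pd

     accounts = pd.read_csv("accounts.csv", parse_dates=["signup_date"])
     churn = pd.read_csv("churn_events.csv", parse_dates=["churn_date"])

     # Left join keeps accounts that never churned (churn_date stays NaT).
     df = accounts.merge(churn[["account_id", "churn_date"]], on="account_id", how="left")

     # Monthly signup cohorts and the share of each cohort that churned.
     df["cohort"] = df["signup_date"].dt.to_period("M")
     cohort_churn = df.groupby("cohort").agg(
         accounts=("account_id", "size"),
         churned=("churn_date", lambda s: s.notna().sum()),
     )
     cohort_churn["churn_rate"] = cohort_churn["churned"] / cohort_churn["accounts"]
     print(cohort_churn)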

  4. AI-Driven Mental Health Literacy - An Interventional Study from India...

    • psycharchives.org
    Updated Oct 2, 2023
    + more versions
    Cite
    (2023). AI-Driven Mental Health Literacy - An Interventional Study from India (Codebook for the data).csv [Dataset]. https://psycharchives.org/handle/20.500.12034/8771
    Explore at:
    Dataset updated
    Oct 2, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    The dataset is from an Indian study that used ChatGPT, a natural language processing model by OpenAI, to design a mental health literacy intervention for college students. Prompt engineering tactics were used to formulate prompts that acted as anchors in the conversations with the AI agent regarding mental health. The intervention lasted 20 days, with sessions of 15-20 minutes on alternate days. Fifty-one students completed pre-test and post-test measures of mental health literacy, mental help-seeking attitude, stigma, mental health self-efficacy, positive and negative experiences, and flourishing in the main study, which were then analyzed using paired t-tests. The results suggest that the intervention is effective among college students, as statistically significant changes were noted in mental health literacy and mental health self-efficacy scores. The study affirms the practicality, acceptance, and initial indications of AI-driven methods in advancing mental health literacy and suggests the promising prospects of innovative platforms such as ChatGPT within the field of applied positive psychology. A codebook for the dataset is provided.
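
    The pre-test/post-test comparison described above corresponds to a paired t-test. A minimal sketch, assuming hypothetical column names (consult the provided codebook for the real ones):

     import pandas as pd
     from scipy.stats import ttest_rel

     # File and column names are hypothetical placeholders.
     df = pd.read_csv("mhl_intervention.csv")

     t_stat, p_value = ttest_rel(df["mhl_pretest"], df["mhl_posttest"])
     print(f"Mental health literacy: t = {t_stat:.2f}, p = {p_value:.4f}")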

  5. Prompts generated from ChatGPT3.5, ChatGPT4, LLama3-8B, and Mistral-7B with...

    • zenodo.org
    • portaldelaciencia.uva.es
    • +1 more
    bin
    Updated Nov 16, 2024
    Cite
    Martínez Gonzalo; Hernández José Alberto; Conde Javier; Reviriego Pedro; Merino Elena (2024). Prompts generated from ChatGPT3.5, ChatGPT4, LLama3-8B, and Mistral-7B with NYT and HC3 topics in different roles and parameters configurations [Dataset]. http://doi.org/10.5281/zenodo.11121394
    Explore at:
    bin (available download formats)
    Dataset updated
    Nov 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martínez Gonzalo; Hernández José Alberto; Conde Javier; Reviriego Pedro; Merino Elena
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    Prompts generated from ChatGPT3.5, ChatGPT4, Llama3-8B, and Mistral-7B with NYT and HC3 topics in different roles and parameter configurations.

    The dataset is useful to study lexical aspects of LLMs with different parameters/roles configurations.

    • The 0_Base_Topics.xlsx file lists the topics used for the dataset generation
    • The rest of the files collect the answers of ChatGPT to these topics with different configurations of parameters/context (see the API sketch after this list):
      • Temperature (parameter): Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
      • Frequency penalty (parameter): Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
      • Top probability (parameter): An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.
      • Presence penalty (parameter): Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
      • Roles (context)
        • Default: No role is assigned to the LLM, the default role is used.
        • Child: The LLM is requested to answer as a five-year-old child.
        • Young adult male: The LLM is requested to answer as a young male adult.
        • Young adult female: The LLM is requested to answer as a young female adult.
        • Elderly adult male: The LLM is requested to answer as an elderly male adult.
        • Elderly adult female: The LLM is requested to answer as an elderly female adult.
        • Affluent adult male: The LLM is requested to answer as an affluent male adult.
        • Affluent adult female: The LLM is requested to answer as an affluent female adult.
        • Lower-class adult male: The LLM is requested to answer as a lower-class male adult.
        • Lower-class adult female: The LLM is requested to answer as a lower-class female adult.
        • Erudite: The LLM is requested to answer as an erudite who uses a rich vocabulary.
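
    A minimal sketch of how one of these configurations might be queried through the OpenAI chat API (the model name, topic wording, and role phrasing are illustrative, and the current Python client is shown rather than the one available when the dataset was built):

     from openai import OpenAI

     client = OpenAI()  # reads OPENAI_API_KEY from the environment

     # One illustrative configuration: "child" role with a higher temperature.
     response = client.chat.completions.create(
         model="gpt-3.5-turbo",
         messages=[
             {"role": "system", "content": "Answer as a five-year-old child."},
             {"role": "user", "content": "Tell me about the New York City subway."},
         ],
         temperature=0.8,        # higher values -> more random output
         top_p=1.0,              # nucleus sampling probability mass
         frequency_penalty=0.0,  # -2.0 to 2.0; penalizes frequent tokens
         presence_penalty=0.0,   # -2.0 to 2.0; penalizes already-present tokens
     )
     print(response.choices[0].message.content)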

    Paper

    @article{10.1145/3696459,
    author = {Mart\'{\i}nez, Gonzalo and Hern\'{a}ndez, Jos\'{e} Alberto and Conde, Javier and Reviriego, Pedro and Merino-G\'{o}mez, Elena},
    title = {Beware of Words: Evaluating the Lexical Diversity of Conversational LLMs using ChatGPT as Case Study},
    year = {2024},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    issn = {2157-6904},
    url = {https://doi.org/10.1145/3696459},
    doi = {10.1145/3696459},
    note = {Just Accepted},
    journal = {ACM Trans. Intell. Syst. Technol.},
    month = sep,
    keywords = {LLM, Lexical diversity, ChatGPT, Evaluation}
    }

  6. awesome-chatgpt-prompts

    • huggingface.co
    Updated Dec 15, 2023
    + more versions
    Cite
    Fatih Kadir Akın (2023). awesome-chatgpt-prompts [Dataset]. https://huggingface.co/datasets/fka/awesome-chatgpt-prompts
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 15, 2023
    Authors
    Fatih Kadir Akın
    License

    CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/

    Description

    🧠 Awesome ChatGPT Prompts [CSV dataset]

    This is a dataset repository of Awesome ChatGPT Prompts. View All Prompts on GitHub.

      License
    

    CC-0

  7. Dolly 15k Dutch

    • zenodo.org
    • huggingface.co
    • +1 more
    bin
    Updated Jun 20, 2023
    Cite
    Bram Vanroy (2023). Dolly 15k Dutch [Dataset]. http://doi.org/10.57967/hf/0785
    Explore at:
    bin (available download formats)
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bram Vanroy
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset contains 14,934 instructions, contexts and responses, in several natural language categories such as classification, closed QA, generation, etc. The English original dataset was created by @databricks, who crowd-sourced the data creation via its employees. The current dataset is a translation of that dataset through ChatGPT (gpt-3.5-turbo).

    Data Instances

    {
     "id": 14963,
     "instruction": "Wat zijn de duurste steden ter wereld?",
     "context": "",
     "response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, Genève, San Francisco, Parijs en Sydney.",
     "category": "brainstorming"
    }
    

    Data Fields

    • id: the ID of the item. The following 77 IDs are not included because they could not be translated (or were too long): [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966]
    • instruction: the instruction (question)
    • context: additional context that the AI can use to answer the question
    • response: the AI's expected response
    • category: the category of this type of question (see Dolly for more info)

    Dataset Creation

    Both the dataset items and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.

    The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.
    
    Here are the requirements that you should adhere to:
    1. maintain the format: the task consists of a task instruction (marked `instruction: `), optional context to the task (marked `context: `) and response for the task marked with `response: `;
    2. do not translate the identifiers `instruction: `, `context: `, and `response: ` but instead copy them to your output;
    3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
    4. translate the instruction and context text using informal, but standard, language;
    5. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
    6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
    7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is);
    8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.
    
    Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.
    
    """
    

    The system message was:

    You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.
    

    Note that 77 items (0.5%) were not successfully translated. This can either mean that the prompt was too long for the given limit (max_tokens=1024) or that the generated translation could not be parsed into instruction, context and response fields. The missing IDs are [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966].
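
    A minimal sketch of what one translation call might have looked like, using the stated parameters, the system message, and an abbreviated version of the prompt template above (the modern OpenAI Python client is shown; the original work used the 2023-era API):

     from openai import OpenAI

     client = OpenAI()

     # Abbreviated stand-in for the full template quoted above.
     CONVERSATION_TRANSLATION_PROMPT = (
         "You are asked to translate a task's instruction, optional context to the task, "
         "and the response to the task, from {src_lang} to {tgt_lang}.\n"
         "(... requirements 1-8 as listed above ...)\n"
         "Now translate the following task with the requirements set out above. "
         "Do not provide an explanation and do not add anything else.\n\n"
     )

     task = "instruction: What are the most expensive cities?\ncontext: \nresponse: ..."

     response = client.chat.completions.create(
         model="gpt-3.5-turbo",
         messages=[
             {"role": "system", "content": "You are a helpful assistant that translates "
              "English to Dutch according to the requirements that are given to you."},
             {"role": "user", "content": CONVERSATION_TRANSLATION_PROMPT.format(
                 src_lang="English", tgt_lang="Dutch") + task},
         ],
         max_tokens=1024,
         temperature=0,
     )
     print(response.choices[0].message.content)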

    Initial Data Collection and Normalization

    Initial data collection by databricks. See their repository for more information about this dataset.

    Considerations for Using the Data

    Note that the translations in this new dataset have not been verified by humans! Use at your own risk, both in terms of quality and biases.

    Discussion of Biases

    As with any machine-generated text, users should be aware of potential biases in this dataset. Although the prompt specifically includes "make sure to avoid biases (such as gender bias, grammatical bias, social bias)", the impact of such a command is unknown. It is likely that biases remain in the dataset, so use it with caution.

    Other Known Limitations

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    This repository follows the original databricks license, which is CC BY-SA 3.0, but see below for a specific restriction.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub, its canonical repository.

  8. Monarch Butterfly Detector Dataset

    • universe.roboflow.com
    zip
    Updated Jun 11, 2023
    Cite
    Scott Cole (2023). Monarch Butterfly Detector Dataset [Dataset]. https://universe.roboflow.com/scott-cole-a3ty4/monarch-butterfly-detector/model/1
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 11, 2023
    Dataset authored and provided by
    Scott Cole
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Monarch Butterfly Bounding Boxes
    Description

    Monarch Butterfly Detector

    The Monarch Butterfly Detector is an advanced computer vision model that detects and localizes Monarch butterflies in images. With its cutting-edge technology and high accuracy, this model opens up exciting possibilities for biodiversity monitoring, migration studies, citizen science projects, identification guides, and environmental education.

    Key Features

    • Accurate Detection: The Monarch Butterfly Detector utilizes state-of-the-art computer vision algorithms to accurately identify and localize Monarch butterflies within images.

    • Versatile Use Cases: This powerful model has diverse applications, ranging from scientific research and conservation efforts to citizen science projects and environmental education programs.

    • Easy Integration: The Monarch Butterfly Detector can be seamlessly integrated into existing platforms, apps, or websites, making it accessible to many users and enabling them to contribute effortlessly to butterfly monitoring.

    • User-Friendly Interface: We provide a user-friendly interface/API that allows users to easily interact with the model, upload images, and obtain instant results.

    For detailed documentation and tutorials on using Roboflow, refer to https://docs.roboflow.com/

    Contribution Guidelines

    We welcome contributions from the open-source community to enhance the Monarch Butterfly Detector. If you're interested in contributing, please follow the guidelines outlined in [CONTRIBUTING.md] and submit your pull requests.

    License

    This project is licensed under the [Roboflow License]. For more information, see the [LICENSE] file provided by Roboflow.

    Contact Information

    For any questions, suggestions, or collaborations, please reach out to us at savetheworld at 150left.com

    Congratulations if you have made it this far. 🥳

    🎁🎁🎁I thought I would use another piece of AI to get some more content about this detector I created. Here is the dialog I had with ChatGPT.

    Scott: ChatGPT, can you make my readers drool over my Monarch Butterfly Detector model?

    ChatGPT: Here are 10 captivating outro or call-to-action suggestions to encourage users to try out the Monarch Butterfly Detector and contribute to its success:

    "Join us in preserving the delicate beauty of Monarch butterflies. Try out the Monarch Butterfly Detector today and become a guardian of these magnificent creatures."

    "Unlock the secrets of Monarch butterflies with our cutting-edge detection model. Experience the wonders of computer vision and contribute to vital research efforts by using the Monarch Butterfly Detector."

    "Make a difference in conservation by being a part of the Monarch Butterfly Detector community. Help protect these iconic insects and their habitats with the power of AI and computer vision."

    "Calling all nature enthusiasts and citizen scientists! Contribute to the Monarch Butterfly Detector project and help us gather valuable data on Monarch butterflies. Together, we can make a lasting impact."

    "Explore the hidden world of Monarch butterflies with the Monarch Butterfly Detector. Whether you're a researcher, educator, or simply curious, join us on this journey and unravel the mysteries of these captivating creatures."

    "Become a Monarch detective! Unleash the potential of computer vision and aid in the conservation of Monarch butterflies. Download the Monarch Butterfly Detector and help us protect these extraordinary pollinators."

    "Attention conservationists and researchers! The Monarch Butterfly Detector needs your expertise. Contribute your insights and collaborate with us to further our understanding of Monarch butterflies and support their conservation."

    "Empower yourself with AI and contribute to science. With the Monarch Butterfly Detector, you can be a part of groundbreaking research on Monarch butterflies. Start detecting, start making a difference."

    "Inspire the next generation of environmentalists. Use the Monarch Butterfly Detector as an educational tool and engage students in the wonders of computer vision and the importance of protecting our natural world."

    "Experience the thrill of discovery with the Monarch Butterfly Detector. Join a vibrant community of nature enthusiasts and scientists, and together, let's unlock new insights into the fascinating lives of Monarch butterflies."

    As you can see, AI certainly likes my idea.

  9. Datasets and scripts related to the paper: "*Can Generative AI Help us in...

    • zenodo.org
    zip
    Updated Jul 30, 2024
    Cite
    Anonymous Anonymous (2024). Datasets and scripts related to the paper: "*Can Generative AI Help us in Qualitative Software Engineering?*" [Dataset]. http://doi.org/10.5281/zenodo.13134104
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This replication package contains datasets and scripts related to the paper: "*Can Generative AI Help us in Qualitative Software Engineering?*"
    The replication package is organized into two directories:
    - `manual_analysis`: This directory contains all sheets used to perform the manual analysis for RQ1, RQ2, and RQ3.
    - `stats`: This directory contains all datasets, scripts, and results metrics used for the quantitative analyses of RQ1 and RQ2.
    In the following, we describe the content of each directory:
    ## manual_analysis
    - `manual_analysis_rq1`: This directory contains all sheets used to perform manual analysis for RQ1 (independent and incremental coding).
    - The sub-directory `incremental_coding` contains .csv files for all datasets (`DL_Faults_COMMIT_incremental.csv`, `DL_Faults_ISSUE_incremental.csv`, `DL_Fault_SO_incremental.csv`, `DRL_Challenges_incremental.csv` and `Functional_incremental.csv`). All these .csv files contain the following columns:
    - *Link*: The link to the instances
    - *Prompt*: Prompt used as input to GPT-4-Turbo
    - *ID*: Instance ID
    - *FinalTag*: Tag assigned by the human in the original paper
    - *Chatgpt_output_memory*: Output of GPT-4-Turbo with incremental coding
    - *Chatgpt_output_memory_clean*: (only for the DL Faults datasets) output of GPT-4-Turbo considering only the label assigned, excluding the text
    - *Author1*: Label assigned by the first author
    - *Author2*: Label assigned by the second author
    - *FinalOutput*: Label assigned after the resolution of the conflicts
    - The sub-directory `independent_coding` contains .csv files for all datasets (`DL_Faults_COMMIT_independent.csv`, `DL_Faults_ISSUE_independent.csv`, `DL_Fault_SO_independent.csv`, `DRL_Challenges_independent.csv` and `Functional_independent.csv`), containing the following columns:
    - *Link*: The link to the instances
    - *Prompt*: Prompt used as input to GPT-4-Turbo
    - *ID*: Specific ID for the instance
    - *FinalTag*: Tag assigned by the human in the original paper
    - *Chatgpt_output*: Output of GPT-4-Turbo with independent coding
    - *Chatgpt_output_clean*: (only for DL Faults datasets) output of GPT-4-Turbo considering only the label assigned, excluding the text
    - *Author1*: Label assigned by the first author
    - *Author2*: Label assigned by the second author
    - *FinalOutput*: Label assigned after the resolution of the conflicts.
    - Also, the sub-directory contains sheets with inconsistencies after resolving conflicts. The directory `inconsistency_incremental_coding` contains .csv files with the following columns:
    - *Dataset*: The dataset considered
    - *Human*: The label assigned by the human in the original paper
    - *Machine*: The label assigned by GPT-4-Turbo
    - *Classification*: The final label assigned by the authors after resolving the conflicts. Multiple classifications for a single instance are separated by a comma “,”
    - *Final*: final label assigned after the resolution of the incompatibilities
    - Similarly, the sub-directory `inconsistency_independent_coding` contains a .csv file with the same columns as before, but this is for the case of independent coding.
    - `manual_analysis_rq2`: This directory contains .csv files for all datasets (`DL_Faults_redundant_tag.csv`, `DRL_Challenges_redundant_tag.csv`, `Functional_redundant_tag.csv`) to perform manual analysis for RQ2.
    - The `DL_Faults_redundant_tag.csv` file contains the following columns:
    - *Tags Redundant*: tags identified as redundant by GPT-4-Turbo
    - *Matched*: inspection by the authors to check whether the redundant tags match
    - *FinalTag*: final tag assigned by the authors after the resolution of the conflict
    - The `Functional_redundant_tag.csv` file contains the same columns as before
    - The `DRL_Challenges_redundant_tag.csv` file is organized as follows:
    - *Tags Suggested*: The final tag suggested by GPT-4-Turbo
    - *Tags Redundant*: tags identified as redundant by GPT-4-Turbo
    - *Matched*: inspection by the authors to check whether the redundant tags match the suggested tags
    - *FinalTag*: final tag assigned by the authors after the resolution of the conflict
    - The sub-directory `code_consolidation_mapping_overview` contains .csv files (`DL_Faults_rq2_overview.csv`, `DRL_Challenges_rq2_overview.csv`, `Functional_rq2_overview.csv`) organized as follows:
    - *Initial_Tags*: list of the unique initial tags assigned by GPT-4-Turbo for each dataset
    - *Mapped_tags*: list of tags mapped by GPT-4-Turbo
    - *Unmatched_tags*: list of unmatched tags by GPT-4-Turbo
    - *Aggregating_tags*: list of consolidated tags
    - *Final_tags*: list of final tags after the consolidation task
    ## stats
    - `RQ1`: contains script and datasets used to perform metrics for RQ1. The analysis calculates all possible combinations between Matched, More Abstract, More Specific, and Unmatched.
    - `RQ1_Stats.ipynb` is a Python Jupyter notebook to compute the RQ1 metrics. To use it, as explained in the notebook, it is necessary to change the values of variables contained in the first code block.
    - `independent-prompting`: Contains the datasets related to the independent prompting. Each line contains the following fields:
    - *Link*: Link to the artifact being tagged
    - *Prompt*: Prompt sent to GPT-4-Turbo
    - *FinalTag*: Artifact coding from the replicated study
    - *chatgpt_output_text*: GPT-4-Turbo output
    - *chatgpt_output*: Codes parsed from the GPT-4-Turbo output
    - *Author1*: Annotator 1 evaluation of the coding
    - *Author2*: Annotator 2 evaluation of the coding
    - *FinalOutput*: Consolidated evaluation
    - `incremental-prompting`: Contains the datasets related to the incremental prompting (same format as independent prompting)
    - `results`: contains files for the RQ1 quantitative results. The files are named `RQ1\_<
    - `RQ2`: contains the script used to perform metrics for RQ2, the datasets it uses, and its output.
    - `RQ2_SetStats.ipynb` is the Python Jupyter notebook to perform the analyses. The script takes as input the following types of files:
    - RQ1 Data Files (`RQ1_DLFaults_Issues.csv`, `RQ1_DLFaults_Commits.csv`, and `RQ1_DLFaults_SO.csv`, joined in a single .csv `RQ1_DLFaults.csv`). These are the same files used in RQ1.
    - Mapping Files (`RQ2_Mappings_DRL.csv`, `RQ2_Mappings_Functional.csv`, `RQ2_Mappings_DLFaults.csv`). These contain the mappings between human tags (*HumanTags*), GPT-4-Turbo tags (*Final Tags*), with indicated the type of matching (*MatchType*).
    - Additional codes created during the consolidation (`RQ2_newCodes_DRL.csv`, `RQ2_newCodes_Functional.csv`, `RQ2_newCodes_DLFaults.csv`), annotated with the matching: *new code*, *old code*, *human code*, *match type*
    - Set files (`RQ2_Sets_DRL.csv`, `RQ2_Sets_Functional.csv`, `RQ2_Sets_DLFaults.csv`). Each file contains the following columns:
    - *HumanTags*: List of tags from the original dataset
    - *InitialTags*: Set of tags from RQ1,
    - *ConsolidatedTags*: Tags that have been consolidated,
    - *FinalTags*: Final set of tags (results of RQ2, used in RQ3)
    - *NewTags*: New tags created during consolidation
    - `RQ2_Set_Metrics.csv`: Reports the RQ2 output metrics (Precision, Recall, F1-Score, Jaccard).
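
    As a toy illustration of the RQ2 set metrics, here is a sketch of a per-row Jaccard score between human and GPT-4-Turbo tag sets on a Sets file; it assumes tags are stored as comma-separated strings, which may not match the actual cell format:

     import pandas as pd

     sets = pd.read_csv("RQ2_Sets_DLFaults.csv")

     def jaccard(a: set, b: set) -> float:
         """Jaccard similarity between two tag sets."""
         return len(a & b) / len(a | b) if (a | b) else 1.0

     # Assumption: tags are comma-separated strings within each cell.
     human = sets["HumanTags"].str.split(",").apply(lambda t: {x.strip() for x in t})
     final = sets["FinalTags"].str.split(",").apply(lambda t: {x.strip() for x in t})

     print(pd.Series(map(jaccard, human, final)).mean())
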
  10. Sepsis Dataset –

    • kaggle.com
    Updated May 31, 2025
    Cite
    Fatolu Peter (2025). Sepsis Dataset – [Dataset]. https://www.kaggle.com/datasets/olagokeblissman/sepsis-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 31, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Fatolu Peter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📝 Dataset Overview: This dataset focuses on early warning detection for sepsis, a critical and potentially fatal medical condition. It includes anonymized vital signs, lab results, and clinical indicators of patients admitted to the hospital, structured for real-time monitoring and predictive modeling.

    It’s ideal for clinical data analysts, healthcare data scientists, and AI practitioners aiming to develop decision support tools, early warning dashboards, or predictive health models.

    🔍 Dataset Features:

    • Patient_ID: Unique anonymized identifier
    • Admission_Date: Patient’s hospital admission date
    • Temperature_C: Body temperature in degrees Celsius
    • BP_Systolic: Systolic blood pressure (mmHg)
    • BP_Diastolic: Diastolic blood pressure (mmHg)
    • Heart_Rate: Beats per minute
    • WBC_Count: White blood cell count (x10⁹/L)
    • Lactate_mmol_L: Lactate level in mmol/L
    • Sepsis_Flag: Binary indicator (1 = Suspected Sepsis, 0 = Normal)
    • Ward: Hospital ward/unit
    • Doctor_On_Duty: Attending physician name (anonymized)

    🎯 Use Cases:

    Build Power BI dashboards for hospital early warning systems

    Train ML classification models to detect early signs of sepsis

    Create patient monitoring tools with Python or R

    Explore the relationship between vitals & sepsis onset

    Perform feature engineering for risk scoring systems
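
    A minimal sketch of the ML classification use case, training a baseline classifier on the vitals and labs to predict Sepsis_Flag (the CSV file name is assumed; the feature columns follow the list above):

     import pandas as pd
     from sklearn.ensemble import RandomForestClassifier
     from sklearn.metrics import classification_report
     from sklearn.model_selection import train_test_split

     df = pd.read_csv("sepsis_dataset.csv")  # file name assumed

     features = ["Temperature_C", "BP_Systolic", "BP_Diastolic",
                 "Heart_Rate", "WBC_Count", "Lactate_mmol_L"]
     X, y = df[features], df["Sepsis_Flag"]

     X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.2, stratify=y, random_state=42
     )

     clf = RandomForestClassifier(n_estimators=200, random_state=42)
     clf.fit(X_train, y_train)
     print(classification_report(y_test, clf.predict(X_test)))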

    📌 Clinical Relevance: Sepsis is one of the leading causes of in-hospital mortality worldwide. Early detection is crucial to reducing death rates and improving outcomes. This dataset empowers developers and analysts to make a meaningful impact in the healthcare sector.

    👤 Created By: Fatolu Peter (Emperor Analytics) A passionate healthcare analyst leveraging data to drive innovation in public health across Nigeria. This is Project 12 in my data-for-good series.

    ✅ LinkedIn Post: 🚨 New Dataset: Sepsis Early Warning System Data – Now on Kaggle 📊 Clinical vital signs + lab markers + sepsis risk flags 🔗 Explore the dataset here

    This dataset enables healthcare data scientists to: ✅ Build real-time hospital dashboards ✅ Predict sepsis risk with machine learning ✅ Explore vitals like BP, lactate, WBC, and temperature ✅ Support early intervention using data insights

    Whether you're into: 🧠 Predictive modeling 📈 Power BI clinical dashboards 📉 Risk analytics in healthcare This is for you.

    Join me in using data to save lives — one insight at a time. If you build something, tag me. I’ll gladly share it! 💡

    HealthcareAnalytics #SepsisAwareness #EarlyWarningSystems #KaggleDataset #PowerBI #DataForGood #FatoluPeter #EmperorAnalytics #PublicHealth #Project12 #RealWorldData


  11. Customer Shopping Trends Dataset

    • kaggle.com
    Updated Oct 5, 2023
    Cite
    Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset/
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sourav Banerjee
    Description

    Context

    The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.

    Content

    This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

    Dataset Glossary (Column-wise)

    • Customer ID - Unique identifier for each customer
    • Age - Age of the customer
    • Gender - Gender of the customer (Male/Female)
    • Item Purchased - The item purchased by the customer
    • Category - Category of the item purchased
    • Purchase Amount (USD) - The amount of the purchase in USD
    • Location - Location where the purchase was made
    • Size - Size of the purchased item
    • Color - Color of the purchased item
    • Season - Season during which the purchase was made
    • Review Rating - Rating given by the customer for the purchased item
    • Subscription Status - Indicates if the customer has a subscription (Yes/No)
    • Shipping Type - Type of shipping chosen by the customer
    • Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)
    • Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)
    • Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction
    • Payment Method - Customer's most preferred payment method
    • Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)
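
    A short pandas sketch of the kind of analysis the glossary above supports (the CSV file name is an assumption; column names are taken from the glossary):

     import pandas as pd

     df = pd.read_csv("shopping_trends.csv")  # file name assumed

     # Average purchase amount by category and gender.
     spend = df.groupby(["Category", "Gender"])["Purchase Amount (USD)"].mean().unstack()

     # Do subscribers rate their purchases differently from non-subscribers?
     ratings = df.groupby("Subscription Status")["Review Rating"].agg(["mean", "count"])

     print(spend.round(2))
     print(ratings)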

    Structure of the Dataset

    https://i.imgur.com/6UEqejq.png

    Acknowledgement

    This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

    Cover Photo by: Freepik

    Thumbnail by: Clothing icons created by Flat Icons - Flaticon

  12. Data from: libertarian

    • huggingface.co
    Updated Jan 13, 2025
    Cite
    Infinity (2025). libertarian [Dataset]. https://huggingface.co/datasets/CineAI/libertarian
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2025
    Authors
    Infinity
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The dataset was created using ChatGPT-4 and consists of statements categorized into three groups: libertarian statements, statements that do not align with libertarian ideology and vision, and mixed statements. Each entry includes a statement alongside an evaluation generated by ChatGPT-4. The primary focus of the evaluation is to assess how closely the statement aligns with libertarian principles and to provide a detailed explanation of the reasoning behind the assessment. The dataset can… See the full description on the dataset page: https://huggingface.co/datasets/CineAI/libertarian.

  13. WizardLM_Orca

    • aifasthub.com
    • huggingface.co
    Updated Sep 4, 2025
    + more versions
    Cite
    Pankaj Mathur (2025). WizardLM_Orca [Dataset]. https://aifasthub.com/datasets/pankajmathur/WizardLM_Orca
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 4, 2025
    Authors
    Pankaj Mathur
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    An explain-tuned WizardLM dataset (~55K examples) created using approaches from the Orca research paper. We leverage all 15 system instructions provided in the Orca research paper to generate custom datasets, in contrast to the vanilla instruction-tuning approaches used by the original datasets. This helps student models like orca_mini_13b learn the thought process from the teacher model, ChatGPT (gpt-3.5-turbo-0301 version). Please see how the System prompt is added before each instruction.
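
    A sketch of how such a system prompt might be prepended when querying the teacher model; the system text is one of the 15 Orca instructions, while the message layout and helper function are assumptions for illustration:

     # One of the 15 Orca system instructions.
     system_instruction = (
         "You are an AI assistant. You will be given a task. "
         "You must generate a detailed and long answer."
     )

     def build_messages(instruction: str) -> list[dict]:
         """Prepend the system instruction to a WizardLM instruction (assumed layout)."""
         return [
             {"role": "system", "content": system_instruction},
             {"role": "user", "content": instruction},
         ]

     print(build_messages("Explain why the sky is blue."))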

  14. ContextToQuestions

    • huggingface.co
    Updated Mar 16, 2024
    Cite
    Moustapha (2024). ContextToQuestions [Dataset]. https://huggingface.co/datasets/mito0o852/ContextToQuestions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 16, 2024
    Authors
    Moustapha
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Context-Based Question Generation Dataset

    This dataset is designed for context-based question generation, where questions of different types (true/false, multiple-choice, open-ended) are generated based on a given context. The dataset is synthetically created using ChatGPT, providing a diverse set of questions to test comprehension and reasoning skills.

      Dataset Structure
    

    The dataset is structured with the following fields for each example:

    context: The context provided… See the full description on the dataset page: https://huggingface.co/datasets/mito0o852/ContextToQuestions.

  15. ChatGPT-Jailbreak-Prompts

    • huggingface.co
    Updated Jun 19, 2023
    Cite
    Rubén Darío Jaramillo Romero (2023). ChatGPT-Jailbreak-Prompts [Dataset]. https://huggingface.co/datasets/rubend18/ChatGPT-Jailbreak-Prompts
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 19, 2023
    Authors
    Rubén Darío Jaramillo Romero
    Description

    Dataset Card for Dataset Name

      Name
    

    ChatGPT Jailbreak Prompts

      Dataset Summary
    

    ChatGPT Jailbreak Prompts is a complete collection of jailbreak-related prompts for ChatGPT. This dataset is intended to provide a valuable resource for understanding and generating text in the context of jailbreaking ChatGPT.

      Languages
    

    [English]

  16. CodeVul-4omini-SFT

    • huggingface.co
    Updated Apr 3, 2025
    Cite
    Vu Hoang Anh (2025). CodeVul-4omini-SFT [Dataset]. https://huggingface.co/datasets/vamcrizer/CodeVul-4omini-SFT
    Explore at:
    Dataset updated
    Apr 3, 2025
    Authors
    Vu Hoang Anh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset was made for code vulnerability detection using ChatGPT 4o-mini via OpenAI's API. Its main purpose is supervised fine-tuning (SFT), with further purposes possible.

  17. ConIR

    • huggingface.co
    Updated Apr 15, 2024
    Cite
    He Xingwei (2024). ConIR [Dataset]. https://huggingface.co/datasets/He-Xingwei/ConIR
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2024
    Authors
    He Xingwei
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    AnnoLLM

    This repo hosts the data for our NAACL 2024 paper "AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators".

      ConIR Dataset
    

    The conversation-based information retrieval (ConIR) dataset is created by ChatGPT based on the MS-MARCO passage ranking dataset. The ConIR dataset is available at https://huggingface.co/datasets/He-Xingwei/ConIR. The sizes of the training and test sets for ConIR are 71,557 and 3,000 respectively. When using it, please… See the full description on the dataset page: https://huggingface.co/datasets/He-Xingwei/ConIR.

  18. MNLP_M2_rag_documents_chatGPT

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Arnault Stähli (2025). MNLP_M2_rag_documents_chatGPT [Dataset]. https://huggingface.co/datasets/arnaultsta/MNLP_M2_rag_documents_chatGPT
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Arnault Stähli
    Description

    This dataset was created by prompting ChatGPT to give us context for the EPFL question set.

  19. chatgpt-paraphrases-zh

    • huggingface.co
    Updated Jul 23, 2024
    Cite
    Shen Huang (2024). chatgpt-paraphrases-zh [Dataset]. https://huggingface.co/datasets/pangda/chatgpt-paraphrases-zh
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 23, 2024
    Authors
    Shen Huang
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is a Chinese dataset of paraphrases created by ChatGPT. For English paraphrase dataset, you can refer to humarin/chatgpt-paraphrases.

      We used this prompt to generate paraphrases
    

    给下面这个问题生成5条相似的改写: {text} (English: "Generate 5 similar rewrites for the question below: {text}")

    This dataset is based on queries from Baidu and Zhihu. We generated 5 paraphrases for each sample, so in total the dataset has about 238k data rows. You can make 30 rows from a row from each sample. In this way you can make 7.1 million train pairs (238k rows with 5 paraphrases… See the full description on the dataset page: https://huggingface.co/datasets/pangda/chatgpt-paraphrases-zh.

  20. dolly-v2_orca

    • huggingface.co
    Updated Jul 15, 2018
    Cite
    Pankaj Mathur (2018). dolly-v2_orca [Dataset]. https://huggingface.co/datasets/pankajmathur/dolly-v2_orca
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 15, 2018
    Authors
    Pankaj Mathur
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    An explain-tuned Dolly-V2 dataset (~15K examples) created using approaches from the Orca research paper. We leverage all 15 system instructions provided in the Orca research paper to generate explain-tuned datasets, in contrast to the vanilla instruction-tuning approaches used by the original datasets. This helps student models like orca_mini_13b, orca_mini_7b, or orca_mini_3b learn the thought process from the teacher model, ChatGPT (gpt-3.5-turbo-0301 version). Please see how the System prompt is added before… See the full description on the dataset page: https://huggingface.co/datasets/pankajmathur/dolly-v2_orca.
