100+ datasets found
  1. AI Developer Performance Dataset

    • kaggle.com
    zip
    Updated May 27, 2025
    Cite
    Shahzad Aslam (2025). AI Developer Performance Dataset [Dataset]. https://www.kaggle.com/datasets/zeesolver/ai-developer-dataset
    Explore at:
    Available download formats: zip (5992 bytes)
    Dataset updated
    May 27, 2025
    Authors
    Shahzad Aslam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains 500 records and 9 features related to the productivity of developers using AI tools. It appears to analyze how factors like working habits, caffeine intake, and AI usage affect developer performance.

    Suggested Machine Learning Tasks

    • Binary classification (task_success)
    • Regression (e.g., predicting cognitive_load)
    • Clustering of work patterns
    • Correlation analysis & feature importance
    • Time series simulation & rolling averages (useful with synthetic date column)
    • Exploratory Data Analysis (EDA)
    • Anomaly detection (e.g., outliers in bugs_reported)
    • Multi-output regression (predicting commits and bugs_reported)
    • Dimensionality reduction (PCA or t-SNE for pattern visualization)
    • Decision rule extraction (e.g., tree-based rules for task_success)

    🧠 Inspiration

    Developers with balanced AI usage, sleep, and moderate coffee intake show higher task success. Overuse of AI or caffeine increases cognitive load, reducing effectiveness. Productivity thrives on smart work, not just hard work.
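One of the suggested tasks above, anomaly detection on bugs_reported, can be sketched with a simple IQR outlier rule. The rows below are invented stand-ins; the real 500-record CSV comes from the Kaggle download.

```python
import pandas as pd

# Invented stand-in values for the bugs_reported column
# (only the column name comes from the dataset description).
df = pd.DataFrame({
    "bugs_reported": [1, 2, 0, 3, 2, 1, 2, 25, 1, 3, 2, 0],
})

# Classic IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = df["bugs_reported"].quantile([0.25, 0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["bugs_reported"] < lo) | (df["bugs_reported"] > hi)]
print(outliers["bugs_reported"].tolist())  # [25]
```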

    📊 Column Descriptions

    • hours_coding – Daily coding hours (float).
    • coffee_intake_mg – Daily caffeine intake in milligrams (integer).
    • distractions – Number of distractions experienced (integer).
    • sleep_hours – Average sleep hours per day (float).
    • commits – Number of code commits per day (integer).
    • bugs_reported – Number of bugs reported (integer).
    • ai_usage_hours – Daily AI tool usage hours (float).
    • cognitive_load – Measured cognitive load on a scale (float).
    • task_success – Binary variable indicating task completion success (1 = success, 0 = fail).
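Given the columns above, a first-pass feature screen might correlate each numeric habit with the binary task_success target. The values below are invented for illustration; only the column names come from the dataset description.

```python
import pandas as pd

# Toy rows shaped like the documented schema (values invented here;
# the real data is the 500-record Kaggle CSV).
df = pd.DataFrame({
    "hours_coding":     [6.0, 8.5, 4.0, 7.0, 9.0, 5.5],
    "coffee_intake_mg": [200, 600, 150, 300, 700, 250],
    "sleep_hours":      [7.5, 5.0, 8.0, 7.0, 4.5, 7.8],
    "ai_usage_hours":   [2.0, 6.0, 1.5, 2.5, 7.0, 2.2],
    "cognitive_load":   [3.0, 8.0, 2.5, 4.0, 9.0, 3.2],
    "task_success":     [1, 0, 1, 1, 0, 1],
})

# Point-biserial-style screen: correlate each feature with the binary target
# to see which habits track task success.
corr = df.corr()["task_success"].drop("task_success")
print(corr.sort_values())
```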
  2. Data from: Multi-Source Distributed System Data for AI-powered Analytics

    • zenodo.org
    zip
    Updated Nov 10, 2022
    Cite
    Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao (2022). Multi-Source Distributed System Data for AI-powered Analytics [Dataset]. http://doi.org/10.5281/zenodo.3549604
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 10, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract:

    In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
    The major contributions have been materialized in the form of novel algorithms.
    Typically, researchers have focused on one specific type of observability data source, such as application logs, metrics, or distributed traces, to create new algorithms.
    Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance.
    Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
    Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.

    General Information:

    This repository contains simple scripts for data statistics and a link to the multi-source distributed system dataset.

    Details of this dataset can be found in the original paper:

    Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".

    If you use the data, implementation, or any details of the paper, please cite!

    BIBTEX:


    @inproceedings{nedelkoski2020multi,
     title={Multi-source Distributed System Data for AI-Powered Analytics},
     author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej},
     booktitle={European Conference on Service-Oriented and Cloud Computing},
     pages={161--176},
     year={2020},
     organization={Springer}
    }
    


    The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced by running a complex distributed system (OpenStack). In addition, we also provide the workload and fault scripts, together with the Rally report, which can serve as ground truth. We provide two datasets, which differ in how the workload is executed: sequential_data is generated by executing a workload of sequential user requests, and concurrent_data is generated by executing a workload of concurrent user requests.

    The raw logs in both datasets contain the same files. Users who want the logs filtered by time for each dataset should refer to the timestamps in the metrics (they define the time window). In addition, we suggest using the provided aggregated, time-ranged logs for both datasets in CSV format.

    Important: The logs and the metrics are synchronized in time and are both recorded in CEST (Central European Summer Time). The traces are in UTC (Coordinated Universal Time, two hours behind CEST). They must be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.
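A minimal sketch of that alignment using Python's standard library, assuming a fixed CEST = UTC+2 offset as stated in the note (the timestamp value is invented for illustration):

```python
from datetime import datetime, timedelta, timezone

# Logs and metrics are recorded on the CEST clock (UTC+2) per the dataset note.
CEST = timezone(timedelta(hours=2))

# Example trace timestamp recorded in UTC (value invented for illustration).
trace_utc = datetime(2019, 11, 19, 12, 0, 0, tzinfo=timezone.utc)

# Shift the trace onto the CEST clock used by logs and metrics before joining sources.
trace_cest = trace_utc.astimezone(CEST)
print(trace_cest.isoformat())  # 2019-11-19T14:00:00+02:00
```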

    Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/

  3. AI Use Case - EDA

    • ai.tracebloc.io
    json
    Updated Nov 18, 2025
    Cite
    tracebloc (2025). AI Use Case - EDA [Dataset]. https://ai.tracebloc.io/explore/support-pilots-in-the-approach-and-landing-phase?tab=exploratory-data-analysis
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 18, 2025
    Dataset provided by
    Tracebloc GmbH
    Authors
    tracebloc
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured: Missing Values
    Measurement technique: Statistical and exploratory data analysis
    Description

    Comprehensive exploratory data analysis

  4. AI TOOLS - Open Dataset - 4000 tools / 50 categories

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Sep 21, 2023
    Cite
    Olivier BUREAU (2023). AI TOOLS - Open Dataset - 4000 tools / 50 categories [Dataset]. http://doi.org/10.7910/DVN/QLSXZG
    Explore at:
    Available download formats: Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Olivier BUREAU
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Introducing a comprehensive and openly accessible dataset designed for researchers and data scientists in the field of artificial intelligence. This dataset encompasses a collection of over 4,000 AI tools, categorized into more than 50 distinct categories. This resource has been generously shared by its owner, TasticAI, and is freely available for purposes such as research, benchmarking, market surveys, and more.

    Dataset Overview: Each AI tool is accompanied by the following information:

    • AI Tool Name: The tool's name, providing an easy reference point for identifying specific tools within the dataset.
    • Description: A concise one-line description offering a quick glimpse into the tool's purpose and functionality.
    • AI Tool Category: One of more than 50 distinct categories, so you can easily locate tools that align with your research interests, whether in natural language processing, computer vision, machine learning, or other AI subfields.
    • Images: An image associated with each tool, allowing quick recognition and visual association.
    • Website Links: A direct link to the tool's website or documentation, enabling researchers and data scientists to delve deeper into tools that pique their interest.

    Utilization and Benefits: This openly shared dataset serves as a valuable resource for various purposes:

    • Research: Identify AI tools relevant to your studies, facilitating faster literature reviews, comparative analyses, and exploration of cutting-edge technologies.
    • Benchmarking: Evaluate and compare tools within specific categories or across categories.
    • Market Surveys: Gain insights into the AI tool landscape, helping to identify emerging trends and opportunities within the AI market.
    • Educational Purposes: Teach and learn about AI tools, their applications, and the categorization of AI technologies.

    Conclusion: This openly shared dataset from TasticAI, featuring over 4,000 AI tools in more than 50 categories, is a valuable asset for researchers, data scientists, and anyone interested in artificial intelligence. Its easy accessibility, detailed information, and versatile applications make it useful for AI research, benchmarking, market analysis, and more. Explore the dataset at https://tasticai.com.

  5. Artificial Intelligence Training Dataset Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 21, 2025
    Cite
    Archive Market Research (2025). Artificial Intelligence Training Dataset Report [Dataset]. https://www.archivemarketresearch.com/reports/artificial-intelligence-training-dataset-38645
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    Feb 21, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Artificial Intelligence (AI) Training Dataset market is projected to reach $1605.2 million by 2033, exhibiting a CAGR of 9.4% from 2025 to 2033. The surge in demand for AI training datasets is driven by the increasing adoption of AI and machine learning technologies in various industries such as healthcare, financial services, and manufacturing. Moreover, the growing need for reliable and high-quality data for training AI models is further fueling the market growth. Key market trends include the increasing adoption of cloud-based AI training datasets, the emergence of synthetic data generation, and the growing focus on data privacy and security. The market is segmented by type (image classification dataset, voice recognition dataset, natural language processing dataset, object detection dataset, and others) and application (smart campus, smart medical, autopilot, smart home, and others). North America is the largest regional market, followed by Europe and Asia Pacific. Key companies operating in the market include Appen, Speechocean, TELUS International, Summa Linguae Technologies, and Scale AI.

    Artificial Intelligence (AI) training datasets are critical for developing and deploying AI models. These datasets provide the data that AI models need to learn, and the quality of the data directly impacts the performance of the model. The AI training dataset market landscape is complex, with many different providers offering datasets for a variety of applications. The market is also rapidly evolving, as new technologies and techniques are developed for collecting, labeling, and managing AI training data.

  6. Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset from qualitative descriptions of 13 clinical parameters. The resulting data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity, demonstrating statistical similarity in 12/13 (92.31%) parameters: no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters, and overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
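The 95% CI overlap check described in the Methods can be approximated with a normal-approximation interval in plain Python. The two samples below are invented stand-ins for a real parameter and its synthetic counterpart; the actual analysis used VitalDB data.

```python
import math
from statistics import mean, stdev

def ci95(sample):
    """Normal-approximation 95% confidence interval for the sample mean."""
    m = mean(sample)
    half = 1.96 * stdev(sample) / math.sqrt(len(sample))
    return m - half, m + half

def cis_overlap(a, b):
    """True if the two samples' 95% CIs for the mean intersect."""
    lo_a, hi_a = ci95(a)
    lo_b, hi_b = ci95(b)
    return max(lo_a, lo_b) <= min(hi_a, hi_b)

# Invented stand-ins for one real parameter and its synthetic copy.
real      = [72, 75, 71, 78, 74, 73, 76, 70, 77, 74]
synthetic = [73, 74, 72, 77, 75, 74, 75, 71, 76, 73]
print(cis_overlap(real, synthetic))  # True
```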

  7. AI Platform Performance Dataset

    • kaggle.com
    zip
    Updated Sep 20, 2024
    Cite
    Satya Prakash Swain (2024). AI Platform Performance Dataset [Dataset]. https://www.kaggle.com/datasets/satyaprakashswain/ai-platform-performance-dataset
    Explore at:
    Available download formats: zip (8734 bytes)
    Dataset updated
    Sep 20, 2024
    Authors
    Satya Prakash Swain
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset compares the performance of various AI platforms across different tasks and metrics. It is designed for use in Kaggle competitions and analysis.

    Columns

    • Platform Name: Name of the AI platform or framework
    • Task Type: Type of AI task (e.g., Image Classification, Natural Language Processing, Object Detection)
    • Dataset: Name of the benchmark dataset used
    • Model Architecture: The specific model architecture used for the task
    • Accuracy: Accuracy score for the given task (percentage)
    • Training Time: Time taken to train the model (in hours)
    • Inference Time: Time taken for inference (in milliseconds)
    • GPU Memory Usage: GPU memory consumed during training (in GB)
    • Energy Consumption: Energy consumed during training (in kWh)
    • Date: Date of the performance measurement

    Notes

    • This dataset is synthetic and for demonstration purposes. Real-world performance may vary.
    • Performance metrics are collected under standardized conditions, but may not reflect all use cases.
    • Regular updates are recommended to keep the dataset current with the latest AI advancements.

    Potential Uses

    • Comparing AI platform performance across different tasks
    • Analyzing trade-offs between accuracy, speed, and resource consumption
    • Tracking improvements in AI platforms over time
    • Helping data scientists choose the most suitable platform for their specific needs
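One way to explore the accuracy/resource trade-off suggested above is accuracy points per kWh. The rows below are invented to match the documented columns (the real file is itself synthetic demo data):

```python
import pandas as pd

# Invented rows following the documented column names.
df = pd.DataFrame({
    "Platform Name":      ["PlatformA", "PlatformB", "PlatformC"],
    "Task Type":          ["Image Classification"] * 3,
    "Accuracy":           [92.1, 94.3, 90.5],   # percent
    "Training Time":      [12.0, 30.0, 8.0],    # hours
    "Energy Consumption": [40.0, 120.0, 25.0],  # kWh
})

# Accuracy per unit of energy: a simple efficiency score per platform.
df["acc_per_kwh"] = df["Accuracy"] / df["Energy Consumption"]
best = df.sort_values("acc_per_kwh", ascending=False).iloc[0]["Platform Name"]
print(best)  # PlatformC
```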
  8. AI Training Dataset Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Jul 15, 2025
    Cite
    Technavio (2025). AI Training Dataset Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, and UK), APAC (China, India, Japan, and South Korea), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/ai-training-dataset-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United Kingdom, Canada, United States
    Description


    AI Training Dataset Market Size 2025-2029

    The AI training dataset market size is projected to increase by USD 7.33 billion, at a CAGR of 29% from 2024 to 2029. The proliferation and increasing complexity of foundational AI models will drive the market.

    Market Insights

    North America dominated the market and is expected to account for 36% of growth during 2025-2029.
    By Service Type - Text segment was valued at USD 742.60 billion in 2023
    By Deployment - On-premises segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 479.81 million 
    Market Future Opportunities 2024: USD 7334.90 million
    CAGR from 2024 to 2029 : 29%
    

    Market Summary

    The market is experiencing significant growth as businesses increasingly rely on artificial intelligence (AI) to optimize operations, enhance customer experiences, and drive innovation. The proliferation and increasing complexity of foundational AI models necessitate large, high-quality datasets for effective training and improvement. This shift from data quantity to data quality and curation is a key trend in the market. Navigating data privacy, security, and copyright complexities, however, poses a significant challenge. Businesses must ensure that their datasets are ethically sourced, anonymized, and securely stored to mitigate risks and maintain compliance. For instance, in the supply chain optimization sector, companies use AI models to predict demand, optimize inventory levels, and improve logistics. Access to accurate and up-to-date training datasets is essential for these applications to function efficiently and effectively. Despite these challenges, the benefits of AI and the need for high-quality training datasets continue to drive market growth. The potential applications of AI are vast and varied, from healthcare and finance to manufacturing and transportation. As businesses continue to explore the possibilities of AI, the demand for curated, reliable, and secure training datasets will only increase.

    What will be the size of the AI Training Dataset Market during the forecast period?

    The market continues to evolve, with businesses increasingly recognizing the importance of high-quality datasets for developing and refining artificial intelligence models. According to recent studies, the use of AI in various industries is projected to grow by over 40% in the next five years, creating a significant demand for training datasets. This trend is particularly relevant for boardrooms, as companies grapple with compliance requirements, budgeting decisions, and product strategy. Moreover, the importance of data labeling, feature selection, and imbalanced data handling in model performance cannot be overstated. For instance, a mislabeled dataset can lead to biased and inaccurate models, potentially resulting in costly errors. Similarly, effective feature selection algorithms can significantly improve model accuracy and reduce computational resources. Despite these challenges, advances in model compression methods, dataset scalability, and data lineage tracking are helping to address some of the most pressing issues in the market. For example, model compression techniques can reduce the size of models, making them more efficient and easier to deploy. Similarly, data lineage tracking can help ensure data consistency and improve model interpretability. In conclusion, the market is a critical component of the broader AI ecosystem, with significant implications for businesses across industries. By focusing on data quality, effective labeling, and advanced techniques for handling imbalanced data and improving model performance, organizations can stay ahead of the curve and unlock the full potential of AI.

    Unpacking the AI Training Dataset Market Landscape

    In the realm of artificial intelligence (AI), the significance of high-quality training datasets is indisputable. Businesses harnessing AI technologies invest substantially in acquiring and managing these datasets to ensure model robustness and accuracy. According to recent studies, up to 80% of machine learning projects fail due to insufficient or poor-quality data. Conversely, organizations that effectively manage their training data experience an average ROI improvement of 15% through cost reduction and enhanced model performance.

    Distributed computing systems and high-performance computing facilitate the processing of vast datasets, enabling businesses to train models at scale. Data security protocols and privacy preservation techniques are crucial to protect sensitive information within these datasets. Reinforcement learning models and supervised learning models each have their unique applications, with the former demonstrating a 30% faster convergence rate in certain use cases.

    Data annot

  9. AI-Generated Synthetic Tabular Dataset Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 4, 2025
    Cite
    Growth Market Reports (2025). AI-Generated Synthetic Tabular Dataset Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/ai-generated-synthetic-tabular-dataset-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Aug 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI-Generated Synthetic Tabular Dataset Market Outlook



    According to our latest research, the AI-Generated Synthetic Tabular Dataset market size reached USD 1.42 billion in 2024 globally, reflecting the rapid adoption of artificial intelligence-driven data generation solutions across numerous industries. The market is expected to expand at a robust CAGR of 34.7% from 2025 to 2033, reaching a forecasted value of USD 19.17 billion by 2033. This exceptional growth is primarily driven by the increasing need for high-quality, privacy-preserving datasets for analytics, model training, and regulatory compliance, particularly in sectors with stringent data privacy requirements.
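As a quick sanity check on the quoted figures, the growth from USD 1.42 billion (2024) to USD 19.17 billion (2033) spans nine compounding periods, so the implied CAGR can be computed directly; the small gap versus the stated 34.7% likely reflects rounding or a different base year in the report.

```python
# Implied compound annual growth rate over the 9 years from 2024 to 2033,
# using the market-size endpoints quoted in the report text.
implied_cagr = (19.17 / 1.42) ** (1 / 9) - 1
print(round(implied_cagr * 100, 1))  # 33.5
```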




    One of the principal growth factors propelling the AI-Generated Synthetic Tabular Dataset market is the escalating demand for data-driven innovation amidst tightening data privacy regulations. Organizations across healthcare, finance, and government sectors are facing mounting challenges in accessing and sharing real-world data due to GDPR, HIPAA, and other global privacy laws. Synthetic data, generated by advanced AI algorithms, offers a solution by mimicking the statistical properties of real datasets without exposing sensitive information. This enables organizations to accelerate AI and machine learning development, conduct robust analytics, and facilitate collaborative research without risking data breaches or non-compliance. The growing sophistication of generative models, such as GANs and VAEs, has further increased confidence in the utility and realism of synthetic tabular data, fueling adoption across both large enterprises and research institutions.




    Another significant driver is the surge in digital transformation initiatives and the proliferation of AI and machine learning applications across industries. As businesses strive to leverage predictive analytics, automation, and intelligent decision-making, the need for large, diverse, and high-quality datasets has become paramount. However, real-world data is often siloed, incomplete, or inaccessible due to privacy concerns. AI-generated synthetic tabular datasets bridge this gap by providing scalable, customizable, and bias-mitigated data for model training and validation. This not only accelerates AI deployment but also enhances model robustness and generalizability. The flexibility of synthetic data generation platforms, which can simulate rare events and edge cases, is particularly valuable in sectors like finance and healthcare, where such scenarios are underrepresented in real datasets but critical for risk assessment and decision support.




    The rapid evolution of the AI-Generated Synthetic Tabular Dataset market is also underpinned by technological advancements and growing investments in AI infrastructure. The availability of cloud-based synthetic data generation platforms, coupled with advancements in natural language processing and tabular data modeling, has democratized access to synthetic datasets for organizations of all sizes. Strategic partnerships between technology providers, research institutions, and regulatory bodies are fostering innovation and establishing best practices for synthetic data quality, utility, and governance. Furthermore, the integration of synthetic data solutions with existing data management and analytics ecosystems is streamlining workflows and reducing barriers to adoption, thereby accelerating market growth.




    Regionally, North America dominates the AI-Generated Synthetic Tabular Dataset market, accounting for the largest share in 2024 due to the presence of leading AI technology firms, strong regulatory frameworks, and early adoption across industries. Europe follows closely, driven by stringent data protection laws and a vibrant research ecosystem. The Asia Pacific region is emerging as a high-growth market, fueled by rapid digitalization, government initiatives, and increasing investments in AI research and development. Latin America and the Middle East & Africa are also witnessing growing interest, particularly in sectors like finance and government, though market maturity varies across countries. The regional landscape is expected to evolve dynamically as regulatory harmonization, cross-border data collaboration, and technological advancements continue to shape market trajectories globally.



  10. Claude.ai Usage Data

    • kaggle.com
    zip
    Updated Sep 16, 2025
    Cite
    Yash Dogra (2025). Claude.ai Usage Data [Dataset]. https://www.kaggle.com/datasets/yashdogra/anthropic
    Explore at:
    Available download formats: zip (2746225 bytes)
    Dataset updated
    Sep 16, 2025
    Authors
    Yash Dogra
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Data Documentation

    This document describes the data sources and variables used in the third Anthropic Economic Index (AEI) report.

    Claude.ai Usage Data

    Overview

    The core dataset contains Claude AI usage metrics aggregated by geography and analysis dimensions (facets).

    Source files:

    • aei_raw_claude_ai_2025-08-04_to_2025-08-11.csv (pre-enrichment data in data/intermediate/)
    • aei_enriched_claude_ai_2025-08-04_to_2025-08-11.csv (enriched data in data/output/)

    Note on data sources: The AEI raw file contains raw counts and percentages. Derived metrics (indices, tiers, per capita calculations, automation/augmentation percentages) are calculated during the enrichment process in aei_report_v3_preprocessing_claude_ai.ipynb.

    Data Schema

    Each row represents one metric value for a specific geography and facet combination:

    • geo_id (string) – Geographic identifier (ISO-2 country code for countries, US state code, or "GLOBAL"; ISO-3 country codes in enriched data)
    • geography (string) – Geographic level: "country", "state_us", or "global"
    • date_start (date) – Start of data collection period
    • date_end (date) – End of data collection period
    • platform_and_product (string) – "Claude AI (Free and Pro)"
    • facet (string) – Analysis dimension (see Facets below)
    • level (integer) – Sub-level within facet (0-2)
    • variable (string) – Metric name (see Variables below)
    • cluster_name (string) – Specific entity within facet (task, pattern, etc.). For intersections, the format is "base::category"
    • value (float) – Numeric metric value
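    Because each row holds a single metric value, analysis usually starts by filtering on facet and pivoting to a wide table. A minimal pandas sketch with invented rows (illustrative values, not real AEI figures):

```python
import pandas as pd

# Illustrative long-format rows mimicking the schema above (geo_id, facet,
# variable, value). These values are invented, not real AEI figures.
long_df = pd.DataFrame({
    "geo_id":   ["US", "US", "DE", "DE"],
    "facet":    ["country"] * 4,
    "variable": ["usage_count", "usage_pct", "usage_count", "usage_pct"],
    "value":    [5000.0, 0.42, 1200.0, 0.10],
})

# Pivot to one row per geography, one column per metric.
wide = long_df[long_df["facet"] == "country"].pivot(
    index="geo_id", columns="variable", values="value"
)
```

    The same pattern works for the intersection facets by first splitting cluster_name on "::".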

    Facets

    • country: Country-level aggregations
    • state_us: US state-level aggregations
    • onet_task: O*NET occupational tasks
    • collaboration: Human-AI collaboration patterns
    • request: Request complexity levels (0=highest granularity, 1=middle granularity, 2=lowest granularity)
    • onet_task::collaboration: Intersection of tasks and collaboration patterns
    • request::collaboration: Intersection of request categories and collaboration patterns

    Core Variables

    Variables follow the pattern {prefix}_{suffix} with specific meanings:

    From AEI processing: *_count, *_pct. From enrichment: *_per_capita, *_per_capita_index, *_pct_index, *_tier, automation_pct, augmentation_pct, soc_pct.

    Usage Metrics

    • usage_count: Total number of conversations/interactions in a geography
    • usage_pct: Percentage of total usage (relative to parent geography: global for countries, US for states)
    • usage_per_capita: Usage count divided by working age population
    • usage_per_capita_index: Concentration index showing if a geography has more/less usage than expected based on population share (1.0 = proportional, >1.0 = over-representation, <1.0 = under-representation)
    • usage_tier: Usage adoption tier (0 = no/little adoption, 1-4 = quartiles of adoption among geographies with sufficient usage)
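    The concentration index described above can be sketched in a few lines of pandas; the geographies, counts, and the working_age_pop column here are invented for illustration (the released files compute these metrics during enrichment):

```python
import pandas as pd

# Invented example geographies; working_age_pop is an assumed helper column
# (the released data derives these metrics during enrichment).
df = pd.DataFrame({
    "geo_id": ["US", "DE", "IN"],
    "usage_count": [5000, 1200, 800],
    "working_age_pop": [210e6, 52e6, 900e6],
})

usage_share = df["usage_count"] / df["usage_count"].sum()
pop_share = df["working_age_pop"] / df["working_age_pop"].sum()

# 1.0 = proportional to population share; >1.0 = over-representation.
df["usage_per_capita_index"] = usage_share / pop_share
```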

    Content Facet Metrics

    O*NET Task Metrics:
    • onet_task_count: Number of conversations using this specific O*NET task
    • onet_task_pct: Percentage of geographic total using this task
    • onet_task_pct_index: Specialization index comparing task usage to baseline (global for countries, US for states)
    • onet_task_collaboration_count: Number of conversations with both this task and collaboration pattern (intersection)
    • onet_task_collaboration_pct: Percentage of the base task's total that has this collaboration pattern (sums to 100% within each task)

    Occupation Metrics

    • soc_pct: Percentage of classified O*NET tasks associated with this SOC major occupation group (e.g., Management, Computer and Mathematical)

    Request Metrics:
    • request_count: Number of conversations in this request category level
    • request_pct: Percentage of geographic total in this category
    • request_pct_index: Specialization index comparing request usage to baseline
    • request_collaboration_count: Number of conversations with both this request category and collaboration pattern (intersection)
    • request_collaboration_pct: Percentage of the base request's total that has this collaboration pattern (sums to 100% within each request)

    Collaboration Pattern Metrics:
    • collaboration_count: Number of conversations with this collaboration pattern
    • collaboration_pct: Percentage of geographic total with this pattern
    • collaboration_pct_index: Specialization index comparing pattern to baseline
    • automation_pct: Percentage of classifiable collaboration that is automation-focused (directive, feedback loop patterns)
    • augmentation_pct: Percentage of classifiable collaboration that is augmentation-focused (validation, task iteration, learning patterns)
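    As a rough sketch of how automation_pct and augmentation_pct could be derived from pattern counts (the pattern-to-bucket mapping follows the description above; the conversation counts are invented):

```python
# Bucket mapping per the description above; conversation counts are invented.
BUCKETS = {
    "directive": "automation",
    "feedback loop": "automation",
    "validation": "augmentation",
    "task iteration": "augmentation",
    "learning": "augmentation",
}
counts = {
    "directive": 30, "feedback loop": 10,
    "validation": 25, "task iteration": 20, "learning": 15,
}

totals = {"automation": 0, "augmentation": 0}
for pattern, n in counts.items():
    totals[BUCKETS[pattern]] += n

classifiable = sum(totals.values())
automation_pct = 100 * totals["automation"] / classifiable      # 40.0
augmentation_pct = 100 * totals["augmentation"] / classifiable  # 60.0
```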

    Demographic & Economic Metrics

    • ...
  11. AI Training Dataset In Healthcare Market Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jun 20, 2025
    Cite
    Archive Market Research (2025). AI Training Dataset In Healthcare Market Report [Dataset]. https://www.archivemarketresearch.com/reports/ai-training-dataset-in-healthcare-market-5352
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    Jun 20, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    global
    Variables measured
    Market Size
    Description

    The AI Training Dataset In Healthcare Market was valued at USD 341.8 million in 2023 and is projected to reach USD 1464.13 million by 2032, exhibiting a CAGR of 23.1% during the forecast period. The growth is attributed to the rising adoption of AI in healthcare, increasing demand for accurate and reliable training datasets, government initiatives to promote AI in healthcare, and technological advancements in data collection and annotation. Healthcare AI training datasets are vital for building effective algorithms and for enhancing patient care and diagnosis. These datasets include large volumes of thoroughly labeled electronic health records, images such as X-ray and MRI scans, and genomics data. They help AI systems identify trends, make forecasts, and even support the development of new approaches to treating disease. However, patient privacy and the ethical use of patient information are of the utmost importance, requiring high levels of anonymization and compliance with laws such as HIPAA. Ongoing expansion and diversification of datasets are crucial to address existing bias and improve the effectiveness of AI across different populations and diseases, providing safer solutions for global health.

  12. Generative AI In Data Analytics Market Analysis, Size, and Forecast...

    • technavio.com
    pdf
    Updated Jul 17, 2025
    Cite
    Technavio (2025). Generative AI In Data Analytics Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, and UK), APAC (China, India, and Japan), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/generative-ai-in-data-analytics-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United States
    Description


    Generative AI In Data Analytics Market Size 2025-2029

    The generative AI in data analytics market is forecast to grow by USD 4.62 billion at a CAGR of 35.5% from 2024 to 2029. Democratization of data analytics and increased accessibility will drive the generative AI in data analytics market.

    Market Insights

    North America dominated the market and is expected to account for 37% of the growth during 2025-2029.
    By Deployment - Cloud-based segment was valued at USD 510.60 billion in 2023
    By Technology - Machine learning segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 621.84 million 
    Market Future Opportunities 2024: USD 4624.00 million
    CAGR from 2024 to 2029: 35.5%
    

    Market Summary

    The market is experiencing significant growth as businesses worldwide seek to unlock new insights from their data through advanced technologies. This trend is driven by the democratization of data analytics and increased accessibility of AI models, which are now available in domain-specific and enterprise-tuned versions. Generative AI, a subset of artificial intelligence, uses deep learning algorithms to create new data based on existing data sets. This capability is particularly valuable in data analytics, where it can be used to generate predictions, recommendations, and even new data points. One real-world business scenario where generative AI is making a significant impact is in supply chain optimization. In this context, generative AI models can analyze historical data and generate forecasts for demand, inventory levels, and production schedules. This enables businesses to optimize their supply chain operations, reduce costs, and improve customer satisfaction. However, the adoption of generative AI in data analytics also presents challenges, particularly around data privacy, security, and governance. As businesses continue to generate and analyze increasingly large volumes of data, ensuring that it is protected and used in compliance with regulations is paramount. Despite these challenges, the benefits of generative AI in data analytics are clear, and its use is set to grow as businesses seek to gain a competitive edge through data-driven insights.

    What will be the size of the Generative AI In Data Analytics Market during the forecast period?

    Generative AI, a subset of artificial intelligence, is revolutionizing data analytics by automating data processing and analysis, enabling businesses to derive valuable insights faster and more accurately. Synthetic data generation, a key application of generative AI, allows for the creation of large, realistic datasets, addressing the challenge of insufficient data in analytics. Parallel processing methods and high-performance computing power the rapid analysis of vast datasets. Automated machine learning and hyperparameter optimization streamline model development, while model monitoring systems ensure continuous model performance. Real-time data processing and scalable data solutions facilitate data-driven decision-making, enabling businesses to respond swiftly to market trends. One significant trend in the market is the integration of AI-powered insights into business operations. For instance, probabilistic graphical models and backpropagation techniques are used to predict customer churn and optimize marketing strategies. Ensemble learning methods and transfer learning techniques enhance predictive analytics, leading to improved customer segmentation and targeted marketing. According to recent studies, businesses have achieved a 30% reduction in processing time and a 25% increase in predictive accuracy by implementing generative AI in their data analytics processes. This translates to substantial cost savings and improved operational efficiency. By embracing this technology, businesses can gain a competitive edge, making informed decisions with greater accuracy and agility.

    Unpacking the Generative AI In Data Analytics Market Landscape

    In the dynamic realm of data analytics, Generative AI algorithms have emerged as a game-changer, revolutionizing data processing and insights generation. Compared to traditional data mining techniques, Generative AI models can create new data points that mirror the original dataset, enabling more comprehensive data exploration and analysis (Source: Gartner). This innovation leads to a 30% increase in identified patterns and trends, resulting in improved ROI and enhanced business decision-making (IDC).

    Data security protocols are paramount in this context, with Classification Algorithms and Clustering Algorithms ensuring data privacy and compliance alignment. Machine Learning Pipelines and Deep Learning Frameworks facilitate seamless integration with Predictive Modeling Tools and Automated Report Generation on Cloud

  13. AI usage and history students

    • data.mendeley.com
    Updated Jul 22, 2025
    Cite
    Marek Vokoun (2025). AI usage and history students [Dataset]. http://doi.org/10.17632/p77jf84r8m.1
    Explore at:
    Dataset updated
    Jul 22, 2025
    Authors
    Marek Vokoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was designed to explore the factors that most significantly influence the level of AI tool usage among history students at Jan Evangelista Purkyně University (UJEP) in their academic work. It directly addresses three core research questions: - RQ1 investigates which perceptions, attitudes, and behavioral intentions shape AI tool usage. - RQ2 examines the antecedent factors—such as demographic characteristics, digital literacy, and epistemic orientation—that may predict AI engagement. - RQ3 probes whether frequent AI tool users represent a distinct subgroup, suggesting non-random selection into higher usage patterns. The central variable, "AIFrequency", captures students’ self-reported frequency of AI tool use in study-related tasks. This is an ordinal variable measured on a 5-point Likert scale, ranging from 1 ("Never") to 5 ("Daily or Almost Daily"). This variable serves as the primary outcome measure, enabling analysis of usage intensity across different student profiles. The survey also includes items aligned with Technology Acceptance Models (TAM and VAM), covering constructs such as perceived usefulness, ease of use, trust, and intention to continue using AI tools. These constructs are operationalized through multiple Likert-scale items, allowing for both descriptive and inferential statistical analysis. The dataset thus provides a rich foundation for examining not only how often students use AI tools, but also why they do so, and who is most likely to engage with them regularly.

  14. Data from: TWIGMA: A dataset of AI-Generated Images with Metadata From...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 28, 2024
    Cite
    Yiqun Chen; James Zou (2024). TWIGMA: A dataset of AI-Generated Images with Metadata From Twitter [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8031784
    Explore at:
    Dataset updated
    May 28, 2024
    Dataset provided by
    Stanford University
    Authors
    Yiqun Chen; James Zou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Update May 2024: Fixed a data type issue with "id" column that prevented twitter ids from rendering correctly.

    Recent progress in generative artificial intelligence (gen-AI) has enabled the generation of photo-realistic and artistically-inspiring photos at a single click, catering to millions of users online. To explore how people use gen-AI models such as DALLE and StableDiffusion, it is critical to understand the themes, contents, and variations present in the AI-generated photos. In this work, we introduce TWIGMA (TWItter Generative-ai images with MetadatA), a comprehensive dataset encompassing 800,000 gen-AI images collected from Jan 2021 to March 2023 on Twitter, with associated metadata (e.g., tweet text, creation date, number of likes).

    Through a comparative analysis of TWIGMA with natural images and human artwork, we find that gen-AI images possess distinctive characteristics and exhibit, on average, lower variability when compared to their non-gen-AI counterparts. Additionally, we find that the similarity between a gen-AI image and human images (i) is correlated with the number of likes; and (ii) can be used to identify human images that served as inspiration for the gen-AI creations. Finally, we observe a longitudinal shift in the themes of AI-generated images on Twitter, with users increasingly sharing artistically sophisticated content such as intricate human portraits, whereas their interest in simple subjects such as natural scenes and animals has decreased. Our analyses and findings underscore the significance of TWIGMA as a unique data resource for studying AI-generated images.

    Note that, in accordance with Twitter's privacy and control policy, NO raw content from Twitter is included in this dataset; users can and must retrieve the original Twitter content used for analysis via the Twitter id. In addition, users who want to access Twitter data should consult and closely follow the official Twitter developer policy at https://developer.twitter.com/en/developer-terms/policy.

  15. Data from: IA Tweets Analysis Dataset (Spanish)

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Aug 3, 2024
    Cite
    Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10821484
    Explore at:
    Dataset updated
    Aug 3, 2024
    Dataset provided by
    University of Cadiz
    Authors
    Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Description

    This dataset comprises 4,038 tweets in Spanish, related to discussions about artificial intelligence (AI), and was created and utilized in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights," (10.1109/IE61493.2024.10599899) presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes detailed annotations covering sentiment analysis, user engagement metrics, and user profile characteristics, among others.

    Data Collection Method

    Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

    Dataset Content

    ID: A unique identifier for each tweet.

    text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

    polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

    favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

    retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

    user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

    user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

    user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

    user_followers_count: The current number of followers the account has. It is a non-negative integer.

    user_friends_count: The number of users that the account is following. It is a non-negative integer.

    user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

    user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

    user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

    user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.

    Cite as

    Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

    Potential Use Cases

    This dataset is aimed at academic researchers and practitioners with interests in:

    Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

    Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

    Exploring correlations between user engagement metrics and sentiment in discussions about AI.

    Data Format and File Type

    The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.

    License

    The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.

  16. Data from: Enriching time series datasets using Nonparametric kernel...

    • figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Mohamad Ivan Fanany (2023). Enriching time series datasets using Nonparametric kernel regression to improve forecasting accuracy [Dataset]. http://doi.org/10.6084/m9.figshare.1609661.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mohamad Ivan Fanany
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Improving the accuracy of predictions of future values based on past and current observations has been pursued by enhancing prediction methods, combining those methods, or performing data pre-processing. In this paper, another approach is taken, namely increasing the number of inputs in the dataset. This approach is useful especially for shorter time series. By filling in the in-between values in the time series, the number of training samples can be increased, thus increasing the generalization capability of the predictor. The algorithm used to make predictions is a Neural Network, as it is widely used in the literature for time series tasks. For comparison, Support Vector Regression is also employed. The dataset used in the experiment is the frequency of USPTO patents and PubMed scientific publications in the field of health, namely on Apnea, Arrhythmia, and Sleep Stages. Another time series dataset, designated for the NN3 Competition in the field of transportation, is also used for benchmarking. The experimental results show that prediction performance can be significantly increased by filling in in-between data in the time series. Furthermore, the use of detrending and deseasonalization, which separates the data into trend, seasonal, and stationary components, also improves prediction performance on both the original and the filled dataset. The optimal enlargement in this experiment is about five times the length of the original dataset.
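    The paper's core idea, densifying a short series with nonparametric kernel regression before training a predictor, can be sketched with a Nadaraya-Watson estimator; the toy series and bandwidth below are illustrative assumptions, not the paper's data:

```python
import numpy as np

def kernel_smooth(x, y, x_new, bandwidth=0.5):
    """Nadaraya-Watson kernel regression with a Gaussian kernel."""
    # Weight each observation by its kernel distance to every query point.
    w = np.exp(-0.5 * ((x_new[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

# A short toy series observed at integer time steps.
x = np.arange(10, dtype=float)
y = np.sin(x / 3.0)

# Fill in-between values at roughly 5x the original resolution,
# enlarging the training set for a downstream predictor.
x_dense = np.linspace(0.0, 9.0, 46)
y_dense = kernel_smooth(x, y, x_dense)
```

    The bandwidth controls the trade-off between faithfulness to the observed points and smoothness of the interpolated values.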

  17. GenAI Prompt.

    • plos.figshare.com
    xls
    Updated Sep 5, 2025
    Cite
    Tanisha Jowsey; Peta Stapleton; Shawna Campbell; Alexandra Davidson; Cher McGillivray; Isabella Maugeri; Megan Lee; Justin Keogh (2025). GenAI Prompt. [Dataset]. http://doi.org/10.1371/journal.pone.0330217.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Tanisha Jowsey; Peta Stapleton; Shawna Campbell; Alexandra Davidson; Cher McGillivray; Isabella Maugeri; Megan Lee; Justin Keogh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: To determine the accuracy and efficiency of using generative artificial intelligence (GenAI) to undertake thematic analysis.

    Introduction: With the increasing use of GenAI in data analysis, testing the reliability and suitability of GenAI for qualitative data analysis is needed. We propose a method for researchers to assess the reliability of GenAI outputs using deidentified qualitative datasets.

    Methods: We searched three databases (United Kingdom Data Service, Figshare, and Google Scholar) and five journals (PlosOne, Social Science and Medicine, Qualitative Inquiry, Qualitative Research, Sociology Health Review) to identify studies on health-related topics, published prior to this review, whereby humans undertook thematic analysis and published both their analysis in a peer-reviewed journal and the associated dataset. We prompted a closed-system GenAI (Microsoft Copilot) to undertake thematic analysis of these datasets and compared the GenAI outputs with the human outputs. Measures include time (GenAI only), accuracy, overlap with human analysis, and reliability of selected data and quotes.

    Results: Five studies met our inclusion criteria. The themes identified by human researchers and Copilot showed minimal overlap, with human researchers often using discursive thematic analyses (40%) and Copilot focusing on thematic analysis (100%). Copilot's outputs often included fabricated quotes (58%, SD = 45%), and none of the Copilot outputs provided participant spread by theme. Additionally, Copilot's outputs primarily drew themes and quotes from the first 2-3 pages of textual data rather than from the entire dataset. Human researchers provided broader representation and accurate quotes (79% of quotes were correct, SD = 27%).

    Conclusions: Based on these results, we cannot recommend the current version of Copilot for undertaking thematic analyses. This study raises concerns about the validity of both human-generated and GenAI-generated qualitative data analysis and reporting.

  18. Evaluate Drone-AI Models for Crowd & Traffic Monitoring - EDA

    • ai.tracebloc.io
    json
    Updated Nov 27, 2025
    Cite
    tracebloc (2025). Evaluate Drone-AI Models for Crowd & Traffic Monitoring - EDA [Dataset]. https://ai.tracebloc.io/explore/drones-object-detection-for-traffic-monitoring?tab=exploratory-data-analysis
    Explore at:
    jsonAvailable download formats
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    Tracebloc GmbH
    Authors
    tracebloc
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Missing Values
    Measurement technique
    Statistical and exploratory data analysis
    Description

    Discover, test and benchmark 3rd-party AI models for drone-based crowd and traffic detection — accuracy, latency & rare-object performance for enterprise use.

  19. Sentiment Analysis Dataset

    • cubig.ai
    zip
    Updated May 20, 2025
    Cite
    CUBIG (2025). Sentiment Analysis Dataset [Dataset]. https://cubig.ai/store/products/270/sentiment-analysis-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    May 20, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction • The Sentiment Analysis Dataset is a large-scale collection of tweets gathered from Twitter for sentiment analysis, with a sentiment polarity label (0 = negative, 2 = neutral, 4 = positive) for each tweet; labels were assigned automatically based on emoticons.

    2) Data Utilization (1) Sentiment Analysis Dataset has characteristics that: • Each sample consists of six columns: sentiment polarity, tweet ID, date of writing, search query, author, and tweet body, making it suitable for training natural language processing and classification models on tweet text and sentiment labels. (2) Sentiment Analysis Dataset can be used to: • Sentiment classification model development: using tweet text and polarity labels, build automatic classifiers for positive, negative, and neutral sentiment with machine learning and deep learning models such as logistic regression, SVM, RNN, and LSTM. • Analysis of social media opinion and trends: by analyzing the distribution of sentiment over time and by keyword, explore shifts in public opinion on specific issues or brands, positive and negative trends, and key sentiment keywords.
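    A minimal sketch of parsing the six-column layout described above with Python's csv module; the field order and the two sample rows are assumptions for illustration:

```python
import csv
import io

# Assumed field order: polarity, tweet id, date, query, user, text.
# The two sample rows are invented.
sample = io.StringIO(
    '0,"0001","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user1","so sad today"\n'
    '4,"0002","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user2","love this!"\n'
)

# Map the numeric polarity codes to readable labels.
label_map = {"0": "negative", "2": "neutral", "4": "positive"}
rows = [
    {"label": label_map[polarity], "text": text}
    for polarity, _id, _date, _query, _user, text in csv.reader(sample)
]
```

    The resulting (label, text) pairs feed directly into the classifiers mentioned above.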

  20. 80K+ Construction Site Images | AI Training Data | Machine Learning (ML)...

    • datarade.ai
    Cite
    Data Seeds, 80K+ Construction Site Images | AI Training Data | Machine Learning (ML) data | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/50k-construction-site-images-ai-training-data-machine-le-data-seeds
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset authored and provided by
    Data Seeds
    Area covered
    Russian Federation, Senegal, United Arab Emirates, Guatemala, Swaziland, Tunisia, Grenada, Venezuela (Bolivarian Republic of), Kenya, Peru
    Description

    This dataset features over 80,000 high-quality images of construction sites sourced from photographers worldwide. Built to support AI and machine learning applications, it delivers richly annotated and visually diverse imagery capturing real-world construction environments, machinery, and processes.

    Key Features:

    1. Comprehensive Metadata: the dataset includes full EXIF data such as aperture, ISO, shutter speed, and focal length. Each image is annotated with construction phase, equipment types, safety indicators, and human-activity context, making it ideal for object detection, site monitoring, and workflow analysis. Popularity metrics based on performance on our proprietary platform are also included.

    2. Unique Sourcing Capabilities: images are collected through a proprietary gamified platform, with competitions focused on industrial, construction, and labor themes. Custom datasets can be generated within 72 hours to target specific scenarios, such as building types, stages (excavation, framing, finishing), regions, or safety-compliance visuals.

    3. Global Diversity: sourced from contributors in over 100 countries, the dataset reflects a wide range of construction practices, materials, climates, and regulatory environments. It includes residential, commercial, industrial, and infrastructure projects from both urban and rural areas.

    4. High-Quality Imagery: includes a mix of wide-angle site overviews, close-ups of tools and equipment, drone shots, and candid human activity. Resolution ranges from standard to ultra-high definition, supporting both macro and contextual analysis.

    5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. These scores provide insight into visual clarity, engagement value, and human interest, which is useful for safety-focused or user-facing AI models.

    6. AI-Ready Design: the dataset is structured for training models in real-time object detection (e.g., helmets, machinery), construction progress tracking, material identification, and safety-compliance checks. It is compatible with standard ML frameworks used in construction tech.

    7. Licensing & Compliance: fully compliant with privacy, labor, and workplace-imagery regulations. Licensing is transparent and ready for commercial or research deployment.

    Use Cases:

    1. Training AI for safety-compliance monitoring and PPE detection.
    2. Powering progress-tracking and material-usage analysis tools.
    3. Supporting site mapping, autonomous machinery, and smart construction platforms.
    4. Enhancing augmented-reality overlays and digital-twin models for construction planning.

    This dataset provides a comprehensive, real-world foundation for AI innovation in construction technology, safety, and operational efficiency. Custom datasets are available on request. Contact us to learn more!
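To illustrate the kind of metadata-driven filtering the annotations enable (selecting images by construction phase or PPE visibility), here is a stdlib-only sketch. The annotation schema shown (`phase`, `equipment`, `ppe_visible`) is hypothetical; the dataset's actual field names are not documented here.

```python
import json

# Hypothetical per-image annotation records; the real schema may differ.
annotations = json.loads("""
[
  {"image_id": "img_001", "phase": "framing",    "equipment": ["crane"],     "ppe_visible": true},
  {"image_id": "img_002", "phase": "excavation", "equipment": ["excavator"], "ppe_visible": false},
  {"image_id": "img_003", "phase": "framing",    "equipment": ["scaffold"],  "ppe_visible": true}
]
""")

def filter_images(records, phase=None, ppe_visible=None):
    """Select image IDs matching a construction phase and/or PPE flag."""
    out = []
    for r in records:
        if phase is not None and r["phase"] != phase:
            continue
        if ppe_visible is not None and r["ppe_visible"] != ppe_visible:
            continue
        out.append(r["image_id"])
    return out

print(filter_images(annotations, phase="framing", ppe_visible=True))
# -> ['img_001', 'img_003']
```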

Shahzad Aslam (2025). AI Developer Performance Dataset [Dataset]. https://www.kaggle.com/datasets/zeesolver/ai-developer-dataset

AI Developer Performance Dataset

A Study of Coding Habits, Productivity, and AI Usage

Explore at:
Available download formats: zip (5992 bytes)
Dataset updated
May 27, 2025
Authors
Shahzad Aslam
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Context

This dataset contains 500 records and 9 features on the productivity of developers using AI tools. It is designed to analyze how factors such as working habits, caffeine intake, and AI usage affect developer performance.

Suggested Machine Learning Tasks

  • Binary classification (task_success)
  • Regression (e.g., predicting cognitive_load)
  • Clustering of work patterns
  • Correlation analysis & feature importance
  • Time series simulation & rolling averages (useful with synthetic date column)
  • Exploratory Data Analysis (EDA)
  • Anomaly detection (e.g., outliers in bugs_reported)
  • Multi-output regression (predicting commits and bugs_reported)
  • Dimensionality reduction (PCA or t-SNE for pattern visualization)
  • Decision rule extraction (e.g., tree-based rules for task_success)

🧠 Inspiration

Developers with balanced AI usage, sleep, and moderate coffee intake show higher task success. Overuse of AI or caffeine increases cognitive load, reducing effectiveness. Productivity thrives on smart work, not just hard work.

📊 Column Descriptions

  • hours_coding – Daily coding hours (float).
  • coffee_intake_mg – Daily caffeine intake in milligrams (integer).
  • distractions – Number of distractions experienced (integer).
  • sleep_hours – Average sleep hours per day (float).
  • commits – Number of code commits per day (integer).
  • bugs_reported – Number of bugs reported (integer).
  • ai_usage_hours – Daily AI tool usage hours (float).
  • cognitive_load – Measured cognitive load on a scale (float).
  • task_success – Binary variable indicating task completion success (1 = success, 0 = fail).
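The suggested anomaly-detection task (flagging outliers in `bugs_reported`) can be sketched with a simple z-score rule. The sample values below are illustrative; in practice the column would be read from the dataset's CSV.

```python
import statistics

def zscore_outliers(values, threshold=2.5):
    """Return indices whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)   # sample standard deviation
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Illustrative bugs_reported values; a 14-bug day stands out from the rest.
bugs_reported = [1, 2, 0, 3, 2, 1, 2, 14, 1, 2]
print(zscore_outliers(bugs_reported))   # -> [7]
```

A z-score threshold of 2.5 to 3 is a common starting point; robust alternatives (median absolute deviation, isolation forests) handle heavier-tailed count data better.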