8 datasets found
  1. Data from: Dataset for the mapping study "What do we mean by GenAI?"

    • data.niaid.nih.gov
    • produccioncientifica.usal.es
    • +1 more
    Updated Jul 20, 2023
    Cite
    Vázquez-Ingelmo, A.; García-Peñalvo, F. J. (2023). Dataset for the mapping study "What do we mean by GenAI?" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8162483
    Explore at:
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Universidad de Salamanca
    Authors
    Vázquez-Ingelmo, A.; García-Peñalvo, F. J.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset supports a literature mapping of AI-driven content generation, analyzing 631 solutions published over the last five years to better understand and characterize the Generative Artificial Intelligence landscape. Tools like ChatGPT, Dall-E, or Midjourney have democratized access to Large Language Models, enabling the creation of human-like content. However, the concept 'Generative Artificial Intelligence' lacks a universally accepted definition, leading to potential misunderstandings.

    The study has been published in the International Journal of Interactive Multimedia and Artificial Intelligence.

    García-Peñalvo, F. J., & Vázquez-Ingelmo, A. (2023). What do we mean by GenAI? A systematic mapping of the evolution, trends, and techniques involved in Generative AI. International Journal of Interactive Multimedia and Artificial Intelligence, In Press.

  2. Enterprise GenAI Adoption & Workforce Impact Data

    • kaggle.com
    zip
    Updated Jun 12, 2025
    Cite
    Rishi (2025). Enterprise GenAI Adoption & Workforce Impact Data [Dataset]. https://www.kaggle.com/datasets/tfisthis/enterprise-genai-adoption-and-workforce-impact-data
    Explore at:
    Available download formats: zip (3,081,470 bytes)
    Dataset updated
    Jun 12, 2025
    Authors
    Rishi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Enterprise GenAI Adoption & Workforce Impact Dataset (100K+ Rows)

    This dataset originates from a multi-year enterprise survey conducted across industries and countries. It focuses on the organizational effects of adopting Generative AI tools such as ChatGPT, Claude, Gemini, Mixtral, LLaMA, and Groq. The dataset captures detailed metrics on job role creation, workforce transformation, productivity changes, and employee sentiment.

    Data Schema

    columns = [
      "Company Name",           # Anonymized name
      "Industry",             # Sector (e.g., Finance, Healthcare)
      "Country",              # Country of operation
      "GenAI Tool",            # GenAI platform used
      "Adoption Year",           # Year of initial deployment (2022–2024)
      "Number of Employees Impacted",   # Affected staff count
      "New Roles Created",        # Number of AI-driven job roles introduced
      "Training Hours Provided",     # Upskilling time investment
      "Productivity Change (%)",     # % shift in reported productivity
      "Employee Sentiment"        # Textual feedback from employees
    ]
    

    Load the Dataset

    import pandas as pd

    # Load the full dataset; filename as distributed with the download
    df = pd.read_csv("Large_Enterprise_GenAI_Adoption_Impact.csv")
    df.shape  # (rows, columns)
    

    Basic Exploration

    df.head(10)                      # first rows
    df.describe()                    # summary statistics for numeric columns
    df["GenAI Tool"].value_counts()  # adoption counts per platform
    df["Industry"].unique()          # distinct sectors
    

    Filter Examples

    Filter by Year and Country

    df[(df["Adoption Year"] == 2023) & (df["Country"] == "India")]
    

    Get Top 5 Industries by Productivity Gain

    df.groupby("Industry")["Productivity Change (%)"].mean().sort_values(ascending=False).head()
    

    Text Analysis on Employee Sentiment

    Word Frequency Analysis

    from collections import Counter
    import re
    
    text = " ".join(df["Employee Sentiment"].dropna().tolist())
    words = re.findall(r'\b\w+\b', text.lower())
    common_words = Counter(words).most_common(20)
    print(common_words)
    

    Sentiment Length Distribution

    # fillna("") guards against missing sentiment entries
    df["Sentiment Length"] = df["Employee Sentiment"].fillna("").str.split().str.len()
    df["Sentiment Length"].hist(bins=50)
    

    Group-Based Insights

    Role Creation by Tool

    df.groupby("GenAI Tool")["New Roles Created"].mean().sort_values(ascending=False)
    

    Training Hours by Industry

    df.groupby("Industry")["Training Hours Provided"].mean().sort_values(ascending=False)
    

    Sample Use Cases

    • Evaluate GenAI adoption patterns by sector or region
    • Analyze workforce upskilling initiatives and investments
    • Explore employee reactions to AI integration using NLP
    • Build models to predict productivity impact based on tool, industry, or country
    • Study role creation trends to anticipate future AI-based job market shifts
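    As a quick sketch of the first and fourth use cases above, using hypothetical sample rows in place of the real CSV (column names are taken from the schema documented earlier; the values are made up):

```python
import pandas as pd

# Hypothetical sample rows standing in for the real CSV;
# column names follow the documented schema, values are invented.
sample = pd.DataFrame({
    "Industry": ["Finance", "Finance", "Healthcare", "Healthcare"],
    "Country": ["India", "USA", "India", "USA"],
    "Adoption Year": [2022, 2023, 2023, 2024],
    "Productivity Change (%)": [12.0, 8.5, 15.0, -2.0],
})

# Adoption counts by sector and region
adoption = pd.crosstab(sample["Industry"], sample["Country"])

# Mean productivity shift per sector: a baseline before any predictive modeling
prod_by_industry = sample.groupby("Industry")["Productivity Change (%)"].mean()
print(adoption)
print(prod_by_industry)
```

    On the full dataset the same `crosstab`/`groupby` calls apply unchanged to `df`.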
  3. Medical Artificial Intelligence text Detection in Multilingual settings...

    • datos.cchs.csic.es
    json, txt
    Updated Nov 7, 2025
    + more versions
    Cite
    CSIC (2025). Medical Artificial Intelligence text Detection in Multilingual settings (MedAID-ML) - Datos abiertos CCHS [Dataset]. https://datos.cchs.csic.es/en/dataset/ade96985-70e0-41d8-b69c-003013a24503
    Explore at:
    Available download formats: json, txt
    Dataset updated
    Nov 7, 2025
    Dataset provided by
    Spanish National Research Council (http://www.csic.es/)
    Authors
    CSIC
    License

    https://digital.csic.es/handle/10261/389309

    Description

    This dataset was created by gathering human-authored corpora from several public health sites and generating additional data via three different LLMs: GPT-4o, Mistral-7B and Llama3-1. We included texts in English, Spanish, German and French data from the biomedical domain. The current version gathers 50% AI-generated and 50% human-written texts. The following are the data we used:

    • Cochrane Library: This is a database of meta-analyses and systematic reviews of updated results of clinical studies. We used abstracts of systematic reviews in all four languages.

    • European Clinical Trials (EUCT): This is the European Union (EU) register of clinical trial information. We downloaded data from clinical trial protocols and eligibility criteria, and ensured the data were published only from January 2025 to date. The goal was to gather data that might not have been used to train the LLMs in our experiments.

    • European Medicines Agency (EMA): This is the agency that supervises and evaluates pharmaceutical products in the European Union (EU). We downloaded parallel data from public assessment reports (EPARs) for 12 new medicinal products, and data from clinical trial protocols and eligibility criteria. We ensured the data were published only from January 2025 to date. The goal was to gather data that might not have been used to train the LLMs in our experiments.

    • European Food Safety Authority (EFSA): This website provides a comprehensive range of data about food consumption and chemical/biological monitoring. We chose only the topics we deemed necessary for our goals, a total of 51 topics. Processing: we manually split articles with a word count above 1,350 and manually ensured their correctness and alignment across all languages.

    • European Vaccination Information Portal (EVIP): This portal provides up-to-date information on vaccines and vaccination. The factsheets are available in all languages and consist of 20 texts each.

    • Immunize: Immunize.org (formerly known as the Immunization Action Coalition) is a U.S.-based organization dedicated to providing comprehensive immunization resources for healthcare professionals and the public. Vaccine Information Sheets (VISs) have been translated into several languages, but not all of them contain all VISs. They are given as PDFs, with 25 in Spanish, French and English, but only 21 in German. Only PDFs overlapping in all languages were used.

    • Migration und Gesundheit - German Ministry of Health (BFG): This portal provides multilingual health information tailored for migrants and refugees. Gesundheit für alle is a PDF file that provides a guide to the German healthcare system, and it is available in Spanish, English and German. Processing: Two topics, which were shorter than 100 words, were merged with the next one to ensure that context is preserved.

    • Orphadata (INSERM): a comprehensive knowledge base about rare diseases and orphan drugs, in re-usable and high-quality formats, released in 12 official EU languages. We gathered definitions, signs and symptoms, and phenotypes for 4,389 rare diseases in English, German, Spanish and French. Processing: since each definition has roughly the same size and format, we simply grouped five definitions together to make the text per topic longer.

    • PubMed (National Library of Medicine): we downloaded abstracts available in English, Spanish, French and German.

    • Wikipedia: a free, web-based, collaborative multilingual encyclopedia project; we selected (bio)medical contents available in English, German, Spanish and French. To ensure that the texts were not automatically generated, we only used articles that date back to before the release of ChatGPT, i.e. before 30 November 2022. Processing: some data cleaning was necessary; we also removed all topics with fewer than 5 words, and split those with more than 9 sentences into equally long parts. From these split files, we made sure that they contain a minimum of 100 words, and we took only those contents or topics that exist in all three languages.

    [Description of methods used for collection/generation of data] The corpus statistics and methods are explained in the following article: Patrick Styll, Leonardo Campillos-Llanos, Jorge Fernández-García, Isabel Segura-Bedmar (2025) "MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content". Under review.

    [Methods for processing the data]
    • Web-scraping of data from HTML content and PDF files available on the health websites.
    • Postprocessing and cleaning of data (e.g., removal of redundant white spaces or line breaks), and homogenization of text length.
    • Generation of corresponding contents by means of generative AI, using three large language models: GPT-4o, Mistral-7B and Llama3-1.
    • Formatting of contents into JSON.

    [Files] 1) JSON files: These are separated into TRAIN and TEST. Each file has a list of hashes for each text, and each hash contains the following fields:
    • text: the textual content.
    • data_source: the source repository of the text.
    • filename: the name of the original file from which the data were sourced.
    • source: label indicating whether the text is human-written (HUMAN) or which LLM generated it ("gpt4o", "mistral" or "llama").
    • language: the language code of the text: German ("de"), English ("en"), Spanish ("es") or French ("fr").
    • target: a binary label coding whether the text is written by humans ("0") or AI ("1").
    • ratio: the proportion of the text that was created with AI: "0.5" for AI-generated texts, and "null" for human texts.
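    A minimal sketch of reading these fields, using a hypothetical two-record excerpt that mirrors the layout described above (record contents are invented; the real TRAIN/TEST files use the same keys):

```python
import json

# Hypothetical excerpt: a mapping from hash to record, with the fields listed above.
raw = """
{
  "a1b2": {"text": "Systematic review abstract ...", "data_source": "Cochrane",
           "filename": "cochrane_001.txt", "source": "HUMAN",
           "language": "en", "target": "0", "ratio": null},
  "c3d4": {"text": "Generated medical text ...", "data_source": "PubMed",
           "filename": "pubmed_042.txt", "source": "gpt4o",
           "language": "es", "target": "1", "ratio": "0.5"}
}
"""
records = json.loads(raw)

# Split human-written vs AI-generated texts via the binary target label.
human = [r for r in records.values() if r["target"] == "0"]
ai = [r for r in records.values() if r["target"] == "1"]
print(len(human), len(ai))
```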

    The corpus is made up of 13,292 comparable and parallel texts in four languages: German, English, Spanish and French. The total token count is 3,795,449 tokens. This resource is aimed at training and evaluating models to detect medical texts created by means of generative artificial intelligence.

  4. Data from: Dataset of annotated headword-synonym-distractor triplets SYNDIST...

    • live.european-language-grid.eu
    binary format
    Updated Nov 9, 2025
    Cite
    (2025). Dataset of annotated headword-synonym-distractor triplets SYNDIST [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/23910
    Explore at:
    Available download formats: binary format
    Dataset updated
    Nov 9, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect answer/alternative to the synonym, which can be similar to the synonym in meaning and/or form. Headwords and their synonyms were obtained from the Thesaurus of Modern Slovene (http://hdl.handle.net/11356/1916), which is part of the Dictionary Database of Slovene (the database is available via API: https://wiki.cjvt.si/books/digital-dictionary-database-of-slovene). The criteria for selecting the headwords (nouns, adjectives, verbs, and adverbs) were that they had to be frequent and had to have several synonyms, preferably more than five.

    The distractors were obtained with the Gemini-2.0-flash (https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash) model, using the following prompt: "You are given headword and a synonym. Create a distractor — a word that looks similar to the synonym but has a different meaning. The distractor must be the same part of speech as the synonym (e.g., if the synonyms are verbs in their base form, the distractor must also be a verb in its base form). The distractor must not include sensitive vocabulary (e.g., words related to minorities, religion, sexual content, violence, etc.). The distractor must be a frequent word in the Slovene language. The distractor must look similar to the synonym but have a different meaning. Write the distractor in the same line as the headword and synonym, following this format: živahen - vesel - resen. These are the headword and synonym: {word} - {synonym} The distractor cannot be one of these words: {synonym_set}."

    The manual evaluation of all the distractors (with the exception of the distractors that were identified as existing synonyms in the Thesaurus) was conducted by two lexicographers. Each of them evaluated their own part, with the second one also subsequently inspecting the evaluations of the first one. The estimate is that around 30-35% of the data was evaluated by both lexicographers. Five decisions were used: good distractor, bad distractor, problematic (i.e. difficult to decide due to certain characteristics, such as being too similar to the synonym, or the word being too archaic or informal), same as synonym, and synonym candidate (likely a legitimate (new) synonym of the headword).

    The dataset also includes information on the frequency of the synonyms and the distractors in the Gigafida 2.0 reference corpus of Slovene (http://hdl.handle.net/11356/1320). The frequency information is provided for single-word lemmas only (not for multiword items or non-lemma single-word forms such as plural forms of nouns or comparatives of adjectives). In addition, information on the similarity between the headwords and synonyms, and between the synonyms and distractors, is provided. Similarity is calculated using Gestalt pattern matching.
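    For reference, Python's difflib.SequenceMatcher implements Gestalt (Ratcliff/Obershelp-style) pattern matching, so similarity scores of this kind can be reproduced along the following lines. The triplet below is the example given in the generation prompt; whether the dataset applies any extra normalization before matching is an assumption:

```python
from difflib import SequenceMatcher

# Example triplet from the generation prompt: headword - synonym - distractor
headword, synonym, distractor = "živahen", "vesel", "resen"

# Gestalt pattern matching similarity, in [0, 1]
sim = SequenceMatcher(None, synonym, distractor).ratio()
print(round(sim, 3))
```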

  5. Hyperparameters of the proposed model.

    • plos.figshare.com
    xls
    Updated Sep 17, 2025
    + more versions
    Cite
    Ayesha Siddiqa; Nadim Rana; Wazir Zada Khan; Fathe Jeribi; Ali Tahir (2025). Hyperparameters of the proposed model. [Dataset]. http://doi.org/10.1371/journal.pone.0331516.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 17, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ayesha Siddiqa; Nadim Rana; Wazir Zada Khan; Fathe Jeribi; Ali Tahir
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accurate and interpretable solar power forecasting is critical for effectively integrating Photo-Voltaic (PV) systems into modern energy infrastructure. This paper introduces a novel two-stage hybrid framework that couples deep learning-based time series prediction with generative Large Language Models (LLMs) to enhance forecast accuracy and model interpretability. At its core, the proposed SolarTrans model leverages a lightweight Transformer-based encoder-decoder architecture tailored for short-term DC power prediction using multivariate inverter and weather data, including irradiance, ambient and module temperatures, and temporal features. Experiments conducted on publicly available datasets from two PV plants over 34 days demonstrate strong predictive performance. The SolarTrans model achieves a Mean Absolute Error (MAE) of 0.0782 and 0.1544, Root Mean Squared Error (RMSE) of 0.1760 and 0.4424, and R2 scores of 0.9692 and 0.7956 on Plant 1 and Plant 2, respectively. On the combined dataset, the model yields an MAE of 0.1105, RMSE of 0.3189, and R2 of 0.8967. To address the interpretability challenge, we fine-tuned the Flan-T5 model on structured prompts derived from domain-informed templates and forecast outputs. The resulting explanation module achieves ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores of 0.7889, 0.7211, 0.7759, and 0.7771, respectively, along with a BLEU score of 0.6558, indicating high-fidelity generation of domain-relevant natural language explanations.
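    The error metrics reported above follow their standard definitions; a small pure-Python illustration on toy values (not the paper's data) shows how they relate:

```python
import math

# Toy ground truth and predictions; not from the paper's datasets.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.9]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n             # Mean Absolute Error
rmse = math.sqrt(sum(e * e for e in errors) / n)  # Root Mean Squared Error

mean_t = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_t) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot                          # Coefficient of determination

print(mae, rmse, r2)
```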

  6. Data from: Exploring the Concept of Valence and the Nature of Science via...

    • figshare.com
    • acs.figshare.com
    xlsx
    Updated Jul 29, 2024
    Cite
    Rebecca M. Jones; Eva-Maria Rudler; Conner Preston (2024). Exploring the Concept of Valence and the Nature of Science via Generative Artificial Intelligence and General Chemistry Textbooks [Dataset]. http://doi.org/10.1021/acs.jchemed.4c00271.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    ACS Publications
    Authors
    Rebecca M. Jones; Eva-Maria Rudler; Conner Preston
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Like science itself, our understanding of chemical concepts and the way we teach them change over time. This paper explores historical and modern perspectives of the concept of valence in the context of collegiate general chemistry and draws comparisons to responses from generative artificial intelligence (genAI) tools such as ChatGPT. A fundamental concept in chemistry, valence in the early and mid-20th century was primarily defined as the “combining capacity” of atoms. Twenty-first century textbooks do not include this historical definition but rather use valence as an adjective to modify other nouns, e.g., valence electron or valence orbital. To explore these different perspectives in other information sources that could be used by students, we used a systematic series of prompts about valence to analyze the responses from ChatGPT, Bard, Liner, and ChatSonic from September and December 2023. Our findings show the historical definition is very common in responses to prompts which use valence or valency as a noun but less common when prompts include valence as an adjective. Regarding this concept, the state-of-the-art genAI tools are more consistent with textbooks from the 1950s than modern collegiate general chemistry textbooks. These findings present an opportunity for chemistry educators to observe and discuss with students the nature of science and how our understanding of chemistry changes. Including implications for educators, we present an example activity that may be deployed in general chemistry classes.

  7. Statistical analysis comparing synthetic data tables to the real training...

    • plos.figshare.com
    xls
    Updated Jun 20, 2023
    Cite
    Anmol Arora; Ananya Arora (2023). Statistical analysis comparing synthetic data tables to the real training dataset (n = 2408). [Dataset]. http://doi.org/10.1371/journal.pone.0283094.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Anmol Arora; Ananya Arora
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Presented are the propensity score mean-squared error and the standardised ratio of the propensity score mean-squared error.

  8. Data from: Parameter settings.

    • figshare.com
    xls
    Updated Nov 11, 2025
    + more versions
    Cite
    Yang Gao; Zhiqun Lin (2025). Parameter settings. [Dataset]. http://doi.org/10.1371/journal.pone.0333607.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 11, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Yang Gao; Zhiqun Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study aims to enhance the recommendation system’s capability in addressing cold start issues, semantic understanding, and modeling the diversity of user interests. The study proposes a movie recommendation algorithm framework that integrates Knowledge Graph Embedding via Dynamic Mapping Matrix (TransD) and Artificial Intelligence Generated Content (AIGC)-based generative semantic modeling. This framework is designed to overcome existing challenges in recommendation algorithms, including insufficient user interest representation, inadequate knowledge graph relationship modeling, and limited diversity in recommended content. Traditional recommendation models face three key limitations, including coarse-grained user profiling, reliance on manually generated tags, and inadequate exploitation of structured information. To address these challenges, this study employs the TransD model for dynamic semantic modeling of heterogeneous entities and their complex relationships. Additionally, AIGC technology is employed to automatically extract latent interest dimensions, emotional characteristics, and semantic tags from user reviews, thereby constructing a high-dimensional user interest profile and a content tag completion system. Experiments are conducted using the MovieLens 100K, 1M, and 10M public datasets, with evaluation metrics including Mean Average Precision (MAP), user satisfaction scores, content coverage, click-through rate (CTR), and recommendation trust scores. The results demonstrate that the optimized model achieves hit rates of 0.878, 0.878, and 0.798, and MAP scores of 0.633, 0.637, and 0.574 across the three datasets. The user satisfaction scores are 0.89, 0.88, and 0.87, while the CTR values reach 0.35, 0.33, and 0.34, all of which significantly outperform traditional models. Notably, the proposed approach exhibits superior stability and semantic adaptability, particularly in cold start user scenarios and interest transition contexts. 
    Therefore, this study provides a novel modeling approach that integrates structured and unstructured information for movie recommendation systems. Also, it contributes both theoretically and practically to the research fields of intelligent recommendation systems, knowledge graph embedding, and AIGC-based hybrid modeling.
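    As an illustration of one of the evaluation metrics listed above, average precision (the per-query quantity behind MAP) can be computed as follows; the ranked relevance list is a made-up example, not the paper's data:

```python
def average_precision(relevances):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant rank
    return sum(precisions) / hits if hits else 0.0

# Made-up ranking: items at positions 1, 3 and 4 are relevant
ap = average_precision([1, 0, 1, 1])
print(round(ap, 4))
```

    MAP is then the mean of this quantity over all users or queries.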

