Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports a literature mapping of AI-driven content generation, analyzing 631 solutions published over the last five years to better understand and characterize the Generative Artificial Intelligence landscape. Tools like ChatGPT, Dall-E, or Midjourney have democratized access to Large Language Models, enabling the creation of human-like content. However, the concept 'Generative Artificial Intelligence' lacks a universally accepted definition, leading to potential misunderstandings.
The study has been published in the International Journal of Interactive Multimedia and Artificial Intelligence.
García-Peñalvo, F. J., & Vázquez-Ingelmo, A. (2023). What do we mean by GenAI? A systematic mapping of the evolution, trends, and techniques involved in Generative AI. International Journal of Interactive Multimedia and Artificial Intelligence, In Press.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset originates from a multi-year enterprise survey conducted across industries and countries. It focuses on the organizational effects of adopting Generative AI tools such as ChatGPT, Claude, Gemini, Mixtral, LLaMA, and Groq. The dataset captures detailed metrics on job role creation, workforce transformation, productivity changes, and employee sentiment.
columns = [
"Company Name", # Anonymized name
"Industry", # Sector (e.g., Finance, Healthcare)
"Country", # Country of operation
"GenAI Tool", # GenAI platform used
"Adoption Year", # Year of initial deployment (2022–2024)
"Number of Employees Impacted", # Affected staff count
"New Roles Created", # Number of AI-driven job roles introduced
"Training Hours Provided", # Upskilling time investment
"Productivity Change (%)", # % shift in reported productivity
"Employee Sentiment" # Textual feedback from employees
]
import pandas as pd

# Load the survey data
df = pd.read_csv("Large_Enterprise_GenAI_Adoption_Impact.csv")

df.shape                          # dataset dimensions (rows, columns)
df.head(10)                       # preview the first 10 records
df.describe()                     # summary statistics for the numeric columns
df["GenAI Tool"].value_counts()   # adoption counts per GenAI platform
df["Industry"].unique()           # sectors covered by the survey

# Companies in India that adopted a GenAI tool in 2023
df[(df["Adoption Year"] == 2023) & (df["Country"] == "India")]

# Industries ranked by average reported productivity change
df.groupby("Industry")["Productivity Change (%)"].mean().sort_values(ascending=False).head()

from collections import Counter
import re

# Most frequent words across the free-text employee feedback
text = " ".join(df["Employee Sentiment"].dropna().tolist())
words = re.findall(r'\b\w+\b', text.lower())
common_words = Counter(words).most_common(20)
print(common_words)

# Feedback length in words (missing feedback treated as empty)
df["Sentiment Length"] = df["Employee Sentiment"].fillna("").apply(lambda x: len(x.split()))
df["Sentiment Length"].hist(bins=50)

# Average new roles created per GenAI tool and average training hours per industry
df.groupby("GenAI Tool")["New Roles Created"].mean().sort_values(ascending=False)
df.groupby("Industry")["Training Hours Provided"].mean().sort_values(ascending=False)
https://digital.csic.es/handle/10261/389309
This dataset was created by gathering human-authored corpora from several public health sites and generating additional data with three different LLMs: GPT-4o, Mistral-7B and Llama 3.1. We included English, Spanish, German and French texts from the biomedical domain. The current version comprises 50% AI-generated and 50% human-written texts. The following are the data sources we used:
Cochrane Library: This is a database of regularly updated systematic reviews and meta-analyses of clinical studies. We used abstracts of systematic reviews in all four languages.
European Clinical Trials (EUCT): This is the European Union (EU) portal for information on clinical trials. We downloaded data from clinical trial protocols and eligibility criteria. We ensured the data had been published only from January 2025 onwards, with the goal of gathering data that might not have been used to train the LLMs in our experiments.
European Medicines Agency (EMA): This agency supervises and evaluates pharmaceutical products in the European Union (EU). We downloaded parallel data from public assessment reports (EPARs) for 12 new medicinal products. We ensured the data had been published only from January 2025 onwards, with the goal of gathering data that might not have been used to train the LLMs in our experiments.
European Food Safety Authority (EFSA): This website provides a comprehensive range of data about food consumption and chemical/biological monitoring. We chose only the topics we deemed necessary for our goals, giving a total of 51 topics. Processing: we manually split articles with a word count above 1350 and manually ensured their correctness and alignment in all languages.
European Vaccination Information Portal (EVIP): This portal provides up-to-date information on vaccines and vaccination. The factsheets are available in all languages and consist of 20 texts each.
Immunize: Immunize.org (formerly known as the Immunization Action Coalition) is a U.S.-based organization dedicated to providing comprehensive immunization resources for healthcare professionals and the public. Vaccine Information Sheets (VISs) have been translated into several languages, but not all languages contain all VISs. They are provided as PDFs, with 25 in Spanish, French and English, but only 21 in German. Only PDFs available in all languages were used.
Migration und Gesundheit - German Ministry of Health (BFG): This portal provides multilingual health information tailored for migrants and refugees. Gesundheit für alle ("Health for all") is a PDF guide to the German healthcare system, available in Spanish, English and German. Processing: two topics that were shorter than 100 words were merged with the following topic to preserve context.
Orphadata (INSERM): a comprehensive knowledge base about rare diseases and orphan drugs, in re-usable and high-quality formats, released in 12 official EU languages. We gathered definitions, signs and symptoms, and phenotypes for 4389 rare diseases in English, German, Spanish and French. Processing: since each definition has roughly the same size and format, we grouped 5 definitions together to make the text per topic longer.
PubMed (National Library of Medicine): we downloaded abstracts available in English, Spanish, French and German.
Wikipedia: a free, web-based, collaborative multilingual encyclopedia; we selected (bio)medical contents available in English, German, Spanish and French. To ensure that the texts were not automatically generated, we only used articles dating back to before the release of ChatGPT, i.e. before 30 November 2022. Processing: some data cleaning was necessary; we also removed all topics with fewer than 5 words and split those with more than 9 sentences into equally long parts. We ensured that each split file contained a minimum of 100 words, and we kept only those contents or topics that exist in all languages.
[Description of methods used for collection/generation of data] The corpus statistics and methods are explained in the following article: Patrick Styll, Leonardo Campillos-Llanos, Jorge Fernández-García, Isabel Segura-Bedmar (2025) "MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content". Under review.
[Methods for processing the data]
- Web scraping of data from HTML content and PDF files available on the source health websites.
- Postprocessing and cleaning of data (e.g., removal of redundant white spaces or line breaks) and homogenization of text length.
- Generation of corresponding contents by means of generative AI using three large language models: GPT-4o, Mistral-7B and Llama 3.1.
- Formatting of contents into JSON.
[Files] 1) JSON files: These are separated into TRAIN and TEST. Each file has a list of hashes, one per text, and each hash contains the following fields:
• "text": the textual content.
• "data_source": the source repository of the text.
• "filename": the name of the original file from which the data were sourced.
• "source": label indicating whether the text is human-written (HUMAN) or which LLM generated it ("gpt4o", "mistral" or "llama").
• "language": the language code of the text: German ("de"), English ("en"), Spanish ("es") or French ("fr").
• "target": a binary label coding whether the text was written by humans ("0") or AI ("1").
• "ratio": the proportion of the text that was created with AI: "0.5" for AI-generated texts and "null" for human texts.
The corpus is made up of 13,292 comparable and parallel texts in four languages: German, English, Spanish and French. The total token count is 3,795,449. This resource is aimed at training and evaluating models to detect medical texts created by means of generative artificial intelligence.
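As a minimal sketch of how the JSON files described above can be consumed, the snippet below loads one split and tallies texts by language and source. The file name and the assumption that the top level maps each hash to its record are hypothetical; only the field names come from the documentation.

import json
from collections import Counter

# "TRAIN.json" is a placeholder name; the documentation only states that TRAIN and TEST JSON files exist.
with open("TRAIN.json", encoding="utf-8") as f:
    data = json.load(f)

# Assumed structure: a mapping from text hash to record; adapt if the top level is a list instead.
records = list(data.values()) if isinstance(data, dict) else data

print(Counter(r["language"] for r in records))   # texts per language ("de", "en", "es", "fr")
print(Counter(r["source"] for r in records))     # HUMAN vs "gpt4o", "mistral", "llama"

# Separate human-written (target "0") from AI-generated (target "1") texts
human_texts = [r["text"] for r in records if str(r["target"]) == "0"]
ai_texts = [r["text"] for r in records if str(r["target"]) == "1"]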
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect answer/alternative to the synonym, which can be similar to the synonym in meaning and/or form. Headwords and their synonyms were obtained from the Thesaurus of Modern Slovene (http://hdl.handle.net/11356/1916), which is part of the Dictionary Database of Slovene (the database is available via API: https://wiki.cjvt.si/books/digital-dictionary-database-of-slovene). The criteria for selecting the headwords (nouns, adjectives, verbs, and adverbs) were that they had to be frequent and had to have several synonyms, preferably more than five.
The distractors were obtained with the Gemini-2.0-flash (https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash) model, using the following prompt: "You are given headword and a synonym. Create a distractor — a word that looks similar to the synonym but has a different meaning. The distractor must be the same part of speech as the synonym (e.g., if the synonyms are verbs in their base form, the distractor must also be a verb in its base form). The distractor must not include sensitive vocabulary (e.g., words related to minorities, religion, sexual content, violence, etc.). The distractor must be a frequent word in the Slovene language. The distractor must look similar to the synonym but have a different meaning. Write the distractor in the same line as the headword and synonym, following this format: živahen - vesel - resen. These are the headword and synonym: {word} - {synonym} The distractor cannot be one of these words: {synonym_set}."
The manual evaluation of all the distractors (with the exception of the distractors that were identified as existing synonyms in the Thesaurus) was conducted by two lexicographers. Each of them evaluated their own part, with the second one also subsequently inspecting the evaluations of the first one. The estimate is that around 30-35% of the data was evaluated by both lexicographers. Five decision categories were used: good distractor, bad distractor, problematic (i.e. difficult to decide due to certain characteristics, such as being too similar to the synonym, or the word being too archaic or informal), same as synonym, and synonym candidate (likely a legitimate (new) synonym of the headword).
The dataset also includes information on the frequency of the synonyms and the distractors in the Gigafida 2.0 reference corpus of Slovene (http://hdl.handle.net/11356/1320). The frequency information is provided for single-word lemmas only (and not for multiword items or non-lemma single-word forms, such as plural forms of nouns or comparatives of adjectives). In addition, information on the similarity between the headwords and synonyms, and between the synonyms and distractors, is provided. Similarity is calculated using Gestalt pattern matching.
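Gestalt pattern matching is the Ratcliff/Obershelp algorithm implemented by Python's difflib; the sketch below shows how such a similarity score can be reproduced (whether the dataset lowercases or otherwise normalises the strings first is not specified, so treat this as an approximation). The example words are taken from the prompt above.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1] from difflib's Gestalt (Ratcliff/Obershelp) pattern matching;
    # lowercasing is an assumption, not a documented preprocessing step.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("živahen", "vesel"))   # headword vs synonym
print(similarity("vesel", "resen"))     # synonym vs distractor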
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate and interpretable solar power forecasting is critical for effectively integrating photovoltaic (PV) systems into modern energy infrastructure. This paper introduces a novel two-stage hybrid framework that couples deep learning-based time series prediction with generative Large Language Models (LLMs) to enhance forecast accuracy and model interpretability. At its core, the proposed SolarTrans model leverages a lightweight Transformer-based encoder-decoder architecture tailored for short-term DC power prediction using multivariate inverter and weather data, including irradiance, ambient and module temperatures, and temporal features. Experiments conducted on publicly available datasets from two PV plants over 34 days demonstrate strong predictive performance. The SolarTrans model achieves a Mean Absolute Error (MAE) of 0.0782 and 0.1544, Root Mean Squared Error (RMSE) of 0.1760 and 0.4424, and R² scores of 0.9692 and 0.7956 on Plant 1 and Plant 2, respectively. On the combined dataset, the model yields an MAE of 0.1105, RMSE of 0.3189, and R² of 0.8967. To address the interpretability challenge, we fine-tuned the Flan-T5 model on structured prompts derived from domain-informed templates and forecast outputs. The resulting explanation module achieves ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores of 0.7889, 0.7211, 0.7759, and 0.7771, respectively, along with a BLEU score of 0.6558, indicating high-fidelity generation of domain-relevant natural language explanations.
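For reference, the error metrics quoted above follow their standard definitions; the NumPy sketch below computes them for an arbitrary forecast and is not the authors' evaluation code.

import numpy as np

def forecast_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    # Standard MAE, RMSE and R² definitions; illustrative only.
    err = y_true - y_pred
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    r2 = 1.0 - float(np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    return {"MAE": mae, "RMSE": rmse, "R2": r2}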
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Like science itself, our understanding of chemical concepts and the way we teach them change over time. This paper explores historical and modern perspectives of the concept of valence in the context of collegiate general chemistry and draws comparisons to responses from generative artificial intelligence (genAI) tools such as ChatGPT. A fundamental concept in chemistry, valence in the early and mid-20th century was primarily defined as the “combining capacity” of atoms. Twenty-first century textbooks do not include this historical definition but rather use valence as an adjective to modify other nouns, e.g., valence electron or valence orbital. To explore these different perspectives in other information sources that could be used by students, we used a systematic series of prompts about valence to analyze the responses from ChatGPT, Bard, Liner, and ChatSonic from September and December 2023. Our findings show the historical definition is very common in responses to prompts which use valence or valency as a noun but less common when prompts include valence as an adjective. Regarding this concept, the state-of-the-art genAI tools are more consistent with textbooks from the 1950s than modern collegiate general chemistry textbooks. These findings present an opportunity for chemistry educators to observe and discuss with students the nature of science and how our understanding of chemistry changes. Including implications for educators, we present an example activity that may be deployed in general chemistry classes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Presented are the propensity score mean-squared error and the standardised ratio of the propensity score mean-squared error.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study aims to enhance the recommendation system’s capability in addressing cold start issues, semantic understanding, and modeling the diversity of user interests. The study proposes a movie recommendation algorithm framework that integrates Knowledge Graph Embedding via Dynamic Mapping Matrix (TransD) and Artificial Intelligence Generated Content (AIGC)-based generative semantic modeling. This framework is designed to overcome existing challenges in recommendation algorithms, including insufficient user interest representation, inadequate knowledge graph relationship modeling, and limited diversity in recommended content. Traditional recommendation models face three key limitations: coarse-grained user profiling, reliance on manually generated tags, and inadequate exploitation of structured information. To address these challenges, this study employs the TransD model for dynamic semantic modeling of heterogeneous entities and their complex relationships. Additionally, AIGC technology is employed to automatically extract latent interest dimensions, emotional characteristics, and semantic tags from user reviews, thereby constructing a high-dimensional user interest profile and a content tag completion system. Experiments are conducted using the MovieLens 100K, 1M, and 10M public datasets, with evaluation metrics including Mean Average Precision (MAP), user satisfaction scores, content coverage, click-through rate (CTR), and recommendation trust scores. The results demonstrate that the optimized model achieves hit rates of 0.878, 0.878, and 0.798, and MAP scores of 0.633, 0.637, and 0.574 across the three datasets. The user satisfaction scores are 0.89, 0.88, and 0.87, while the CTR values reach 0.35, 0.33, and 0.34, all of which significantly outperform traditional models. Notably, the proposed approach exhibits superior stability and semantic adaptability, particularly in cold start user scenarios and interest transition contexts. This study therefore provides a novel modeling approach that integrates structured and unstructured information for movie recommendation systems, and it contributes both theoretically and practically to the research fields of intelligent recommendation systems, knowledge graph embedding, and AIGC-based hybrid modeling.
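For readers unfamiliar with TransD, the model scores a (head, relation, tail) triple by projecting the entity embeddings into the relation space through dynamic mapping matrices built from projection vectors. The NumPy sketch below illustrates the standard TransD scoring function only; it is not the paper's implementation, and the toy dimensions are arbitrary.

import numpy as np

def transd_score(h, h_p, t, t_p, r, r_p):
    # h, t: entity embeddings (dim n); h_p, t_p: entity projection vectors (dim n)
    # r: relation embedding (dim m); r_p: relation projection vector (dim m)
    n, m = h.shape[0], r.shape[0]
    I = np.eye(m, n)
    M_rh = np.outer(r_p, h_p) + I        # dynamic mapping matrix for the head entity
    M_rt = np.outer(r_p, t_p) + I        # dynamic mapping matrix for the tail entity
    h_proj, t_proj = M_rh @ h, M_rt @ t  # project entities into the relation space
    return -float(np.linalg.norm(h_proj + r - t_proj) ** 2)  # higher = more plausible triple

# Toy example with random embeddings (illustrative only)
rng = np.random.default_rng(0)
n, m = 8, 6
print(transd_score(rng.normal(size=n), rng.normal(size=n),
                   rng.normal(size=n), rng.normal(size=n),
                   rng.normal(size=m), rng.normal(size=m)))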