MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset takes the original data from the following contributions:
https://www.kaggle.com/datasets/radek1/llm-generated-essays
https://www.kaggle.com/datasets/alejopaullier/argugpt
https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b
https://www.kaggle.com/datasets/thedrcat/daigt-proper-train-dataset
https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset
https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic
https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
https://www.kaggle.com/datasets/nbroad/persaude-corpus-2
Some of those datasets already include the others, so I first removed duplicates by comparing full texts. After that the data augmentation took place, a two-step process that was iterated over and over: first I corrected typos in the texts using language_tool_python, then I introduced noise the way the organizers appear to have done it (see https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/452279), then I corrected typos again, and so on. After repeating these steps a couple of times I removed duplicates once more, again comparing by full text.
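A minimal sketch of one correct-then-noise iteration of this process, assuming language_tool_python's correct() call and a made-up character-level noise function as a stand-in for the organizers' exact procedure:

```python
import random
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def add_noise(text: str, rate: float = 0.01) -> str:
    """Hypothetical character-level noise: randomly drop or swap adjacent characters."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if random.random() < rate and i + 1 < len(chars):
            if random.random() < 0.5:
                i += 1  # drop this character
                continue
            chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap with the next one
        out.append(chars[i])
        i += 1
    return "".join(out)

def augment(text: str, iterations: int = 2) -> str:
    """One possible reading of the correct -> noise -> correct loop described above."""
    for _ in range(iterations):
        text = tool.correct(text)  # fix typos with LanguageTool
        text = add_noise(text)     # re-introduce noise
    return tool.correct(text)      # final typo-correction pass
```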
The result is this dataset. It is split into train and test because I wanted to prevent information leaking between the two sections, so I performed the steps independently on each of them (I split them before doing the data augmentation). If you don't care about train and test, you can just concatenate both into a single dataset for training purposes.
If you find this dataset helpful, please upvote.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains 1,165 rows, each corresponding to a respondent (including LLM-generated respondents) in our study. It contains 21 columns. The first column, "Generating Model," specifies the model or source (e.g., "Human") that generated the responses. The remaining 20 columns (Q1 to Q20) indicate the correctness of answers to 20 college algebra questions for each respondent. "TRUE" means the respondent answered correctly, "FALSE" indicates an incorrect answer, and N/A represents missing data (i.e., no response). The dataset includes responses from seven different generating models:
Human: 265 responses
GPT-4: 150 responses
GPT-3.5: 150 responses
Llama 3: 150 responses
Llama 2: 150 responses
Gemini: 150 responses
Cohere: 150 responses
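For illustration, a minimal pandas sketch that computes per-model accuracy from this layout; the file name responses.csv is a placeholder, and missing (N/A) answers are simply skipped:

```python
import pandas as pd

# Hypothetical file name; columns: "Generating Model", then Q1..Q20 with TRUE/FALSE/N/A
df = pd.read_csv("responses.csv")

question_cols = [f"Q{i}" for i in range(1, 21)]

# Map TRUE/FALSE (whether parsed as strings or booleans) to 1.0/0.0; anything else becomes NaN
answers = df[question_cols].apply(
    lambda col: col.map({True: 1.0, False: 0.0, "TRUE": 1.0, "FALSE": 0.0})
)

# Mean over answered questions per respondent, then aggregated per generating model
df["accuracy"] = answers.mean(axis=1)
print(df.groupby("Generating Model")["accuracy"].agg(["mean", "count"]))
```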
This dataset was created by JISU KIM8873
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Background: Alzheimer’s disease and related dementias (ADRD) affect nearly five million older adults in the United States, yet more than half remain undiagnosed. Speech-based natural language processing (NLP) provides a scalable approach to identify early cognitive decline by detecting subtle linguistic markers that may precede clinical diagnosis.
Objective: This study aims to develop and evaluate a speech-based screening pipeline that integrates transformer-based embeddings with handcrafted linguistic features, incorporates synthetic augmentation using large language models (LLMs), and benchmarks unimodal and multimodal LLM classifiers. External validation was performed to assess generalizability to an MCI-only cohort.
Methods: Transcripts were obtained from the ADReSSo 2021 benchmark dataset (n = 237; derived from the Pitt Corpus, DementiaBank) and the DementiaBank Delaware corpus (n = 205; clinically diagnosed mild cognitive impairment [MCI] vs. controls). Audio was automatically transcribed using Amazon Web Services Transcribe (general model). Ten transformer models were evaluated under three fine-tuning strategies. A late-fusion model combined embeddings from the best-performing transformer with 110 linguistically derived features. Five LLMs (LLaMA-8B/70B, MedAlpaca-7B, Ministral-8B, GPT-4o) were fine-tuned to generate label-conditioned synthetic speech for data augmentation. Three multimodal LLMs (GPT-4o, Qwen-Omni, Phi-4) were tested in zero-shot and fine-tuned settings.
Results: On the ADReSSo dataset, the fusion model achieved an F1-score of 83.32 (AUC = 89.48), outperforming both transformer-only and linguistic-only baselines. Augmentation with MedAlpaca-7B synthetic speech improved performance to F1 = 85.65 at 2× scale, whereas higher augmentation volumes reduced gains. Fine-tuning improved unimodal LLM classifiers (e.g., MedAlpaca-7B, F1 = 47.73 → 78.69), while multimodal models demonstrated lower performance (Phi-4 = 71.59; GPT-4o omni = 67.57). On the Delaware corpus, the pipeline generalized to an MCI-only cohort, with the fusion model plus 1× MedAlpaca-7B augmentation achieving F1 = 72.82 (AUC = 69.57).
Conclusion: Integrating transformer embeddings with handcrafted linguistic features enhances ADRD detection from speech. Distributionally aligned LLM-generated narratives provide effective but bounded augmentation, while current multimodal models remain limited. Crucially, validation on the Delaware corpus demonstrates that the proposed pipeline generalizes to early-stage impairment, supporting its potential as a scalable approach for clinically relevant early screening. All code for LLMCARE is publicly available at: GitHub.
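The late-fusion idea in the abstract, sketched at a minimal level: concatenate a transformer sentence embedding with handcrafted linguistic features and train a simple classifier on top. The encoder name, the two stand-in linguistic features, and the logistic-regression head below are illustrative placeholders, not the authors' exact configuration:

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Placeholder encoder; the study compared ten transformer models
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> np.ndarray:
    """Mean-pooled last hidden state as a sentence embedding."""
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

def handcrafted(text: str) -> np.ndarray:
    """Stand-in for the 110 linguistic features (here: length and type-token ratio)."""
    tokens = text.split()
    return np.array([len(tokens), len(set(tokens)) / max(len(tokens), 1)])

def fuse(texts):
    """Late fusion: concatenate transformer and handcrafted feature vectors."""
    return np.vstack([np.concatenate([embed(t), handcrafted(t)]) for t in texts])

# texts, labels = ...  # transcripts and ADRD/control labels
# clf = LogisticRegression(max_iter=1000).fit(fuse(texts), labels)
```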
https://www.technavio.com/content/privacy-notice
Open-Source LLM Market Size 2025-2029
The open-source LLM market size is forecast to increase by USD 54 billion, at a CAGR of 33.7% from 2024 to 2029. Increasing democratization and compelling economics will drive the open-source LLM market.
Market Insights
North America dominated the market and is expected to account for 37% of growth during 2025-2029.
By Application - Technology and software segment was valued at USD 4.02 billion in 2023
By Deployment - On-premises segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 575.60 million
Market Future Opportunities 2024: USD 53,995.50 million
CAGR from 2024 to 2029: 33.7%
Market Summary
The Open-Source Large Language Model (LLM) market has experienced significant growth due to the increasing democratization of artificial intelligence (AI) technology and its compelling economics. This global trend is driven by the proliferation of smaller organizations seeking to leverage advanced language models for various applications, including supply chain optimization, compliance, and operational efficiency. Open-source LLMs offer several advantages over proprietary models. They provide greater flexibility, as users can modify and adapt the models to their specific needs. Additionally, open-source models often have larger training datasets, leading to improved performance and accuracy. However, there are challenges to implementing open-source LLMs, such as the prohibitive computational costs and critical hardware dependency. These obstacles necessitate the development of more efficient algorithms and the exploration of cloud computing solutions.
A real-world business scenario illustrates the potential benefits of open-source LLMs. A manufacturing company aims to optimize its supply chain by implementing an AI-powered system to analyze customer demand patterns and predict inventory needs. The company chooses an open-source LLM due to its flexibility and cost-effectiveness. By integrating the LLM into its supply chain management system, the company can improve forecasting accuracy and reduce inventory costs, ultimately increasing operational efficiency and customer satisfaction. Despite the challenges, the market continues to grow as organizations recognize the potential benefits of advanced language models. The democratization of AI technology and the compelling economics of open-source solutions make them an attractive option for businesses of all sizes.
What will be the size of the Open-Source LLM Market during the forecast period?
Get Key Insights on Market Forecast (PDF) Request Free Sample
The Open-Source Large Language Model (LLM) Market continues to evolve, offering businesses innovative solutions for various applications. One notable trend is the increasing adoption of explainable AI (XAI) methods in LLMs. XAI models provide transparency into the reasoning behind their outputs, addressing concerns around bias mitigation and interpretability. This transparency is crucial for industries with stringent compliance requirements, such as finance and healthcare. For instance, a recent study reveals that companies implementing XAI models have achieved a 25% increase in model acceptance rates among stakeholders, leading to more informed decisions. This improvement can significantly impact product strategy and budgeting, as businesses can confidently invest in AI solutions that align with their ethical and regulatory standards.
Moreover, advancements in LLM architecture include encoder-decoder architectures, multi-head attention, and self-attention layers, which enhance feature extraction and model scalability. These improvements contribute to better performance and more accurate results, making LLMs an essential tool for businesses seeking to optimize their operations and gain a competitive edge. In summary, the market is characterized by continuous innovation and a strong focus on delivering human-centric solutions. The adoption of explainable AI methods and advancements in neural network architecture are just a few examples of how businesses can benefit from these technologies. By investing in Open-Source LLMs, organizations can improve efficiency, enhance decision-making, and maintain a responsible approach to AI implementation.
Unpacking the Open-Source LLM Market Landscape
In the dynamic landscape of large language models (LLMs), open-source solutions have gained significant traction, offering businesses competitive advantages through data augmentation and few-shot learning capabilities. Compared to traditional models, open-source LLMs enable a 30% reduction in optimizer selection time and a 25% improvement in model accuracy for summarization tasks. Furthermore, distributed training and model compression techniques allow businesses to process larger training dataset sizes with minimal tokenization process disruptions.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The repository data corresponds partially to the manuscript titled "Can LLM-Augmented Autonomous Agents Cooperate? An Evaluation of Their Cooperative Capabilities through Melting Pot," submitted to IEEE Transactions on Artificial Intelligence. The dataset comprises experiments conducted with Large Language Model-Augmented Autonomous Agents (LAAs), as implemented in the ["Cooperative Agents" repository](https://github.com/Cooperative-IA/CooperativeGPT/tree/main), using substrates from the Melting Pot framework.
This dataset is divided into two main experiment categories:
Personality_experiments:
Comparison_baselines_experiments:
These scenarios evaluate different cooperative and competitive behaviors among agents and are used to compare decision-making architectures of LAAs against reinforcement learning (RL) baselines. Unlike the Personality_experiments, these comparisons do not involve bots but exclusively analyze RL and LAA architectures.
The metrics and indicators extracted from the experiments depend on the scenario being evaluated:
Commons Harvest Open:
Externally Mushrooms:
Coins:
The Comparison_baselines_experiments aim to:
These experiments help evaluate the robustness of LAAs in scenarios with varying complexity and social dilemmas, providing insights into their potential applications in real-world cooperative systems.
In each simulation:
Participants:
Action Dynamics:
Metrics and Indicators:
This repository enables reproducibility and serves as a benchmark for advancing research into cooperative and competitive behaviors in LLM-based agents.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
As the use of prompts for diverse tasks becomes more prevalent, concerns arise regarding the security of information shared between models and users, since LLMs are vulnerable to receiving potentially harmful prompts with malicious intent from users.
Vulnerabilities associated with prompt engineering can range from bias and inappropriate responses to cybersecurity issues, raising fundamental questions about the ethics, transparency, and accountability that surround the use of these advanced technologies.
As one of the main current vulnerabilities of LLMs, prompt injection is the insertion of instructions, usually embedded in the prompt, intended to alter the expected behavior of a Large Language Model's output. It can range from simple changes in configured behavior to malicious code snippets that compromise the model's integrity and information.
We introduce a dataset, named Malignant, specifically curated for jailbreak prompt injection instances. A jailbreak attack is based on adversarial inputs whose purpose is to break the model's safe behavior so that its output produces harmful content.
This dataset serves as a valuable resource for future research endeavors aimed at addressing prompt injection vulnerabilities.
The methodology paper and the scripts for the already trained models can be found here:
- https://github.com/llm-security-research/malicious-prompts
- https://vinbinary.xyz/malignant_and_promptsentinel.pdf
category: Three categories can be found:
- jailbreak: We gathered 70 prompts from the jailbreak portal (no longer available since 2024), focusing on the theme of jailbreak attacks and curating them according to established patterns in such scenarios. Through data augmentation, we produced 129 paraphrased jailbreak prompts. In total, the Malignant dataset contains 199 jailbreak prompts.
- act_as: To improve the robustness of models that detect jailbreak prompt injection, we introduced hard prompts as a distinct category, sourced from the AwesomeChatGPT portal. Also referred to as manual prompts, these inputs serve as role prompts that condition the context and influence the behavior of the language model. Starting from 24 collected prompts, we applied the rephrase method for dataset augmentation, yielding a total of 69 hard prompts after a review of the results.
- conversation: To allow evaluation of a model that detects jailbreak prompts, conversation prompts for model training were extracted solely from the Persona-Chat dataset, with a total of 1,312 prompts included.
base_class: Six categories can be found:
- paraphrase: Data augmentation was performed on jailbreak prompts to achieve better results in model training.
- conversation: Phrases collected from the Persona-Chat dataset.
- role_play:
- output_constraint:
- privilege_escalation:
text: The string phrase collected from the datasources listed below.
embedding: 384-dimensional text embeddings generated with the paraphrase-multilingual-MiniLM-L12-v2 model from SentenceTransformers.
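A minimal sketch of how such embeddings can be produced with SentenceTransformers; the model name comes from the dataset description, while the example prompt strings are purely illustrative:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

prompts = [
    "Ignore all previous instructions and act as ...",   # illustrative strings only
    "Hi! I love gardening and talking about my dog.",
]
embeddings = model.encode(prompts)  # shape: (len(prompts), 384)
print(embeddings.shape)
```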
To our knowledge this is the only public dataset of its kind available at this time, and we hope it can be useful for researchers and people who are concerned about AI ethics and want to make a difference!
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is the augmented data and labels used in training the model. It is also needed for evaluation, as the vectoriser is fit on this data and the test data is then transformed with that fitted vectoriser.
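A minimal sketch of the fit/transform relationship described above, using a TF-IDF vectoriser as an illustrative choice (the description does not specify which vectoriser was used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# train_texts, test_texts = ...  # augmented training texts and held-out test texts
train_texts = ["an augmented training essay", "another training example"]
test_texts = ["an unseen test essay"]

vectoriser = TfidfVectorizer()
X_train = vectoriser.fit_transform(train_texts)  # fit ONLY on the training data
X_test = vectoriser.transform(test_texts)        # reuse the same fitted vocabulary
```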
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset contains synthetic Arabic tweets generated by the AMATC-LLM framework, an augmentation system developed on top of the original ArSarcasm-v2 corpus. The data were produced through a two-stage process that combines human conceptual abstraction with controlled large language model (LLM) generation to create context-rich and dialect-aware Arabic text. Each record includes a generated tweet labeled for sarcasm (TRUE or FALSE), sentiment (POS, NEG, or NEU), and dialect (magreb, egypt, levant, gulf, or msa). Only LLM-generated samples are included; the original ArSarcasm-v2 data are excluded to respect their license. This resource supports research in Arabic multi-task learning, sarcasm detection, sentiment analysis, and dialect identification, with a focus on low-resource and multi-dialect Arabic NLP. Original dataset before adding our LLM augmentation data is available at: https://github.com/iabufarha/ArSarcasm-v2
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
📚 Augmented LLM-Generated NER Dataset for Scholarly Text
🧠 Dataset Summary
This dataset contains synthetically generated academic text tailored for Named Entity Recognition (NER) in the software engineering domain. The synthetic data augments scholarly writing using large language models (LLMs), with entity consistency maintained via token preservation. The dataset is generated by merging and rephrasing pairs of annotated sentences from scholarly papers using… See the full description on the dataset page: https://huggingface.co/datasets/psresearch/augmented_dataset_llm_generated_NER.
https://www.technavio.com/content/privacy-notice
Multilingual LLM Market Size 2025-2029
The multilingual LLM market size is forecast to increase by USD 10.69 billion, at a CAGR of 31% from 2024 to 2029. Increasing globalization and the imperative for seamless cross-border communication will drive the multilingual LLM market.
Market Insights
North America dominated the market and is expected to account for 32% of growth during 2025-2029.
By Deployment - On-premises segment was valued at USD 933.40 billion in 2023
By Application - Content generation and curation segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 822.91 million
Market Future Opportunities 2024: USD 10,691.90 million
CAGR from 2024 to 2029: 31%
Market Summary
The Multilingual LLM (Large Language Model) market is experiencing significant growth due to the increasing globalization and imperative for seamless cross-border communication. As businesses expand internationally, the need for multilingual capabilities becomes crucial. This trend is further accentuated by the shift from text-centric to multimodal capabilities, as organizations seek to engage with customers in a more interactive and inclusive manner. However, the market also presents unique challenges. Data scarcity and quality for low-resource languages remain major hurdles, limiting the effectiveness of language models in these regions. To address this issue, there is a growing focus on collaborative efforts to build and improve multilingual datasets, as well as advancements in transfer learning and multilingual models.
Key technologies such as edge computing, augmented reality, and virtual reality are also contributing to the market's expansion. For instance, a global manufacturing company may rely on multilingual LLMs to optimize its supply chain by accurately processing and analyzing data from various sources in different languages. This can lead to improved operational efficiency, reduced errors, and enhanced customer satisfaction. Despite these benefits, the market faces ongoing challenges, including the need for continuous model improvement, data privacy concerns, and the ethical implications of language models in diverse cultural contexts.
What will be the size of the Multilingual LLM Market during the forecast period?
Get Key Insights on Market Forecast (PDF) Request Free Sample
The Multilingual LLM (Large Language Model) Market is an ever-evolving landscape, driven by the increasing demand for cross-lingual communication and understanding in businesses worldwide. A recent study reveals that over 70% of multinational corporations operate in more than one language, underscoring the necessity for advanced language models to facilitate seamless communication and streamline operations. Multilingual LLMs employ advanced techniques such as model scaling, syntactic parsing, and named entity recognition to understand and generate human language in various languages. These models have proven instrumental in various business applications, including customer support, content localization, and compliance with international regulations.
Moreover, the integration of multilingual LLMs into business processes has led to significant improvements in efficiency. For instance, companies have reported a 25% reduction in response time to customer queries in non-English languages, leading to enhanced customer satisfaction and loyalty. The continuous advancements in multilingual LLMs, including improvements in model performance benchmarks, ethical considerations, and responsible AI, ensure that businesses can effectively communicate and collaborate across linguistic and cultural boundaries. As the global business landscape becomes increasingly interconnected, the demand for multilingual LLMs is poised to grow, making it a strategic investment for companies seeking to expand their reach and foster international partnerships.
Unpacking the Multilingual LLM Market Landscape
In the dynamic business landscape, the multilingual Large Language Model (LLM) market continues to gain significance. Neural machine translation, fueled by advanced tokenization techniques and contextual understanding, delivers translation accuracy improvements of up to 20% compared to rule-based systems. Furthermore, language models employing interpretable AI and semantic analysis enhance cross-lingual transfer by 30%, aligning with business compliance requirements. Parallel text processing and domain adaptation techniques optimize machine translation quality, resulting in cost savings of up to 15% in localization projects. The integration of coherence assessment, text generation models, and model explainability via Meteor score metrics and deep learning architectures further boosts efficiency. Attention mechanisms, few-shot learning, and zero-shot learning enable seamless handling of diverse language data.
https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4605
Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle, e.g., services not yet existent at design time. A traditional approach is a registry that provides the API documentation of the systems’ endpoints. Large Language Models (LLMs) have shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. Within this work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves specification details on demand. We evaluate RAG for endpoint discovery using the RestBench benchmark, first, for the different chunking possibilities and parameters measuring the endpoint retrieval recall, precision, and F1 score. Then, we assess the Discovery Agent using the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce token count. While revealing high values for recall, precision, and F1, further research is necessary to retrieve all requisite endpoints. Our experiments show that for preprocessing, LLM-based and format-specific approaches outperform naïve chunking methods. Relying on an agent further enhances these results as the agent splits the tasks into multiple fine granular subtasks, improving the overall RAG performance in the token count, precision, and F1 score.
Content:
code.zip: Python source code to perform the experiments.
- evaluate.py: Script to execute the experiments (uncomment lines to select the embedding model).
- socrag/*: Source code for the RAG.
- benchmark/*: RestBench specification.
results.zip: Results of the RAG experiments (in the folder /results/data/ inside the zip file).
- Experiment results for the RAG: results_{embedding_model}_{top-k}.json.
- Experiment results for the Discovery Agent: results_{embedding_model}_{agent}_{refinement}_{llm}.json.
- FAISS store (intermediate data required for exact reproduction of results; one folder for each embedding model): bge_small, nvidia and oai.
- Intermediate data of the LLM-based refinement methods required for the exact reproduction of results: *_parser.json.
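A minimal sketch of endpoint-level chunking of an OpenAPI document, one of the preprocessing strategies discussed above; the chunk format and file name are illustrative, and the resulting chunks would then be embedded and indexed (e.g., in a FAISS store) for retrieval:

```python
import json

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}

def chunk_openapi(spec: dict) -> list[dict]:
    """Produce one retrieval chunk per (path, HTTP method) endpoint."""
    chunks = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            params = [p.get("name", "") for p in op.get("parameters", [])]
            text = (
                f"{method.upper()} {path}\n"
                f"summary: {op.get('summary', '')}\n"
                f"description: {op.get('description', '')}\n"
                f"parameters: {', '.join(params)}"
            )
            chunks.append({"endpoint": f"{method.upper()} {path}", "text": text})
    return chunks

with open("openapi.json") as f:  # illustrative file name
    chunks = chunk_openapi(json.load(f))
```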
Comprehensive training data on 1M+ stores across the US & Canada. Includes detailed menus, inventory, pricing, and availability. Ideal for AI/ML models, powering retrieval-augmented generation, search, and personalization systems.
This is a dataset with LLM-generated data to augment training data for OVOS intent classifiers. New sentences may be added over time, mostly focused on the intents with few samples or that the intent classifier model is having trouble learning.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
LAURA: Enhancing Code Review Generation with Context-Enriched Retrieval-Augmented LLM
Introduction
LAURA is an LLM-based retrieval-augmented, context-aware framework for code review generation, which integrates context augmentation, review exemplar retrieval, and prompt tuning to enhance the performance of LLMs (in our study, ChatGPT-4o and DeepSeek v3) in generating code review comments. The experiments show that LAURA outperforms the direct application of ChatGPT-4o and DeepSeek v3 for code review generation and significantly surpasses the performance of the pre-trained model CodeReviewer. Since our experiments are based on ChatGPT-4o and DeepSeek v3, we have released the data processing code and dataset used in our research. The code section includes the Python scripts we used for data collection, cleaning, merging, and retrieval. The dataset section contains 301k entries from 1,807 high-quality projects sourced from GitHub, covering four programming languages: C, C++, Java, and Python. We also provide the time-split dataset used as the retrieval database (which is also used for fine-tuning CodeReviewer) and the human-annotated evaluation dataset.
File Structure
codes: Data collection, filtering and post-processing code used in our study
- data_collection_and_filtering.py: Code for collecting data via the GitHub GraphQL API and filtering with rule-based and LLM-based methods
- data_embedding.py: Code for data embedding
- data_merging.py: Code for data merging, used to merge the review comments with the same target diff
- data_retrieval.py: Code for data retrieval
- diff_extension.py: Code for extending the code diffs by integrating the full code contexts into the diffs
datasets: Datasets built and used in our study
- database_for_retrieve.csv: The dataset we built for retrieval-augmented generation, containing 298,494 entries prior to December 26, 2024
- evaluation_data.csv: The evaluation dataset we manually annotated, containing 384 entries later than December 26, 2024
- full_dataset.csv: The full dataset we collected, containing 301,256 entries
prompts: The prompts used in data filtering, generation and evaluation
- direct_generation.txt: The prompt we used for direct generation as baselines
- LAURA_generation.txt: The prompt we used for LAURA generation
- LLM_evaluation.txt: The prompt we used for LLM evaluation
- LLM_filtering.txt: The prompt we used for LLM filtering in the data filtering process
README.md: Description of our submission
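A minimal sketch of the exemplar-retrieval step at the core of such a pipeline: given precomputed embeddings for the database entries and an embedding of the target diff, select the nearest review exemplars by cosine similarity and assemble them into prompt context. The embedding files and the review_comment column name are illustrative placeholders, not the released scripts' exact interface:

```python
import numpy as np
import pandas as pd

db = pd.read_csv("database_for_retrieve.csv")      # released retrieval database
db_embeddings = np.load("db_embeddings.npy")       # hypothetical precomputed entry embeddings
query_embedding = np.load("query_embedding.npy")   # hypothetical embedding of the target diff

# Cosine similarity between the query and every database entry
norms = np.linalg.norm(db_embeddings, axis=1) * np.linalg.norm(query_embedding)
scores = db_embeddings @ query_embedding / np.clip(norms, 1e-12, None)

top_k = np.argsort(scores)[::-1][:3]
exemplars = db.iloc[top_k]                         # nearest review exemplars
prompt_context = "\n\n".join(exemplars["review_comment"].astype(str))  # hypothetical column name
```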
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset demonstrates how to fuse Large Language Model (LLM) generated embeddings with Graph Neural Networks (GNNs) for learning on tabular graph data.
sample_nodes.csv – Node features including ID, category, and description text
sample_edges.csv – Edge list (source, target, weight)
sample_augmented_nodes.csv – Node features + LLM-generated embeddings (simulated)
GNN_LLM_Hybrid_Baseline.ipynb – Main baseline model using PyTorch Geometric
CSV_Processing_1.ipynb – Basic loading and EDA of nodes/edges
CSV_Processing_2.ipynb – Preview of LLM-augmented node features
This is a synthetic dataset. For real-world use:
- Replace the "LLM embeddings" with outputs from OpenAI / Mistral / HuggingFace models
- Extend the node descriptions with actual context or domain-specific text
- Scale to real-world graphs or use with competition tabular datasets
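A minimal sketch of assembling these CSVs into a PyTorch Geometric graph. The edge columns (source, target, weight) come from the file list above; the emb_ column prefix and the assumption of integer node IDs are placeholders that may differ from the actual files:

```python
import pandas as pd
import torch
from torch_geometric.data import Data

nodes = pd.read_csv("sample_augmented_nodes.csv")
edges = pd.read_csv("sample_edges.csv")

# LLM-embedding columns, assumed here to be prefixed with "emb_"
emb_cols = [c for c in nodes.columns if c.startswith("emb_")]
x = torch.tensor(nodes[emb_cols].values, dtype=torch.float)

# Assumes node IDs are consecutive integers 0..N-1 matching the row order
edge_index = torch.tensor(edges[["source", "target"]].values.T, dtype=torch.long)
edge_weight = torch.tensor(edges["weight"].values, dtype=torch.float)

graph = Data(x=x, edge_index=edge_index, edge_weight=edge_weight)
print(graph)
```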
50,000 Sets - Image Editing Data. The editing types include human attribute editing, image semantic editing, and image structure editing. The editing targets cover scenes such as people, animals, goods, plants, and landscapes. For annotation, the targets that need to be edited in each image are edited according to the editing instructions. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset is derived from these datasets: https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset and https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text/
Artificial errors are introduced into the above dataset in order to mimic the test data of the competition - "LLM - Detect AI Generated Text"
The notebook used for creating this dataset - https://www.kaggle.com/code/murugesann/daigt-v2-train-dataset-with-typos
USE ONLY VERSION 1 - IT HAS TYPOS INTRODUCED - IGNORE VERSIONS 2 AND 3
VERSIONS 2 & 3 ARE CORRECTED DATASETS (typos are corrected instead of being introduced - due to a mistake!! :-)
~~You can pin to different versions as follows:
version 1 - daigt-v2-train-dataset - entire dataset is introduced with typos - can be used for training with typos
version 2 - augmented-data-for-llm-detect-ai-generated-text is sampled with 9000 essays of equal label weight and introduced typos - can be used for CV testing
version 3 - daigt-v2-train-dataset is split into train and test - only test is introduced with typos - can be used for CV testing~~
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
CodeLLMExp is a comprehensive, large-scale, multi-language, and multi-vulnerability dataset created to advance research into the security of AI-generated code. It is specifically designed to train and evaluate machine learning models, such as Large Language Models (LLMs), on the joint tasks of Automated Vulnerability Localization (AVL) and Explainable AI (XAI).
The dataset was constructed through a rigorous pipeline that involved sourcing prompts from established security benchmarks (CodeLMSec, SecurityEval, Copilot CWE Scenarios), employing seed augmentation to ensure coverage of under-represented Common Weakness Enumerations (CWEs), and using a chain of LLMs to generate vulnerable code snippets. This raw data was then automatically evaluated for quality by an "LLM-as-judge" (validated against human experts with a Spearman correlation of 0.8545) and enriched with structured annotations.
CodeLLMExp covers three of the most widely used programming languages: Python, Java, and C. It contains 10,400 high-quality examples across Python (44.3%), Java (29.6%), and C (26.1%). It focuses on 29 distinct CWEs, including the complete CWE Top 25 Most Dangerous Software Errors (2024). Each record in the dataset provides a vulnerable code snippet, the precise line number of the flaw, a structured explanation (root cause, impact, mitigation), and a fixed version of the code.
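A minimal sketch of the per-record structure described above, with field names chosen for illustration (the released files may use different names):

```python
from dataclasses import dataclass

@dataclass
class CodeLLMExpRecord:
    language: str         # "python", "java", or "c"
    cwe_id: str           # e.g., "CWE-79"
    code: str             # vulnerable code snippet
    vulnerable_line: int  # line number of the flaw
    root_cause: str       # structured explanation: why the code is vulnerable
    impact: str           # structured explanation: what the flaw allows
    mitigation: str       # structured explanation: how to address it
    fixed_code: str       # repaired version of the snippet
```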
By providing richly annotated data for detection, classification, localization, and explanation, CodeLLMExp enables the development of more robust and transparent security analysis tools. It facilitates research into LLM adaptation strategies (e.g., prompting, fine-tuning, Retrieval-Augmented Generation), automated program repair, and the inherent security patterns of code produced by AI.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
IFC data has become the general building information standard for collaborative work in the construction industry. However, IFC data can be very complicated because it allows for multiple ways to represent the same product information. In this research, we utilise the capabilities of LLMs to parse the IFC data with the Graph Retrieval-Augmented Generation (Graph-RAG) technique to retrieve building object properties and their relations. We will show that, despite limitations due to the complex hierarchy of the IFC data, the Graph-RAG parsing enhances generative LLMs like GPT-4o with graph-based knowledge, enabling natural language query-response retrieval without the need for a complex pipeline.
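A minimal sketch of the graph-construction and retrieval side of such a Graph-RAG setup, using ifcopenshell and networkx; the choice of relations, the file name, and the way the retrieved neighbourhood is serialised as LLM prompt context are simplified placeholders, not the authors' pipeline:

```python
import ifcopenshell
import networkx as nx

model = ifcopenshell.open("building.ifc")  # illustrative file name
graph = nx.Graph()

# Nodes: building products; edges: spatial containment relations
for product in model.by_type("IfcProduct"):
    graph.add_node(product.GlobalId, name=product.Name, type=product.is_a())

for rel in model.by_type("IfcRelContainedInSpatialStructure"):
    for element in rel.RelatedElements:
        graph.add_edge(rel.RelatingStructure.GlobalId, element.GlobalId)

def context_for(global_id: str) -> str:
    """Serialise a node and its neighbours as plain-text context for an LLM prompt."""
    lines = [
        f"{graph.nodes[n].get('type')}: {graph.nodes[n].get('name')}"
        for n in [global_id, *graph.neighbors(global_id)]
    ]
    return "\n".join(lines)
```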