MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset takes the original data from the following contributions:
https://www.kaggle.com/datasets/radek1/llm-generated-essays
https://www.kaggle.com/datasets/alejopaullier/argugpt
https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b
https://www.kaggle.com/datasets/thedrcat/daigt-proper-train-dataset
https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset
https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic
https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
https://www.kaggle.com/datasets/nbroad/persaude-corpus-2
Some of those datasets already include the others, so I first removed duplicates by comparing full texts. After that the data augmentation took place, a two-step process that was iterated over and over: first I corrected typos in the texts using language_tool_python, then I introduced noise the way the organizers appear to have done it (see https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/452279), then I corrected typos again, and so on. After repeating these steps a couple of times I removed duplicates once more, again comparing by full text.
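A minimal sketch of one correct-then-noise iteration of this process, assuming language_tool_python's correct() call and a made-up character-level noise function as a stand-in for the organizers' exact procedure:

```python
import random
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def add_noise(text: str, rate: float = 0.01) -> str:
    """Hypothetical character-level noise: randomly drop or swap adjacent characters."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if random.random() < rate and i + 1 < len(chars):
            if random.random() < 0.5:
                i += 1  # drop this character
                continue
            chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap with the next one
        out.append(chars[i])
        i += 1
    return "".join(out)

def augment(text: str, iterations: int = 2) -> str:
    """One possible reading of the correct -> noise -> correct loop described above."""
    for _ in range(iterations):
        text = tool.correct(text)  # fix typos with LanguageTool
        text = add_noise(text)     # re-introduce noise
    return tool.correct(text)      # final typo-correction pass
```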
The result is this dataset. It is split into train and test because I wanted to prevent information leaking between the two sections, so I performed the steps independently on each of them (I split them before doing the data augmentation). If you don't care about train and test, you can just concatenate both into a single dataset for training purposes.
If you find this dataset helpful, please upvote.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains 1,165 rows, each corresponding to a respondent (including LLM-generated respondents) in our study. It contains 21 columns. The first column, "Generating Model," specifies the model or source (e.g., "Human") that generated the responses. The remaining 20 columns (Q1 to Q20) indicate the correctness of answers to 20 college algebra questions for each respondent. "TRUE" means the respondent answered correctly, "FALSE" indicates an incorrect answer, and N/A represents missing data (i.e., no response). The dataset includes responses from seven different generating models:
Human: 265 responses
GPT-4: 150 responses
GPT-3.5: 150 responses
Llama 3: 150 responses
Llama 2: 150 responses
Gemini: 150 responses
Cohere: 150 responses
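For illustration, a minimal pandas sketch that computes per-model accuracy from this layout; the file name responses.csv is a placeholder, and missing (N/A) answers are simply skipped:

```python
import pandas as pd

# Hypothetical file name; columns: "Generating Model", then Q1..Q20 with TRUE/FALSE/N/A
df = pd.read_csv("responses.csv")

question_cols = [f"Q{i}" for i in range(1, 21)]

# Map TRUE/FALSE (whether parsed as strings or booleans) to 1.0/0.0; anything else becomes NaN
answers = df[question_cols].apply(
    lambda col: col.map({True: 1.0, False: 0.0, "TRUE": 1.0, "FALSE": 0.0})
)

# Mean over answered questions per respondent, then aggregated per generating model
df["accuracy"] = answers.mean(axis=1)
print(df.groupby("Generating Model")["accuracy"].agg(["mean", "count"]))
```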
This dataset was created by JISU KIM8873
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Background: Alzheimer’s disease and related dementias (ADRD) affect nearly five million older adults in the United States, yet more than half remain undiagnosed. Speech-based natural language processing (NLP) provides a scalable approach to identify early cognitive decline by detecting subtle linguistic markers that may precede clinical diagnosis.
Objective: This study aims to develop and evaluate a speech-based screening pipeline that integrates transformer-based embeddings with handcrafted linguistic features, incorporates synthetic augmentation using large language models (LLMs), and benchmarks unimodal and multimodal LLM classifiers. External validation was performed to assess generalizability to an MCI-only cohort.
Methods: Transcripts were obtained from the ADReSSo 2021 benchmark dataset (n = 237; derived from the Pitt Corpus, DementiaBank) and the DementiaBank Delaware corpus (n = 205; clinically diagnosed mild cognitive impairment [MCI] vs. controls). Audio was automatically transcribed using Amazon Web Services Transcribe (general model). Ten transformer models were evaluated under three fine-tuning strategies. A late-fusion model combined embeddings from the best-performing transformer with 110 linguistically derived features. Five LLMs (LLaMA-8B/70B, MedAlpaca-7B, Ministral-8B, GPT-4o) were fine-tuned to generate label-conditioned synthetic speech for data augmentation. Three multimodal LLMs (GPT-4o, Qwen-Omni, Phi-4) were tested in zero-shot and fine-tuned settings.
Results: On the ADReSSo dataset, the fusion model achieved an F1-score of 83.32 (AUC = 89.48), outperforming both transformer-only and linguistic-only baselines. Augmentation with MedAlpaca-7B synthetic speech improved performance to F1 = 85.65 at 2× scale, whereas higher augmentation volumes reduced gains. Fine-tuning improved unimodal LLM classifiers (e.g., MedAlpaca-7B, F1 = 47.73 → 78.69), while multimodal models demonstrated lower performance (Phi-4 = 71.59; GPT-4o omni = 67.57). On the Delaware corpus, the pipeline generalized to an MCI-only cohort, with the fusion model plus 1× MedAlpaca-7B augmentation achieving F1 = 72.82 (AUC = 69.57).
Conclusion: Integrating transformer embeddings with handcrafted linguistic features enhances ADRD detection from speech. Distributionally aligned LLM-generated narratives provide effective but bounded augmentation, while current multimodal models remain limited. Crucially, validation on the Delaware corpus demonstrates that the proposed pipeline generalizes to early-stage impairment, supporting its potential as a scalable approach for clinically relevant early screening. All code for LLMCARE is publicly available at: GitHub.
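The late-fusion idea in the abstract, sketched at a minimal level: concatenate a transformer sentence embedding with handcrafted linguistic features and train a simple classifier on top. The encoder name, the two stand-in linguistic features, and the logistic-regression head below are illustrative placeholders, not the authors' exact configuration:

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Placeholder encoder; the study compared ten transformer models
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> np.ndarray:
    """Mean-pooled last hidden state as a sentence embedding."""
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

def handcrafted(text: str) -> np.ndarray:
    """Stand-in for the 110 linguistic features (here: length and type-token ratio)."""
    tokens = text.split()
    return np.array([len(tokens), len(set(tokens)) / max(len(tokens), 1)])

def fuse(texts):
    """Late fusion: concatenate transformer and handcrafted feature vectors."""
    return np.vstack([np.concatenate([embed(t), handcrafted(t)]) for t in texts])

# texts, labels = ...  # transcripts and ADRD/control labels
# clf = LogisticRegression(max_iter=1000).fit(fuse(texts), labels)
```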
https://www.technavio.com/content/privacy-notice
Open-Source LLM Market Size 2025-2029
The open-source LLM market size is forecast to increase by USD 54 billion, at a CAGR of 33.7% from 2024 to 2029. Increasing democratization and compelling economics will drive the open-source LLM market.
Market Insights
North America dominated the market and is expected to account for 37% of growth during 2025-2029.
By Application - Technology and software segment was valued at USD 4.02 billion in 2023
By Deployment - On-premises segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 575.60 million
Market Future Opportunities 2024: USD 53,995.50 million
CAGR from 2024 to 2029: 33.7%
Market Summary
The Open-Source Large Language Model (LLM) market has experienced significant growth due to the increasing democratization of artificial intelligence (AI) technology and its compelling economics. This global trend is driven by the proliferation of smaller organizations seeking to leverage advanced language models for various applications, including supply chain optimization, compliance, and operational efficiency. Open-source LLMs offer several advantages over proprietary models. They provide greater flexibility, as users can modify and adapt the models to their specific needs. Additionally, open-source models often have larger training datasets, leading to improved performance and accuracy. However, there are challenges to implementing open-source LLMs, such as the prohibitive computational costs and critical hardware dependency. These obstacles necessitate the development of more efficient algorithms and the exploration of cloud computing solutions.
A real-world business scenario illustrates the potential benefits of open-source LLMs. A manufacturing company aims to optimize its supply chain by implementing an AI-powered system to analyze customer demand patterns and predict inventory needs. The company chooses an open-source LLM due to its flexibility and cost-effectiveness. By integrating the LLM into its supply chain management system, the company can improve forecasting accuracy and reduce inventory costs, ultimately increasing operational efficiency and customer satisfaction. Despite the challenges, the market continues to grow as organizations recognize the potential benefits of advanced language models. The democratization of AI technology and the compelling economics of open-source solutions make them an attractive option for businesses of all sizes.
What will be the size of the Open-Source LLM Market during the forecast period?
Get Key Insights on Market Forecast (PDF) Request Free Sample
The Open-Source Large Language Model (LLM) Market continues to evolve, offering businesses innovative solutions for various applications. One notable trend is the increasing adoption of explainable AI (XAI) methods in LLMs. XAI models provide transparency into the reasoning behind their outputs, addressing concerns around bias mitigation and interpretability. This transparency is crucial for industries with stringent compliance requirements, such as finance and healthcare. For instance, a recent study reveals that companies implementing XAI models have achieved a 25% increase in model acceptance rates among stakeholders, leading to more informed decisions. This improvement can significantly impact product strategy and budgeting, as businesses can confidently invest in AI solutions that align with their ethical and regulatory standards.
Moreover, advancements in LLM architecture include encoder-decoder architectures, multi-head attention, and self-attention layers, which enhance feature extraction and model scalability. These improvements contribute to better performance and more accurate results, making LLMs an essential tool for businesses seeking to optimize their operations and gain a competitive edge. In summary, the market is characterized by continuous innovation and a strong focus on delivering human-centric solutions. The adoption of explainable AI methods and advancements in neural network architecture are just a few examples of how businesses can benefit from these technologies. By investing in Open-Source LLMs, organizations can improve efficiency, enhance decision-making, and maintain a responsible approach to AI implementation.
Unpacking the Open-Source LLM Market Landscape
In the dynamic landscape of large language models (LLMs), open-source solutions have gained significant traction, offering businesses competitive advantages through data augmentation and few-shot learning capabilities. Compared to traditional models, open-source LLMs enable a 30% reduction in optimizer selection time and a 25% improvement in model accuracy for summarization tasks. Furthermore, distributed training and model compression techniques allow businesses to process larger training dataset sizes with minimal tokenization process disruptions.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The repository data corresponds partially to the manuscript titled "Can LLM-Augmented Autonomous Agents Cooperate? An Evaluation of Their Cooperative Capabilities through Melting Pot," submitted to IEEE Transactions on Artificial Intelligence. The dataset comprises experiments conducted with Large Language Model-Augmented Autonomous Agents (LAAs), as implemented in the ["Cooperative Agents" repository](https://github.com/Cooperative-IA/CooperativeGPT/tree/main), using substrates from the Melting Pot framework.
This dataset is divided into two main experiment categories:
Personality_experiments:
Comparison_baselines_experiments:
These scenarios evaluate different cooperative and competitive behaviors among agents and are used to compare decision-making architectures of LAAs against reinforcement learning (RL) baselines. Unlike the Personality_experiments, these comparisons do not involve bots but exclusively analyze RL and LAA architectures.
The metrics and indicators extracted from the experiments depend on the scenario being evaluated:
Commons Harvest Open:
Externally Mushrooms:
Coins:
The Comparison_baselines_experiments aim to:
These experiments help evaluate the robustness of LAAs in scenarios with varying complexity and social dilemmas, providing insights into their potential applications in real-world cooperative systems.
In each simulation:
Participants:
Action Dynamics:
Metrics and Indicators:
This repository enables reproducibility and serves as a benchmark for advancing research into cooperative and competitive behaviors in LLM-based agents.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
As the use of prompts for diverse tasks becomes more prevalent, concerns arise regarding the security of information shared between models and users, since LLMs are vulnerable to receiving potentially harmful prompts with malicious intent from users.
Vulnerabilities associated with prompt engineering can range from bias and inappropriate responses to cybersecurity issues, raising fundamental questions about the ethics, transparency, and accountability that surround the use of these advanced technologies.
As one of the main current vulnerabilities of LLMs, prompt injection is the insertion of instructions, usually embedded in the prompt, intended to alter the expected behavior of a Large Language Model's output. It can range from simple changes in configured behavior to malicious code snippets that compromise the model's integrity and information.
We introduce a dataset, named Malignant, specifically curated for jailbreak prompt injection instances. A jailbreak attack is based on adversarial inputs whose purpose is to break the model's safe behavior so that its output produces harmful content.
This dataset serves as a valuable resource for future research endeavors aimed at addressing prompt injection vulnerabilities.
The methodology paper and the scripts for the already trained models can be found here:
- https://github.com/llm-security-research/malicious-prompts
- https://vinbinary.xyz/malignant_and_promptsentinel.pdf
category: Three categories can be found:
- jailbreak: We gathered 70 prompts from the jailbreak portal (no longer available since 2024), focusing on the theme of jailbreak attacks and curating them according to established patterns in such scenarios. Through data augmentation, we produced 129 paraphrased jailbreak prompts. In total, the Malignant dataset contains 199 jailbreak prompts.
- act_as: To improve the robustness of models that detect jailbreak prompt injection, we introduced hard prompts as a distinct category, sourced from the AwesomeChatGPT portal. Also referred to as manual prompts, these inputs serve as role prompts that condition the context and influence the behavior of the language model. Starting from 24 collected prompts, we applied the rephrase method for dataset augmentation, yielding a total of 69 hard prompts after a review of the results.
- conversation: To allow evaluation of a model that detects jailbreak prompts, conversation prompts for model training were extracted solely from the Persona-Chat dataset, with a total of 1,312 prompts included.
base_class: Six categories can be found:
- paraphrase: Data augmentation was performed on jailbreak prompts to achieve better results in model training.
- conversation: Phrases collected from the Persona-Chat dataset.
- role_play:
- output_constraint:
- privilege_escalation:
text: The string phrase collected from the datasources listed below.
embedding: 384-dimensional text embeddings generated with the paraphrase-multilingual-MiniLM-L12-v2 model from SentenceTransformers.
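A minimal sketch of how such embeddings can be produced with SentenceTransformers; the model name comes from the dataset description, while the example prompt strings are purely illustrative:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

prompts = [
    "Ignore all previous instructions and act as ...",   # illustrative strings only
    "Hi! I love gardening and talking about my dog.",
]
embeddings = model.encode(prompts)  # shape: (len(prompts), 384)
print(embeddings.shape)
```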
To our knowledge this is the only public dataset of its kind available at this time, and we hope it can be useful for researchers and people who are concerned about AI ethics and want to make a difference!
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is the augmented data and labels used in training the model. It is also needed for evaluation, as the vectoriser is fit on this data and the test data is then transformed with that fitted vectoriser.
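A minimal sketch of the fit/transform relationship described above, using a TF-IDF vectoriser as an illustrative choice (the description does not specify which vectoriser was used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# train_texts, test_texts = ...  # augmented training texts and held-out test texts
train_texts = ["an augmented training essay", "another training example"]
test_texts = ["an unseen test essay"]

vectoriser = TfidfVectorizer()
X_train = vectoriser.fit_transform(train_texts)  # fit ONLY on the training data
X_test = vectoriser.transform(test_texts)        # reuse the same fitted vocabulary
```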
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset contains synthetic Arabic tweets generated by the AMATC-LLM framework, an augmentation system developed on top of the original ArSarcasm-v2 corpus. The data were produced through a two-stage process that combines human conceptual abstraction with controlled large language model (LLM) generation to create context-rich and dialect-aware Arabic text. Each record includes a generated tweet labeled for sarcasm (TRUE or FALSE), sentiment (POS, NEG, or NEU), and dialect (magreb, egypt, levant, gulf, or msa). Only LLM-generated samples are included; the original ArSarcasm-v2 data are excluded to respect their license. This resource supports research in Arabic multi-task learning, sarcasm detection, sentiment analysis, and dialect identification, with a focus on low-resource and multi-dialect Arabic NLP. Original dataset before adding our LLM augmentation data is available at: https://github.com/iabufarha/ArSarcasm-v2
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
📚 Augmented LLM-Generated NER Dataset for Scholarly Text
🧠 Dataset Summary
This dataset contains synthetically generated academic text tailored for Named Entity Recognition (NER) in the software engineering domain. The synthetic data augments scholarly writing using large language models (LLMs), with entity consistency maintained via token preservation. The dataset is generated by merging and rephrasing pairs of annotated sentences from scholarly papers using… See the full description on the dataset page: https://huggingface.co/datasets/psresearch/augmented_dataset_llm_generated_NER.
https://www.technavio.com/content/privacy-notice
Multilingual LLM Market Size 2025-2029
The multilingual LLM market size is forecast to increase by USD 10.69 billion, at a CAGR of 31% from 2024 to 2029. Increasing globalization and the imperative for seamless cross-border communication will drive the multilingual LLM market.
Market Insights
North America dominated the market and is expected to account for 32% of growth during 2025-2029.
By Deployment - On-premises segment was valued at USD 933.40 billion in 2023
By Application - Content generation and curation segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 822.91 million
Market Future Opportunities 2024: USD 10,691.90 million
CAGR from 2024 to 2029: 31%
Market Summary
The Multilingual LLM (Large Language Model) market is experiencing significant growth due to the increasing globalization and imperative for seamless cross-border communication. As businesses expand internationally, the need for multilingual capabilities becomes crucial. This trend is further accentuated by the shift from text-centric to multimodal capabilities, as organizations seek to engage with customers in a more interactive and inclusive manner. However, the market also presents unique challenges. Data scarcity and quality for low-resource languages remain major hurdles, limiting the effectiveness of language models in these regions. To address this issue, there is a growing focus on collaborative efforts to build and improve multilingual datasets, as well as advancements in transfer learning and multilingual models.
Key technologies such as edge computing, augmented reality, and virtual reality are also contributing to the market's expansion. For instance, a global manufacturing company may rely on multilingual LLMs to optimize its supply chain by accurately processing and analyzing data from various sources in different languages. This can lead to improved operational efficiency, reduced errors, and enhanced customer satisfaction. Despite these benefits, the market faces ongoing challenges, including the need for continuous model improvement, data privacy concerns, and the ethical implications of language models in diverse cultural contexts.
What will be the size of the Multilingual LLM Market during the forecast period?
Get Key Insights on Market Forecast (PDF) Request Free Sample
The Multilingual LLM (Large Language Model) Market is an ever-evolving landscape, driven by the increasing demand for cross-lingual communication and understanding in businesses worldwide. A recent study reveals that over 70% of multinational corporations operate in more than one language, underscoring the necessity for advanced language models to facilitate seamless communication and streamline operations. Multilingual LLMs employ advanced techniques such as model scaling, syntactic parsing, and named entity recognition to understand and generate human language in various languages. These models have proven instrumental in various business applications, including customer support, content localization, and compliance with international regulations.
Moreover, the integration of multilingual LLMs into business processes has led to significant improvements in efficiency. For instance, companies have reported a 25% reduction in response time to customer queries in non-English languages, leading to enhanced customer satisfaction and loyalty. The continuous advancements in multilingual LLMs, including improvements in model performance benchmarks, ethical considerations, and responsible AI, ensure that businesses can effectively communicate and collaborate across linguistic and cultural boundaries. As the global business landscape becomes increasingly interconnected, the demand for multilingual LLMs is poised to grow, making it a strategic investment for companies seeking to expand their reach and foster international partnerships.
Unpacking the Multilingual LLM Market Landscape
In the dynamic business landscape, the multilingual Large Language Model (LLM) market continues to gain significance. Neural machine translation, fueled by advanced tokenization techniques and contextual understanding, delivers translation accuracy improvements of up to 20% compared to rule-based systems. Furthermore, language models employing interpretable AI and semantic analysis enhance cross-lingual transfer by 30%, aligning with business compliance requirements. Parallel text processing and domain adaptation techniques optimize machine translation quality, resulting in cost savings of up to 15% in localization projects. The integration of coherence assessment, text generation models, and model explainability via Meteor score metrics and deep learning architectures further boosts efficiency. Attention mechanisms, few-shot learning, and zero-shot learning enable seamless handling of diverse language data.
https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4605
Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle, e.g., services not yet existent at design time. A traditional approach is a registry that provides the API documentation of the systems’ endpoints. Large Language Models (LLMs) have shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. Within this work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves specification details on demand. We evaluate RAG for endpoint discovery using the RestBench benchmark, first, for the different chunking possibilities and parameters measuring the endpoint retrieval recall, precision, and F1 score. Then, we assess the Discovery Agent using the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce token count. While revealing high values for recall, precision, and F1, further research is necessary to retrieve all requisite endpoints. Our experiments show that for preprocessing, LLM-based and format-specific approaches outperform naïve chunking methods. Relying on an agent further enhances these results as the agent splits the tasks into multiple fine granular subtasks, improving the overall RAG performance in the token count, precision, and F1 score.
Content:
code.zip: Python source code to perform the experiments.
- evaluate.py: Script to execute the experiments (uncomment lines to select the embedding model).
- socrag/*: Source code for the RAG.
- benchmark/*: RestBench specification.
results.zip: Results of the RAG experiments (in the folder /results/data/ inside the zip file).
- Experiment results for the RAG: results_{embedding_model}_{top-k}.json.
- Experiment results for the Discovery Agent: results_{embedding_model}_{agent}_{refinement}_{llm}.json.
- FAISS store (intermediate data required for exact reproduction of results; one folder for each embedding model): bge_small, nvidia and oai.
- Intermediate data of the LLM-based refinement methods required for the exact reproduction of results: *_parser.json.
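A minimal sketch of endpoint-level chunking of an OpenAPI document, one of the preprocessing strategies discussed above; the chunk format and file name are illustrative, and the resulting chunks would then be embedded and indexed (e.g., in a FAISS store) for retrieval:

```python
import json

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}

def chunk_openapi(spec: dict) -> list[dict]:
    """Produce one retrieval chunk per (path, HTTP method) endpoint."""
    chunks = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            params = [p.get("name", "") for p in op.get("parameters", [])]
            text = (
                f"{method.upper()} {path}\n"
                f"summary: {op.get('summary', '')}\n"
                f"description: {op.get('description', '')}\n"
                f"parameters: {', '.join(params)}"
            )
            chunks.append({"endpoint": f"{method.upper()} {path}", "text": text})
    return chunks

with open("openapi.json") as f:  # illustrative file name
    chunks = chunk_openapi(json.load(f))
```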
Comprehensive training data on 1M+ stores across the US & Canada. Includes detailed menus, inventory, pricing, and availability. Ideal for AI/ML models, powering retrieval-augmented generation, search, and personalization systems.
This is a dataset with LLM-generated data to augment training data for OVOS intent classifiers. New sentences may be added over time, mostly focused on the intents with few samples or that the intent classifier model is having trouble learning.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
LAURA: Enhancing Code Review Generation with Context-Enriched Retrieval-Augmented LLM
Introduction
LAURA is an LLM-based retrieval-augmented, context-aware framework for code review generation, which integrates context augmentation, review exemplar retrieval, and prompt tuning to enhance the performance of LLMs (in our study, ChatGPT-4o and DeepSeek v3) in generating code review comments. The experiments show that LAURA outperforms the direct application of ChatGPT-4o and DeepSeek v3 for code review generation and significantly surpasses the performance of the pre-trained model CodeReviewer. Since our experiments are based on ChatGPT-4o and DeepSeek v3, we have released the data processing code and dataset used in our research. The code section includes the Python scripts we used for data collection, cleaning, merging, and retrieval. The dataset section contains 301k entries from 1,807 high-quality projects sourced from GitHub, covering four programming languages: C, C++, Java, and Python. We also provide the time-split dataset used as the retrieval database (which is also used for fine-tuning CodeReviewer) and the human-annotated evaluation dataset.
File Structure
codes: Data collection, filtering and post-processing code used in our study
- data_collection_and_filtering.py: Code for collecting data via the GitHub GraphQL API and filtering with rule-based and LLM-based methods
- data_embedding.py: Code for data embedding
- data_merging.py: Code for data merging, used to merge the review comments with the same target diff
- data_retrieval.py: Code for data retrieval
- diff_extension.py: Code for extending the code diffs by integrating the full code contexts into the diffs
datasets: Datasets built and used in our study
- database_for_retrieve.csv: The dataset we built for retrieval-augmented generation, containing 298,494 entries prior to December 26, 2024
- evaluation_data.csv: The evaluation dataset we manually annotated, containing 384 entries later than December 26, 2024
- full_dataset.csv: The full dataset we collected, containing 301,256 entries
prompts: The prompts used in data filtering, generation and evaluation
- direct_generation.txt: The prompt we used for direct generation as baselines
- LAURA_generation.txt: The prompt we used for LAURA generation
- LLM_evaluation.txt: The prompt we used for LLM evaluation
- LLM_filtering.txt: The prompt we used for LLM filtering in the data filtering process
README.md: Description of our submission
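A minimal sketch of the exemplar-retrieval step at the core of such a pipeline: given precomputed embeddings for the database entries and an embedding of the target diff, select the nearest review exemplars by cosine similarity and assemble them into prompt context. The embedding files and the review_comment column name are illustrative placeholders, not the released scripts' exact interface:

```python
import numpy as np
import pandas as pd

db = pd.read_csv("database_for_retrieve.csv")      # released retrieval database
db_embeddings = np.load("db_embeddings.npy")       # hypothetical precomputed entry embeddings
query_embedding = np.load("query_embedding.npy")   # hypothetical embedding of the target diff

# Cosine similarity between the query and every database entry
norms = np.linalg.norm(db_embeddings, axis=1) * np.linalg.norm(query_embedding)
scores = db_embeddings @ query_embedding / np.clip(norms, 1e-12, None)

top_k = np.argsort(scores)[::-1][:3]
exemplars = db.iloc[top_k]                         # nearest review exemplars
prompt_context = "\n\n".join(exemplars["review_comment"].astype(str))  # hypothetical column name
```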
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset demonstrates how to fuse Large Language Model (LLM) generated embeddings with Graph Neural Networks (GNNs) for learning on tabular graph data.
sample_nodes.csv – Node features including ID, category, and description text
sample_edges.csv – Edge list (source, target, weight)
sample_augmented_nodes.csv – Node features + LLM-generated embeddings (simulated)
GNN_LLM_Hybrid_Baseline.ipynb – Main baseline model using PyTorch Geometric
CSV_Processing_1.ipynb – Basic loading and EDA of nodes/edges
CSV_Processing_2.ipynb – Preview of LLM-augmented node features
This is a synthetic dataset. For real-world use:
- Replace the "LLM embeddings" with outputs from OpenAI / Mistral / HuggingFace models
- Extend the node descriptions with actual context or domain-specific text
- Scale to real-world graphs or use with competition tabular datasets
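A minimal sketch of assembling these CSVs into a PyTorch Geometric graph. The edge columns (source, target, weight) come from the file list above; the emb_ column prefix and the assumption of integer node IDs are placeholders that may differ from the actual files:

```python
import pandas as pd
import torch
from torch_geometric.data import Data

nodes = pd.read_csv("sample_augmented_nodes.csv")
edges = pd.read_csv("sample_edges.csv")

# LLM-embedding columns, assumed here to be prefixed with "emb_"
emb_cols = [c for c in nodes.columns if c.startswith("emb_")]
x = torch.tensor(nodes[emb_cols].values, dtype=torch.float)

# Assumes node IDs are consecutive integers 0..N-1 matching the row order
edge_index = torch.tensor(edges[["source", "target"]].values.T, dtype=torch.long)
edge_weight = torch.tensor(edges["weight"].values, dtype=torch.float)

graph = Data(x=x, edge_index=edge_index, edge_weight=edge_weight)
print(graph)
```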
50,000 Sets - Image Editing Data. The editing types include human attribute editing, image semantic editing, and image structure editing. The editing targets cover scenes such as people, animals, goods, plants, and landscapes. For annotation, the targets that need to be edited in each image are edited according to the editing instructions. The data can be used for tasks such as image synthesis, data augmentation, and virtual scene generation.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset is derived from these datasets: https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset and https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text/
Artificial errors are introduced into the above dataset in order to mimic the test data of the competition - "LLM - Detect AI Generated Text"
The notebook used for creating this dataset - https://www.kaggle.com/code/murugesann/daigt-v2-train-dataset-with-typos
USE ONLY VERSION 1 - IT HAS TYPOS INTRODUCED - IGNORE VERSIONS 2 AND 3
VERSIONS 2 & 3 ARE CORRECTED DATASETS (typos are corrected instead of being introduced - due to a mistake!! :-)
~~You can pin to different versions as follows:
version 1 - daigt-v2-train-dataset - entire dataset is introduced with typos - can be used for training with typos
version 2 - augmented-data-for-llm-detect-ai-generated-text is sampled with 9000 essays of equal label weight and introduced typos - can be used for CV testing
version 3 - daigt-v2-train-dataset is split into train and test - only test is introduced with typos - can be used for CV testing~~
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
CodeLLMExp is a comprehensive, large-scale, multi-language, and multi-vulnerability dataset created to advance research into the security of AI-generated code. It is specifically designed to train and evaluate machine learning models, such as Large Language Models (LLMs), on the joint tasks of Automated Vulnerability Localization (AVL) and Explainable AI (XAI).
The dataset was constructed through a rigorous pipeline that involved sourcing prompts from established security benchmarks (CodeLMSec, SecurityEval, Copilot CWE Scenarios), employing seed augmentation to ensure coverage of under-represented Common Weakness Enumerations (CWEs), and using a chain of LLMs to generate vulnerable code snippets. This raw data was then automatically evaluated for quality by an "LLM-as-judge" (validated against human experts with a Spearman correlation of 0.8545) and enriched with structured annotations.
CodeLLMExp covers three of the most widely used programming languages: Python, Java, and C. It contains 10,400 high-quality examples across Python (44.3%), Java (29.6%), and C (26.1%). It focuses on 29 distinct CWEs, including the complete CWE Top 25 Most Dangerous Software Errors (2024). Each record in the dataset provides a vulnerable code snippet, the precise line number of the flaw, a structured explanation (root cause, impact, mitigation), and a fixed version of the code.
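A minimal sketch of the per-record structure described above, with field names chosen for illustration (the released files may use different names):

```python
from dataclasses import dataclass

@dataclass
class CodeLLMExpRecord:
    language: str         # "python", "java", or "c"
    cwe_id: str           # e.g., "CWE-79"
    code: str             # vulnerable code snippet
    vulnerable_line: int  # line number of the flaw
    root_cause: str       # structured explanation: why the code is vulnerable
    impact: str           # structured explanation: what the flaw allows
    mitigation: str       # structured explanation: how to address it
    fixed_code: str       # repaired version of the snippet
```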
By providing richly annotated data for detection, classification, localization, and explanation, CodeLLMExp enables the development of more robust and transparent security analysis tools. It facilitates research into LLM adaptation strategies (e.g., prompting, fine-tuning, Retrieval-Augmented Generation), automated program repair, and the inherent security patterns of code produced by AI.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
IFC data has become the general building information standard for collaborative work in the construction industry. However, IFC data can be very complicated because it allows for multiple ways to represent the same product information. In this research, we utilise the capabilities of LLMs to parse the IFC data with the Graph Retrieval-Augmented Generation (Graph-RAG) technique to retrieve building object properties and their relations. We will show that, despite limitations due to the complex hierarchy of the IFC data, the Graph-RAG parsing enhances generative LLMs like GPT-4o with graph-based knowledge, enabling natural language query-response retrieval without the need for a complex pipeline.
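A minimal sketch of the graph-construction and retrieval side of such a Graph-RAG setup, using ifcopenshell and networkx; the choice of relations, the file name, and the way the retrieved neighbourhood is serialised as LLM prompt context are simplified placeholders, not the authors' pipeline:

```python
import ifcopenshell
import networkx as nx

model = ifcopenshell.open("building.ifc")  # illustrative file name
graph = nx.Graph()

# Nodes: building products; edges: spatial containment relations
for product in model.by_type("IfcProduct"):
    graph.add_node(product.GlobalId, name=product.Name, type=product.is_a())

for rel in model.by_type("IfcRelContainedInSpatialStructure"):
    for element in rel.RelatedElements:
        graph.add_edge(rel.RelatingStructure.GlobalId, element.GlobalId)

def context_for(global_id: str) -> str:
    """Serialise a node and its neighbours as plain-text context for an LLM prompt."""
    lines = [
        f"{graph.nodes[n].get('type')}: {graph.nodes[n].get('name')}"
        for n in [global_id, *graph.neighbors(global_id)]
    ]
    return "\n".join(lines)
```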