License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Every major LLM and chatbot released since 2018, with the developing company and the number of parameters (in billions) used in training.
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
I'm currently writing a research paper on AI Detection and its accuracy/effectiveness. While doing so, over the past few months I've generated a large amount of text using various LLMs. This is a dataset/corpus containing all of the data I generated/gathered as well as the text that was generated by various other users.
If you have any questions, please post them on the Discussion page or contact me through Kaggle. Generating all of this took many hours of work and a few hundred dollars; all I ask in return is that you credit me if you find this dataset useful in your research. An upvote would also mean the world.
P.S. The picture is of my dog, Tessa, who passed away recently. I wasn't sure what to put as the picture, so I thought she was better than nothing.
Here are the datasets I used in addition to the text I generated. Please upvote them!
License: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
llm-japanese-dataset
A Japanese instruction (chat) dataset for building LLMs. It is mainly intended for tuning LLMs built primarily on English data (e.g., via LoRA) on chat (instruction) response tasks. Note: we made use of a variety of publicly available language resources, and we take this opportunity to thank everyone involved.
Updates
2023/5/15: Following the Alpaca dataset's license change to NC, we dropped that dataset from this one so it can be used with confidence; the post-drop dataset is available as of v1.0.1. 2024/1/4: We removed outputs consisting only of whitespace from the Wikipedia summary data and updated to a newer Wikipedia version (20240101) (v1.0.2). 2024/1/18: We removed missing outputs from the Asian Language Treebank (ALT) dataset (v1.0.3).… See the full description on the dataset page: https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset.
This is the GPT4-LLM dataset from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. It has been filtered of all OpenAI disclaimers and refusals. (Disclaimer: it may have removed some additional things besides just OAI disclaimers, as I used the following script, which is a bit more broad: https://huggingface.co/datasets/ehartford/WizardLM_alpaca_evol_instruct_70k_unfiltered/blob/main/wizardlm_clean.py) There is a modified script of that in the repo that was used specifically for… See the full description on the dataset page: https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset contains structured test scenarios for evaluating Large Language Models (LLMs) using PromptFoo. It includes bias detection, robustness, hallucination, adversarial attacks, and security vulnerability tests. Some test cases are deliberately incorrect to analyze model resilience and error-handling capabilities. This dataset can be used for automated testing in CI/CD pipelines, model fine-tuning, and prompt optimization workflows.
I created this dataset using gpt-3.5-turbo.
I put a lot of effort into making this dataset high quality, which allows you to achieve the highest score among the publicly available notebooks at the moment! 🥳
Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.
I am now uploading another 6k completely new train examples (6000_train_examples.csv), which brings the total to 6.5k.
If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Combined Medical Vision-Language Dataset
Dataset Description
Comprehensive medical vision-language dataset with 4793 samples for vision-based LLM training.
Dataset Statistics
Total Samples: 4793
Training Samples: 3834
Validation Samples: 959
Modality Distribution
X-ray: 2325 samples
CT: 1351 samples
Unknown: 812 samples
MRI: 231 samples
Ultrasound: 70 samples
Microscopy: 2 samples
Endoscopy: 2 samples
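As a quick consistency check, the modality counts above sum to the stated 4793 total, which also matches the 3834/959 train/validation split:

```python
# Modality counts as listed on the dataset card.
modalities = {"X-ray": 2325, "CT": 1351, "Unknown": 812, "MRI": 231,
              "Ultrasound": 70, "Microscopy": 2, "Endoscopy": 2}

total = sum(modalities.values())
print(total)       # 4793
print(3834 + 959)  # 4793 — train + validation splits agree with the total
```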
Body Part Distribution
Unknown:… See the full description on the dataset page: https://huggingface.co/datasets/robailleo/medical-vision-llm-dataset.
This dataset involves popular large language models (LLMs) that are used for deep learning and the training of artificial intelligence. These LLMs have different uses and data, so I decided to summarize and share information about each LLM. Please give credit to the creators or managers of the LLMs if you decide to use them for any purpose.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task: separating LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open-source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets (DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset)
- Synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following:
- Contrastive search
- Guidance scale, typical_p, and suppress_tokens
- High temperature and large values of top-k
- Fill-in-the-blank prompting: randomly masking words in an essay and asking the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content-detection systems and of the obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduced obfuscations
- Back-translation
- Random capitalization
- Sentence swapping
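The lighter character-level augmentations above (random capitalization, character swapping) can be sketched roughly as follows. This is a minimal illustration, not the team's actual pipeline, and the function names are our own:

```python
import random

def random_capitalization(text: str, p: float, seed: int = 0) -> str:
    """Upper-case each character independently with probability p."""
    rng = random.Random(seed)
    return "".join(c.upper() if rng.random() < p else c for c in text)

def swap_adjacent_chars(text: str, p: float, seed: int = 0) -> str:
    """Swap the two middle characters of a word with probability p per word."""
    rng = random.Random(seed)
    out = []
    for word in text.split(" "):
        if len(word) >= 4 and rng.random() < p:
            i = len(word) // 2
            word = word[:i - 1] + word[i] + word[i - 1] + word[i + 1:]
        out.append(word)
    return " ".join(out)

essay = "the electoral college is a process not a place"
print(random_capitalization(essay, p=0.3))
print(swap_adjacent_chars(essay, p=0.5))
```

Seeding the RNG keeps augmentations reproducible across datamix iterations.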
License: CC BY-NC-ND 4.0 (https://creativecommons.org/licenses/by-nc-nd/4.0/)
License information was derived automatically
The dataset contains prompts and texts generated by the Large Language Models (LLMs) in 32 different languages. The prompts are short sentences or phrases for the model to generate text. The texts generated by the LLM are responses to these prompts and can vary in length and complexity.
Researchers and developers can use this dataset to train and fine-tune their own language models for multilingual applications. The dataset provides a rich and diverse collection of outputs from the model, demonstrating its ability to generate coherent and contextually relevant text in multiple languages.
Arabic, Azerbaijani, Catalan, Chinese, Czech, Danish, German, Greek, English, Esperanto, Spanish, Persian, Finnish, French, Irish, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malayalam, Marathi, Dutch, Polish, Portuguese, Portuguese (Brazil), Slovak, Swedish, Thai, Turkish, Ukrainian
The CSV file includes the following columns:
- from_language: language the prompt is written in
- model: model type (GPT-3.5, GPT-4, or an uncensored GPT version)
- time: time when the answer was generated
- text: user prompt
- response: response generated by the model
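Assuming the columns above, the file can be summarized with Python's standard csv module. A sketch only; the filename and helper name are our own:

```python
import csv
from collections import Counter

def count_models(path: str) -> Counter:
    """Count how many responses each model produced in the CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        # DictReader keys rows by the header: from_language, model, time, text, response
        return Counter(row["model"] for row in csv.DictReader(f))

# Example (hypothetical filename):
# print(count_models("multilingual_llm_responses.csv"))
```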
🚀 You can learn more about our high-quality unique datasets here
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by "always try your best"
Released under Apache 2.0
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset Details
“CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal Impacts” is a dataset designed to evaluate the social and cultural variation of Large Language Models (LLMs) towards socially sensitive topics across multiple languages and cultures. The hand-crafted, multilingual dataset of statements addresses value-laden topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy. CIVICS is designed to elicit responses from LLMs… See the full description on the dataset page: https://huggingface.co/datasets/llm-values/CIVICS.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset (specifically the file Mistral7B_CME_v7.csv) consists of 4,900 LLM-generated texts.
(Note: versions 1 to 6 are redundant, and are only kept so as not to break any notebooks that use them)
Update: The new file Mistral7B_CME_v7_15_percent_corruption.csv has also been added as per the discussion "Alternative approach - Simulating hidden dataset".
v1: 700 LLM texts for prompt 6 "*Exploring Venus*" for use in the LLM - Detect AI Generated Text competition.
v2: + 700 LLM texts for prompt 8 "*The Face on Mars*"
v3: + 700 LLM texts for prompt 4 "*A Cowboy Who Rode the Waves*"
v4: + 700 LLM texts for prompt 11 "*Driverless cars*"
v5: + 700 LLM texts for prompt 7 "*Facial action coding system*"
v6: + 700 LLM texts for prompt 2 "*Car-free cities*"
v7: + 700 LLM texts for prompt 12 "*Does the electoral college work?*"
Photo credit: Image of Venus by NASA.
priamai/cti-llm-datasets dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset is meant for use in the LLM - Detect AI Generated Text competition. It is derived from https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset/
It has 12% typos introduced into both human and LLM essays, using an incorrect pyspell function that treats a letter following an apostrophe as a typo.
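The corruption step described above can be approximated like this. A rough sketch of the idea only, not the author's exact pyspell-based procedure; the helper below simply swaps a word's first two letters at the stated 12% rate:

```python
import random

def introduce_typos(text: str, rate: float = 0.12, seed: int = 42) -> str:
    """Introduce a simple typo (swap the first two letters) into
    roughly `rate` of the words, reproducibly via a seeded RNG."""
    rng = random.Random(seed)
    words = []
    for w in text.split(" "):
        if len(w) >= 2 and rng.random() < rate:
            w = w[1] + w[0] + w[2:]
        words.append(w)
    return " ".join(words)

print(introduce_typos("cars should be banned from city centers"))
```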
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
## Overview
Dual LLM is a dataset for object detection tasks - it contains Person annotations for 1,234 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
About 10,000 texts rewritten with Gemma 7B-it; the original texts come from the "Support" column of the train.csv file in the SciQ (Scientific Question Answering) dataset.
If you find it useful, please upvote!
santhoshkammari/arxiv-llm-papers-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Privacy notice: https://www.technavio.com/content/privacy-notice
Open-Source LLM Market Size 2025-2029
The open-source LLM market is projected to grow by USD 54 billion, at a CAGR of 33.7%, from 2024 to 2029. Increasing democratization and compelling economics will drive the open-source LLM market.
Market Insights
North America dominated the market and is expected to account for 37% of growth during 2025-2029.
By Application - Technology and software segment was valued at USD 4.02 billion in 2023
By Deployment - On-premises segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 575.60 million
Market Future Opportunities 2024: USD 53,995.50 million
CAGR from 2024 to 2029: 33.7%
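As a rough sanity check on the figures above, a 33.7% CAGR compounded over the five years from 2024 to 2029 implies the market grows by a factor of about (1.337)^5 ≈ 4.27:

```python
# Compound annual growth rate check for the stated 2024-2029 forecast.
cagr = 0.337
years = 5
multiplier = (1 + cagr) ** years
print(f"implied growth multiplier: {multiplier:.2f}x")  # ≈ 4.27x
```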
Market Summary
The Open-Source Large Language Model (LLM) market has experienced significant growth due to the increasing democratization of artificial intelligence (AI) technology and its compelling economics. This global trend is driven by the proliferation of smaller organizations seeking to leverage advanced language models for various applications, including supply chain optimization, compliance, and operational efficiency. Open-source LLMs offer several advantages over proprietary models. They provide greater flexibility, as users can modify and adapt the models to their specific needs. Additionally, open-source models often have larger training datasets, leading to improved performance and accuracy. However, there are challenges to implementing open-source LLMs, such as the prohibitive computational costs and critical hardware dependency. These obstacles necessitate the development of more efficient algorithms and the exploration of cloud computing solutions.
A real-world business scenario illustrates the potential benefits of open-source LLMs. A manufacturing company aims to optimize its supply chain by implementing an AI-powered system to analyze customer demand patterns and predict inventory needs. The company chooses an open-source LLM due to its flexibility and cost-effectiveness. By integrating the LLM into its supply chain management system, the company can improve forecasting accuracy and reduce inventory costs, ultimately increasing operational efficiency and customer satisfaction. Despite the challenges, the market continues to grow as organizations recognize the potential benefits of advanced language models. The democratization of AI technology and the compelling economics of open-source solutions make them an attractive option for businesses of all sizes.
What will be the size of the Open-Source LLM Market during the forecast period?
Get Key Insights on Market Forecast (PDF) Request Free Sample
The Open-Source Large Language Model (LLM) Market continues to evolve, offering businesses innovative solutions for various applications. One notable trend is the increasing adoption of explainable AI (XAI) methods in LLMs. XAI models provide transparency into the reasoning behind their outputs, addressing concerns around bias mitigation and interpretability. This transparency is crucial for industries with stringent compliance requirements, such as finance and healthcare. For instance, a recent study reveals that companies implementing XAI models have achieved a 25% increase in model acceptance rates among stakeholders, leading to more informed decisions. This improvement can significantly impact product strategy and budgeting, as businesses can confidently invest in AI solutions that align with their ethical and regulatory standards.
Moreover, advancements in LLM architecture include encoder-decoder architectures, multi-head attention, and self-attention layers, which enhance feature extraction and model scalability. These improvements contribute to better performance and more accurate results, making LLMs an essential tool for businesses seeking to optimize their operations and gain a competitive edge. In summary, the market is characterized by continuous innovation and a strong focus on delivering human-centric solutions. The adoption of explainable AI methods and advancements in neural network architecture are just a few examples of how businesses can benefit from these technologies. By investing in Open-Source LLMs, organizations can improve efficiency, enhance decision-making, and maintain a responsible approach to AI implementation.
Unpacking the Open-Source LLM Market Landscape
In the dynamic landscape of large language models (LLMs), open-source solutions have gained significant traction, offering businesses competitive advantages through data augmentation and few-shot learning capabilities. Compared to traditional models, open-source LLMs enable a 30% reduction in optimizer selection time and a 25% improvement in model accuracy for summarization tasks. Furthermore, distributed training and model compression techniques allow businesses to process larger training dataset sizes with minimal tokenization process disruptions, result…