100+ datasets found
  1. LLMs Data (2018-2024)

    • kaggle.com
    zip
    Updated May 19, 2024
    Cite
    jaina (2024). LLMs Data (2018-2024) [Dataset]. https://www.kaggle.com/datasets/jainaru/llms-data-2018-2024
    Explore at:
    zip (23,351 bytes). Available download formats
    Dataset updated
    May 19, 2024
    Authors
    jaina
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Every major LLM and chatbot released since 2018, with the developing company and the number of parameters (in billions) used in training.

    Data Columns

    1. Model: The name of the language model.
    2. Company: The company that developed the model.
    3. Arch: The architecture of the model (e.g., Transformer, RNN). TBA means To Be Announced.
    4. Parameters: The number of parameters (weights) in the model, in billions; a rough measure of its complexity.
    5. Tokens: The number of tokens (sub-word units) the model was trained on, in billions. Some values are TBA.
    6. Ratio: Likely the ratio of parameters to tokens, or some other relevant ratio. In this table, it is specified only for Olympus, as 20:1.
    7. ALScore: A quick-and-dirty rating of the model's power, computed as the square root of (Parameters x Tokens).
    8. Training dataset: The dataset used to train the model.
    9. Release Date: The expected or actual release date of the model.
    10. Notes: Additional notes about the model, such as training details or related information.
    11. Playground: A URL linking to a website where you can interact with the model or find more information about it.
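The ALScore formula described in column 7 can be sketched in a few lines (the example model figures below are invented for illustration, not rows from the dataset):

```python
import math

def alscore(parameters_b: float, tokens_b: float) -> float:
    """Quick-and-dirty power rating: sqrt(parameters x tokens),
    with both values expressed in billions, as the column notes describe."""
    return math.sqrt(parameters_b * tokens_b)

# Example: a hypothetical 70B-parameter model trained on 2,000B tokens.
print(round(alscore(70, 2000), 1))  # 374.2
```

Because both inputs are in billions, ALScore stays in a compact, comparable range across models of very different sizes.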
  2. Human vs. LLM Text Corpus

    • kaggle.com
    zip
    Updated Jan 10, 2024
    Cite
    Zachary Grinberg (2024). Human vs. LLM Text Corpus [Dataset]. https://www.kaggle.com/datasets/starblasters8/human-vs-llm-text-corpus
    Explore at:
    zip (2,059,496,493 bytes). Available download formats
    Dataset updated
    Jan 10, 2024
    Authors
    Zachary Grinberg
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    I'm currently writing a research paper on AI Detection and its accuracy/effectiveness. While doing so, over the past few months I've generated a large amount of text using various LLMs. This is a dataset/corpus containing all of the data I generated/gathered as well as the text that was generated by various other users.

    If you have any questions please post them on the Discussion page or contact me through Kaggle. Generating all of this took many hours of work and a few hundred dollars, all I ask in return is that you credit me if you find this dataset useful in your research. Also, an upvote would mean the world.

    Ps. The picture is of my dog, Tessa, who passed away recently. I wasn't sure what to put as the picture so I thought that was better than nothing.

    Here are the datasets I used in addition to the text I generated (please upvote them!):

  3. llm-japanese-dataset

    • huggingface.co
    • opendatalab.com
    Updated Jan 18, 2024
    Cite
    Izumi Lab. (2024). llm-japanese-dataset [Dataset]. https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 18, 2024
    Dataset authored and provided by
    Izumi Lab.
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    llm-japanese-dataset

    A Japanese instruction (chat) dataset for building LLMs. It is mainly intended for tuning LLMs that were built on English data (e.g., with LoRA) for chat (instruction) response tasks. Note: it draws on a variety of publicly available language resources; we take this opportunity to thank everyone involved.

      updates
    

    In response to the Alpaca dataset's license change to NC on 2023/5/15, we dropped that dataset so this one can be used with confidence; the dataset without it is available as of v1.0.1. On 2024/1/4 we removed outputs consisting only of whitespace from the Wikipedia summaries and updated the Wikipedia snapshot (20240101) (v1.0.2). On 2024/1/18 we removed missing outputs from the Asian Language Treebank (ALT) dataset (v1.0.3).… See the full description on the dataset page: https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset.

  4. GPT4-LLM-Cleaned

    • huggingface.co
    • opendatalab.com
    Updated May 15, 2023
    Cite
    Teknium (2023). GPT4-LLM-Cleaned [Dataset]. https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 15, 2023
    Authors
    Teknium
    Description

    This is the GPT4-LLM dataset from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. It has been filtered of all OpenAI disclaimers and refusals. (Disclaimer: it may have removed some additional things besides just OAI disclaimers, as I used the following script, which is a bit broader: https://huggingface.co/datasets/ehartford/WizardLM_alpaca_evol_instruct_70k_unfiltered/blob/main/wizardlm_clean.py) There is a modified script of that in the repo that was used specifically for… See the full description on the dataset page: https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned.

  5. Comprehensive LLM Evaluation Dataset

    • kaggle.com
    Updated Feb 10, 2025
    Cite
    Nezahat Korkmaz (2025). Comprehensive LLM Evaluation Dataset [Dataset]. https://www.kaggle.com/datasets/nezahatkk/llm-eval-tests-dataset-json
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nezahat Korkmaz
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains structured test scenarios for evaluating Large Language Models (LLMs) using PromptFoo. It includes bias detection, robustness, hallucination, adversarial attacks, and security vulnerability tests. Some test cases are deliberately incorrect to analyze model resilience and error-handling capabilities. This dataset can be used for automated testing in CI/CD pipelines, model fine-tuning, and prompt optimization workflows.
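As a rough illustration of what a PromptFoo test scenario looks like (the provider name, prompt template, and expected values below are invented, not rows from this dataset), a minimal configuration might be:

```yaml
# promptfooconfig.yaml (illustrative sketch; all values are assumptions)
prompts:
  - "Answer factually: {{question}}"
providers:
  - openai:gpt-4o-mini
tests:
  # hallucination check: the response should contain the correct fact
  - vars:
      question: "Who wrote the novel Dune?"
    assert:
      - type: contains
        value: "Frank Herbert"
```

Test cases like this can be run in CI/CD pipelines to catch regressions in bias, robustness, or hallucination behavior after model or prompt changes.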

  6. 📊 6.5k train examples for LLM Science Exam 📝

    • kaggle.com
    Updated Jul 22, 2023
    Cite
    Radek Osmulski (2023). 📊 6.5k train examples for LLM Science Exam 📝 [Dataset]. https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Radek Osmulski
    Description

    I created this dataset using gpt-3.5-turbo.

    I put a lot of effort into making this dataset high quality, which allows you to achieve the highest score among the publicly available notebooks at the moment! 🥳

    Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.

    I am now uploading another 6k (6000_train_examples.csv) completely new train examples which brings the total to 6.5k.

    If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏

  7. h

    medical-vision-llm-dataset

    • huggingface.co
    Updated Sep 30, 2025
    Cite
    Robail Yasrab (2025). medical-vision-llm-dataset [Dataset]. https://huggingface.co/datasets/robailleo/medical-vision-llm-dataset
    Explore at:
    Dataset updated
    Sep 30, 2025
    Authors
    Robail Yasrab
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Combined Medical Vision-Language Dataset

      Dataset Description
    

    Comprehensive medical vision-language dataset with 4793 samples for vision-based LLM training.

      Dataset Statistics
    

    Total Samples: 4793
    Training Samples: 3834
    Validation Samples: 959

      Modality Distribution
    

    X-ray: 2325 samples
    CT: 1351 samples
    Unknown: 812 samples
    MRI: 231 samples
    Ultrasound: 70 samples
    Microscopy: 2 samples
    Endoscopy: 2 samples

      Body Part Distribution
    

    Unknown:… See the full description on the dataset page: https://huggingface.co/datasets/robailleo/medical-vision-llm-dataset.

  8. Large Language Model (LLM) Comparisons

    • kaggle.com
    zip
    Updated Aug 20, 2023
    Cite
    Dylan Karmin (2023). Large Language Model (LLM) Comparisons [Dataset]. https://www.kaggle.com/datasets/dylankarmin/llm-datasets-comparison
    Explore at:
    zip (2,596 bytes). Available download formats
    Dataset updated
    Aug 20, 2023
    Authors
    Dylan Karmin
    Description

    This dataset involves popular large language models (LLMs) that are used for deep learning and the training of artificial intelligence. These LLMs have different uses and data, so I decided to summarize and share information about each LLM. Please give credit to the creators or managers of the LLMs if you decide to use them for any purpose.

  9. LLM - Detect AI Datamix

    • kaggle.com
    zip
    Updated Jan 19, 2024
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26
    Explore at:
    zip (172,818,297 bytes). Available download formats
    Dataset updated
    Jan 19, 2024
    Authors
    Raja Biswas
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task to separate LLM-generated essays from student-written ones.

    It was developed in an incremental way focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blindspots of the previous generation models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire PERSUADE corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:
    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open-source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM-generated text datasets:
      • DAIGT V2 subset
      • OUTFOX
      • Ghostbuster
      • gpt-2-output-dataset
    • Synthetic dataset made by T5

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity & complexity in the data. Generated essays leveraged a combination of the following:
    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature & large values of top-k
    • Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonyms
    • Introduced obfuscations
    • Back translation
    • Random capitalization
    • Sentence swapping
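A few of the character-level augmentations listed above can be sketched as follows (a minimal illustration of the idea, not the team's actual code; the rates and operations chosen here are assumptions):

```python
import random

def augment(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly delete, swap, or re-capitalize characters to mimic
    typical obfuscations aimed at LLM-detection systems (illustrative only)."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "swap", "case"])
            if op == "delete":
                i += 1          # drop this character entirely
                continue
            if op == "swap" and i + 1 < len(chars):
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                c = chars[i]    # emit the swapped-in character
            elif op == "case":
                c = c.swapcase()
        out.append(c)
        i += 1
    return "".join(out)

print(augment("The essays were augmented before training.", rate=0.3))
```

Seeding the RNG keeps the augmentation reproducible, which matters when the same datamix must be regenerated across training iterations.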

  10. LLM-DataSet

    • kaggle.com
    zip
    Updated Jan 22, 2024
    Cite
    Ali (2024). LLM-DataSet [Dataset]. https://www.kaggle.com/datasets/asalhi/llm-dataset
    Explore at:
    zip (34,674,135 bytes). Available download formats
    Dataset updated
    Jan 22, 2024
    Authors
    Ali
    Description

    Dataset

    This dataset was created by Ali

    Contents

  11. LLM Question-Answer Dataset

    • kaggle.com
    zip
    Updated Mar 6, 2024
    Cite
    Unique Data (2024). LLM Question-Answer Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/llm-dataset/code
    Explore at:
    zip (543,652 bytes). Available download formats
    Dataset updated
    Mar 6, 2024
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    LLM Dataset - Prompts and Generated Texts

    The dataset contains prompts and texts generated by the Large Language Models (LLMs) in 32 different languages. The prompts are short sentences or phrases for the model to generate text. The texts generated by the LLM are responses to these prompts and can vary in length and complexity.

    Researchers and developers can use this dataset to train and fine-tune their own language models for multilingual applications. The dataset provides a rich and diverse collection of outputs from the model, demonstrating its ability to generate coherent and contextually relevant text in multiple languages.

    👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset - Full dataset

    Models used for text generation:

    • GPT-3.5,
    • GPT-4

    Languages in the dataset:

    Arabic, Azerbaijani, Catalan, Chinese, Czech, Danish, German, Greek, English, Esperanto, Spanish, Persian, Finnish, French, Irish, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malayalam, Marathi, Dutch, Polish, Portuguese, Portuguese (Brazil), Slovak, Swedish, Thai, Turkish, Ukrainian


    🧩 This is just an example of the data. Leave a request here to learn more

    Content

    The CSV file includes the following data:
    • from_language: language the prompt is made in,
    • model: type of the model (GPT-3.5, GPT-4 and Uncensored GPT Version),
    • time: time when the answer was generated,
    • text: user prompt,
    • response: response generated by the model
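Given the columns described above, the file can be read with Python's standard csv module (the sample row below is invented for illustration and is not taken from the dataset):

```python
import csv
import io

# Invented sample matching the documented column layout.
sample = io.StringIO(
    "from_language,model,time,text,response\n"
    'English,GPT-4,2024-03-01 12:00:00,"Describe a sunset.",'
    '"The sky fades from gold to violet."\n'
)

for row in csv.DictReader(sample):
    print(row["model"], "->", row["response"])
```

DictReader maps each row to the header names, so downstream code can select prompt/response pairs by column name rather than position.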

    🚀 You can learn more about our high-quality unique datasets here

    keywords: dataset, machine learning, natural language processing, artificial intelligence, deep learning, neural networks, text generation, language models, openai, gpt-3, data science, predictive modeling, sentiment analysis, keyword extraction, text classification, sequence-to-sequence models, attention mechanisms, transformer architecture, word embeddings, glove embeddings, chatbots, question answering, language understanding, text mining, information retrieval, data preprocessing, feature engineering, explainable ai, model deployment

  12. llm-dataset

    • kaggle.com
    zip
    Updated Jan 19, 2024
    Cite
    always try your best (2024). llm-dataset [Dataset]. https://www.kaggle.com/datasets/haroldlee02/llm-dataset
    Explore at:
    zip (70,774,176 bytes). Available download formats
    Dataset updated
    Jan 19, 2024
    Authors
    always try your best
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by always try your best

    Released under Apache 2.0

    Contents

  13. CIVICS

    • huggingface.co
    Updated May 19, 2024
    Cite
    llm-values (2024). CIVICS [Dataset]. https://huggingface.co/datasets/llm-values/CIVICS
    Explore at:
    Dataset updated
    May 19, 2024
    Dataset authored and provided by
    llm-values
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Details

    “CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal Impacts” is a dataset designed to evaluate the social and cultural variation of Large Language Models (LLMs) towards socially sensitive topics across multiple languages and cultures. The hand-crafted, multilingual dataset of statements addresses value-laden topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy. CIVICS is designed to elicit responses from LLMs… See the full description on the dataset page: https://huggingface.co/datasets/llm-values/CIVICS.

  14. LLM: Mistral-7B Instruct texts

    • kaggle.com
    zip
    Updated Nov 29, 2023
    Cite
    Carl McBride Ellis (2023). LLM: Mistral-7B Instruct texts [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts
    Explore at:
    zip (14,702,522 bytes). Available download formats
    Dataset updated
    Nov 29, 2023
    Authors
    Carl McBride Ellis
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset (specifically the file Mistral7B_CME_v7.csv) consists of 4,900 LLM-generated texts. (Note: versions 1 to 6 are redundant, and are only kept so as not to break any notebooks that use them.)

    Update: The new file Mistral7B_CME_v7_15_percent_corruption.csv has also been added as per the discussion "Alternative approach - Simulating hidden dataset".

    v1: 700 LLM texts for prompt 6 "*Exploring Venus*" for use in the LLM - Detect AI Generated Text competition.

    v2: + 700 LLM texts for prompt 8 "*The Face on Mars*"

    v3: + 700 LLM texts for prompt 4 "*A Cowboy Who Rode the Waves*"

    v4: + 700 LLM texts for prompt 11 "*Driverless cars*"

    v5: + 700 LLM texts for prompt 7 "*Facial action coding system*"

    v6: + 700 LLM texts for prompt 2 "*Car-free cities*"

    v7: + 700 LLM texts for prompt 12 "*Does the electoral college work?*"

    Photo credit: Image of Venus by NASA.

  15. cti-llm-datasets

    • huggingface.co
    Cite
    Paolo, cti-llm-datasets [Dataset]. https://huggingface.co/datasets/priamai/cti-llm-datasets
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Paolo
    Description

    priamai/cti-llm-datasets dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. LLM-AI-Detect-Dataset-with-typos-2

    • kaggle.com
    zip
    Updated Nov 27, 2023
    Cite
    Murugesan Narayanaswamy (2023). LLM-AI-Detect-Dataset-with-typos-2 [Dataset]. https://www.kaggle.com/datasets/murugesann/llm-ai-detect-dataset-with-typos-2
    Explore at:
    zip (30,501,283 bytes). Available download formats
    Dataset updated
    Nov 27, 2023
    Authors
    Murugesan Narayanaswamy
    Description

    This dataset is meant for use in LLM-Detect-AI-Generated-Text competition. It is derived from this dataset https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset/

    It has 12% typos introduced into both human and LLM essays, using an incorrect pyspell function that treats a letter following an apostrophe as a typo.
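A 12% word-level typo rate like the one described can be sketched as follows (an illustration only; the author's actual script used pyspell, and the swap operation and seed here are assumptions):

```python
import random

def add_typos(text: str, word_rate: float = 0.12, seed: int = 42) -> str:
    """Introduce a simple character-swap typo into roughly `word_rate`
    of the words (illustrative sketch, not the dataset's actual script)."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < word_rate:
            j = rng.randrange(len(w) - 1)      # pick adjacent pair to swap
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

print(add_typos("This essay was written entirely by a human author."))
```

Applying the same corruption to both human and LLM essays keeps the typo signal from leaking label information to the classifier.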

  17. Dual Llm Dataset

    • universe.roboflow.com
    zip
    Updated Oct 9, 2025
    Cite
    twowheeledvehicle (2025). Dual Llm Dataset [Dataset]. https://universe.roboflow.com/twowheeledvehicle/dual-llm-eubyj
    Explore at:
    zip. Available download formats
    Dataset updated
    Oct 9, 2025
    Dataset authored and provided by
    twowheeledvehicle
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Bounding Boxes
    Description

    Dual LLM

    ## Overview
    
    Dual LLM is a dataset for object detection tasks - it contains Person annotations for 1,234 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  18. 10K rewritten texts dataset/LLM Prompt Recovery

    • kaggle.com
    zip
    Updated Apr 8, 2024
    Cite
    Aisha AL Mahmoud (2024). 10K rewritten texts dataset/LLM Prompt Recovery [Dataset]. https://www.kaggle.com/datasets/aishaalmahmoud/10k-rewritten-texts-datasetllm-prompt-recovery
    Explore at:
    zip (0 bytes). Available download formats
    Dataset updated
    Apr 8, 2024
    Authors
    Aisha AL Mahmoud
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    About 10,000 texts rewritten using Gemma 7b-it; the original texts come from the "Support" column of the train.csv file in the SciQ (Scientific Question Answering) dataset.

    If you find it useful, please upvote it.

  19. arxiv-llm-papers-dataset

    • huggingface.co
    Updated Jul 25, 2024
    Cite
    kammari santhosh (2024). arxiv-llm-papers-dataset [Dataset]. https://huggingface.co/datasets/santhoshkammari/arxiv-llm-papers-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 25, 2024
    Authors
    kammari santhosh
    Description

    santhoshkammari/arxiv-llm-papers-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. Open-Source LLM Market Analysis, Size, and Forecast 2025-2029: North America...

    • technavio.com
    pdf
    Updated Jul 10, 2025
    Cite
    Technavio (2025). Open-Source LLM Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, and UK), APAC (China, India, Japan, and South Korea), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/open-source-llm-market-industry-analysis
    Explore at:
    pdf. Available download formats
    Dataset updated
    Jul 10, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United Kingdom, Germany, United States
    Description

    Snapshot img

    Open-Source LLM Market Size 2025-2029

    The open-source LLM market is projected to grow by USD 54 billion at a CAGR of 33.7% from 2024 to 2029. Increasing democratization and compelling economics will drive the open-source LLM market.

    Market Insights

    North America dominated the market, accounting for 37% of growth during 2025-2029.
    By Application - Technology and software segment was valued at USD 4.02 billion in 2023
    By Deployment - On-premises segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 575.60 million
    Market Future Opportunities 2024: USD 53,995.50 million
    CAGR from 2024 to 2029: 33.7%
    

    Market Summary

    The Open-Source Large Language Model (LLM) market has experienced significant growth due to the increasing democratization of artificial intelligence (AI) technology and its compelling economics. This global trend is driven by the proliferation of smaller organizations seeking to leverage advanced language models for various applications, including supply chain optimization, compliance, and operational efficiency. Open-source LLMs offer several advantages over proprietary models. They provide greater flexibility, as users can modify and adapt the models to their specific needs. Additionally, open-source models often have larger training datasets, leading to improved performance and accuracy. However, there are challenges to implementing open-source LLMs, such as the prohibitive computational costs and critical hardware dependency. These obstacles necessitate the development of more efficient algorithms and the exploration of cloud computing solutions.
    A real-world business scenario illustrates the potential benefits of open-source LLMs. A manufacturing company aims to optimize its supply chain by implementing an AI-powered system to analyze customer demand patterns and predict inventory needs. The company chooses an open-source LLM due to its flexibility and cost-effectiveness. By integrating the LLM into its supply chain management system, the company can improve forecasting accuracy and reduce inventory costs, ultimately increasing operational efficiency and customer satisfaction. Despite the challenges, the market continues to grow as organizations recognize the potential benefits of advanced language models. The democratization of AI technology and the compelling economics of open-source solutions make them an attractive option for businesses of all sizes.
    

    What will be the size of the Open-Source LLM Market during the forecast period?

    Get Key Insights on Market Forecast (PDF) Request Free Sample

    The Open-Source Large Language Model (LLM) Market continues to evolve, offering businesses innovative solutions for various applications. One notable trend is the increasing adoption of explainable AI (XAI) methods in LLMs. XAI models provide transparency into the reasoning behind their outputs, addressing concerns around bias mitigation and interpretability. This transparency is crucial for industries with stringent compliance requirements, such as finance and healthcare. For instance, a recent study reveals that companies implementing XAI models have achieved a 25% increase in model acceptance rates among stakeholders, leading to more informed decisions. This improvement can significantly impact product strategy and budgeting, as businesses can confidently invest in AI solutions that align with their ethical and regulatory standards.
    Moreover, advancements in LLM architecture include encoder-decoder architectures, multi-head attention, and self-attention layers, which enhance feature extraction and model scalability. These improvements contribute to better performance and more accurate results, making LLMs an essential tool for businesses seeking to optimize their operations and gain a competitive edge. In summary, the market is characterized by continuous innovation and a strong focus on delivering human-centric solutions. The adoption of explainable AI methods and advancements in neural network architecture are just a few examples of how businesses can benefit from these technologies. By investing in Open-Source LLMs, organizations can improve efficiency, enhance decision-making, and maintain a responsible approach to AI implementation.
    

    Unpacking the Open-Source LLM Market Landscape

    In the dynamic landscape of large language models (LLMs), open-source solutions have gained significant traction, offering businesses competitive advantages through data augmentation and few-shot learning capabilities. Compared to traditional models, open-source LLMs enable a 30% reduction in optimizer selection time and a 25% improvement in model accuracy for summarization tasks. Furthermore, distributed training and model compression techniques allow businesses to process larger training dataset sizes with minimal tokenization process disruptions, result
