License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Every major LLM and chatbot released since 2018, with the developing company and the number of parameters (in billions) used in training.
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
I'm currently writing a research paper on AI Detection and its accuracy/effectiveness. While doing so, over the past few months I've generated a large amount of text using various LLMs. This is a dataset/corpus containing all of the data I generated/gathered as well as the text that was generated by various other users.
If you have any questions, please post them on the Discussion page or contact me through Kaggle. Generating all of this took many hours of work and a few hundred dollars; all I ask in return is that you credit me if you find this dataset useful in your research. An upvote would also mean the world.
P.S. The picture is of my dog, Tessa, who passed away recently. I wasn't sure what to put as the picture, so I thought she was better than nothing.
Here are the datasets I used in addition to the text I generated. Please upvote them!
License: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
llm-japanese-dataset
A Japanese instruction (chat) dataset for building LLMs. It is mainly intended for tuning LLMs built primarily on English data (e.g., via LoRA) on chat (instruction) response tasks. Note: we made use of a variety of publicly available language resources, and we take this opportunity to thank everyone involved.
Updates
2023/5/15: Following the Alpaca dataset's license change to NC, we dropped that dataset from this one so it can be used with confidence; the post-drop dataset is available as of v1.0.1. 2024/1/4: We removed outputs consisting only of whitespace from the Wikipedia summary data and updated to a newer Wikipedia version (20240101) (v1.0.2). 2024/1/18: We removed missing outputs from the Asian Language Treebank (ALT) dataset (v1.0.3).… See the full description on the dataset page: https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset.
This is the GPT4-LLM dataset from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. It has been filtered of all OpenAI disclaimers and refusals. (Disclaimer: it may have removed some additional things besides just OAI disclaimers, as I used the following script, which is a bit more broad: https://huggingface.co/datasets/ehartford/WizardLM_alpaca_evol_instruct_70k_unfiltered/blob/main/wizardlm_clean.py) There is a modified script of that in the repo that was used specifically for… See the full description on the dataset page: https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset contains structured test scenarios for evaluating Large Language Models (LLMs) using PromptFoo. It includes bias detection, robustness, hallucination, adversarial attacks, and security vulnerability tests. Some test cases are deliberately incorrect to analyze model resilience and error-handling capabilities. This dataset can be used for automated testing in CI/CD pipelines, model fine-tuning, and prompt optimization workflows.
I created this dataset using gpt-3.5-turbo.
I put a lot of effort into making this dataset high quality, which allows you to achieve the highest score among the publicly available notebooks at the moment! 🥳
Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.
I am now uploading another 6k completely new train examples (6000_train_examples.csv), which brings the total to 6.5k.
If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Combined Medical Vision-Language Dataset
Dataset Description
Comprehensive medical vision-language dataset with 4793 samples for vision-based LLM training.
Dataset Statistics
Total Samples: 4793
Training Samples: 3834
Validation Samples: 959
Modality Distribution
X-ray: 2325 samples
CT: 1351 samples
Unknown: 812 samples
MRI: 231 samples
Ultrasound: 70 samples
Microscopy: 2 samples
Endoscopy: 2 samples
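As a quick consistency check, the modality counts above sum to the stated 4793 total, which also matches the 3834/959 train/validation split:

```python
# Modality counts as listed on the dataset card.
modalities = {"X-ray": 2325, "CT": 1351, "Unknown": 812, "MRI": 231,
              "Ultrasound": 70, "Microscopy": 2, "Endoscopy": 2}

total = sum(modalities.values())
print(total)       # 4793
print(3834 + 959)  # 4793 — train + validation splits agree with the total
```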
Body Part Distribution
Unknown:… See the full description on the dataset page: https://huggingface.co/datasets/robailleo/medical-vision-llm-dataset.
This dataset involves popular large language models (LLMs) that are used for deep learning and the training of artificial intelligence. These LLMs have different uses and data, so I decided to summarize and share information about each LLM. Please give credit to the creators or managers of the LLMs if you decide to use them for any purpose.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task: separating LLM-generated essays from student-written ones.
It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open-source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM-generated text datasets (DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset)
- Synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following:
- Contrastive search
- Guidance scale, typical_p, and suppress_tokens
- High temperature and large values of top-k
- Fill-in-the-blank prompting: randomly masking words in an essay and asking the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content-detection systems and of the obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduced obfuscations
- Back-translation
- Random capitalization
- Sentence swapping
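The lighter character-level augmentations above (random capitalization, character swapping) can be sketched roughly as follows. This is a minimal illustration, not the team's actual pipeline, and the function names are our own:

```python
import random

def random_capitalization(text: str, p: float, seed: int = 0) -> str:
    """Upper-case each character independently with probability p."""
    rng = random.Random(seed)
    return "".join(c.upper() if rng.random() < p else c for c in text)

def swap_adjacent_chars(text: str, p: float, seed: int = 0) -> str:
    """Swap the two middle characters of a word with probability p per word."""
    rng = random.Random(seed)
    out = []
    for word in text.split(" "):
        if len(word) >= 4 and rng.random() < p:
            i = len(word) // 2
            word = word[:i - 1] + word[i] + word[i - 1] + word[i + 1:]
        out.append(word)
    return " ".join(out)

essay = "the electoral college is a process not a place"
print(random_capitalization(essay, p=0.3))
print(swap_adjacent_chars(essay, p=0.5))
```

Seeding the RNG keeps augmentations reproducible across datamix iterations.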
License: CC BY-NC-ND 4.0 (https://creativecommons.org/licenses/by-nc-nd/4.0/)
License information was derived automatically
The dataset contains prompts and texts generated by the Large Language Models (LLMs) in 32 different languages. The prompts are short sentences or phrases for the model to generate text. The texts generated by the LLM are responses to these prompts and can vary in length and complexity.
Researchers and developers can use this dataset to train and fine-tune their own language models for multilingual applications. The dataset provides a rich and diverse collection of outputs from the model, demonstrating its ability to generate coherent and contextually relevant text in multiple languages.
Arabic, Azerbaijani, Catalan, Chinese, Czech, Danish, German, Greek, English, Esperanto, Spanish, Persian, Finnish, French, Irish, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malayalam, Marathi, Dutch, Polish, Portuguese, Portuguese (Brazil), Slovak, Swedish, Thai, Turkish, Ukrainian
The CSV file includes the following columns:
- from_language: language the prompt is written in
- model: model type (GPT-3.5, GPT-4, or an uncensored GPT version)
- time: time when the answer was generated
- text: user prompt
- response: response generated by the model
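Assuming the columns above, the file can be summarized with Python's standard csv module. A sketch only; the filename and helper name are our own:

```python
import csv
from collections import Counter

def count_models(path: str) -> Counter:
    """Count how many responses each model produced in the CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        # DictReader keys rows by the header: from_language, model, time, text, response
        return Counter(row["model"] for row in csv.DictReader(f))

# Example (hypothetical filename):
# print(count_models("multilingual_llm_responses.csv"))
```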
🚀 You can learn more about our high-quality unique datasets here
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by "always try your best"
Released under Apache 2.0
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset Details
“CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal Impacts” is a dataset designed to evaluate the social and cultural variation of Large Language Models (LLMs) towards socially sensitive topics across multiple languages and cultures. The hand-crafted, multilingual dataset of statements addresses value-laden topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy. CIVICS is designed to elicit responses from LLMs… See the full description on the dataset page: https://huggingface.co/datasets/llm-values/CIVICS.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset (specifically the file Mistral7B_CME_v7.csv) consists of 4,900 LLM-generated texts.
(Note: versions 1 to 6 are redundant, and are only kept so as not to break any notebooks that use them)
Update: The new file Mistral7B_CME_v7_15_percent_corruption.csv has also been added as per the discussion "Alternative approach - Simulating hidden dataset".
v1: 700 LLM texts for prompt 6 "*Exploring Venus*" for use in the LLM - Detect AI Generated Text competition.
v2: + 700 LLM texts for prompt 8 "*The Face on Mars*"
v3: + 700 LLM texts for prompt 4 "*A Cowboy Who Rode the Waves*"
v4: + 700 LLM texts for prompt 11 "*Driverless cars*"
v5: + 700 LLM texts for prompt 7 "*Facial action coding system*"
v6: + 700 LLM texts for prompt 2 "*Car-free cities*"
v7: + 700 LLM texts for prompt 12 "*Does the electoral college work?*"
Photo credit: Image of Venus by NASA.
priamai/cti-llm-datasets dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset is meant for use in the LLM - Detect AI Generated Text competition. It is derived from https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset/
It has 12% typos introduced into both human and LLM essays, using an incorrect pyspell function that treats a letter following an apostrophe as a typo.
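The corruption step described above can be approximated like this. A rough sketch of the idea only, not the author's exact pyspell-based procedure; the helper below simply swaps a word's first two letters at the stated 12% rate:

```python
import random

def introduce_typos(text: str, rate: float = 0.12, seed: int = 42) -> str:
    """Introduce a simple typo (swap the first two letters) into
    roughly `rate` of the words, reproducibly via a seeded RNG."""
    rng = random.Random(seed)
    words = []
    for w in text.split(" "):
        if len(w) >= 2 and rng.random() < rate:
            w = w[1] + w[0] + w[2:]
        words.append(w)
    return " ".join(words)

print(introduce_typos("cars should be banned from city centers"))
```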
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
## Overview
Dual LLM is a dataset for object detection tasks - it contains Person annotations for 1,234 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
About 10,000 texts rewritten with Gemma 7B-it; the original texts come from the "Support" column of the train.csv file in the SciQ (Scientific Question Answering) dataset.
If you find it useful, please upvote!
santhoshkammari/arxiv-llm-papers-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Privacy notice: https://www.technavio.com/content/privacy-notice
Open-Source LLM Market Size 2025-2029
The open-source LLM market is projected to grow by USD 54 billion, at a CAGR of 33.7%, from 2024 to 2029. Increasing democratization and compelling economics will drive the open-source LLM market.
Market Insights
North America dominated the market and is expected to account for 37% of growth during 2025-2029.
By Application - Technology and software segment was valued at USD 4.02 billion in 2023
By Deployment - On-premises segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 575.60 million
Market Future Opportunities 2024: USD 53,995.50 million
CAGR from 2024 to 2029: 33.7%
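As a rough sanity check on the figures above, a 33.7% CAGR compounded over the five years from 2024 to 2029 implies the market grows by a factor of about (1.337)^5 ≈ 4.27:

```python
# Compound annual growth rate check for the stated 2024-2029 forecast.
cagr = 0.337
years = 5
multiplier = (1 + cagr) ** years
print(f"implied growth multiplier: {multiplier:.2f}x")  # ≈ 4.27x
```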
Market Summary
The Open-Source Large Language Model (LLM) market has experienced significant growth due to the increasing democratization of artificial intelligence (AI) technology and its compelling economics. This global trend is driven by the proliferation of smaller organizations seeking to leverage advanced language models for various applications, including supply chain optimization, compliance, and operational efficiency. Open-source LLMs offer several advantages over proprietary models. They provide greater flexibility, as users can modify and adapt the models to their specific needs. Additionally, open-source models often have larger training datasets, leading to improved performance and accuracy. However, there are challenges to implementing open-source LLMs, such as the prohibitive computational costs and critical hardware dependency. These obstacles necessitate the development of more efficient algorithms and the exploration of cloud computing solutions.
A real-world business scenario illustrates the potential benefits of open-source LLMs. A manufacturing company aims to optimize its supply chain by implementing an AI-powered system to analyze customer demand patterns and predict inventory needs. The company chooses an open-source LLM due to its flexibility and cost-effectiveness. By integrating the LLM into its supply chain management system, the company can improve forecasting accuracy and reduce inventory costs, ultimately increasing operational efficiency and customer satisfaction. Despite the challenges, the market continues to grow as organizations recognize the potential benefits of advanced language models. The democratization of AI technology and the compelling economics of open-source solutions make them an attractive option for businesses of all sizes.
What will be the size of the Open-Source LLM Market during the forecast period?
Get Key Insights on Market Forecast (PDF) Request Free Sample
The Open-Source Large Language Model (LLM) Market continues to evolve, offering businesses innovative solutions for various applications. One notable trend is the increasing adoption of explainable AI (XAI) methods in LLMs. XAI models provide transparency into the reasoning behind their outputs, addressing concerns around bias mitigation and interpretability. This transparency is crucial for industries with stringent compliance requirements, such as finance and healthcare. For instance, a recent study reveals that companies implementing XAI models have achieved a 25% increase in model acceptance rates among stakeholders, leading to more informed decisions. This improvement can significantly impact product strategy and budgeting, as businesses can confidently invest in AI solutions that align with their ethical and regulatory standards.
Moreover, advancements in LLM architecture include encoder-decoder architectures, multi-head attention, and self-attention layers, which enhance feature extraction and model scalability. These improvements contribute to better performance and more accurate results, making LLMs an essential tool for businesses seeking to optimize their operations and gain a competitive edge. In summary, the market is characterized by continuous innovation and a strong focus on delivering human-centric solutions. The adoption of explainable AI methods and advancements in neural network architecture are just a few examples of how businesses can benefit from these technologies. By investing in Open-Source LLMs, organizations can improve efficiency, enhance decision-making, and maintain a responsible approach to AI implementation.
Unpacking the Open-Source LLM Market Landscape
In the dynamic landscape of large language models (LLMs), open-source solutions have gained significant traction, offering businesses competitive advantages through data augmentation and few-shot learning capabilities. Compared to traditional models, open-source LLMs enable a 30% reduction in optimizer selection time and a 25% improvement in model accuracy for summarization tasks. Furthermore, distributed training and model compression techniques allow businesses to process larger training dataset sizes with minimal tokenization process disruptions, result…