Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Every major LLM and chatbot released since 2018 with company and no. of billion parameters used to train.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
🤖 We curated this dataset for NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1GPU + 1Day. 🚀 Our Birbal-7B-V1 fine-tuned on this dataset achieved 🏆 first rank 🏆 in the competition.
Here is high-level diagram of our data preparation strategy:
Natural Instructions Dataset Preparation
Natural Instructionsdataset is a community effort to create a large collection of tasks and their natural language definitions/instructions. As show in above diagram, we sample from… See the full description on the dataset page: https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data.
Facebook
Twitterhttps://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
According to our latest research, the global Data Lineage for LLM Training market size reached USD 1.29 billion in 2024, with an impressive compound annual growth rate (CAGR) of 21.8% expected through the forecast period. By 2033, the market is projected to grow to USD 8.93 billion, as organizations worldwide recognize the critical importance of robust data lineage solutions in ensuring transparency, compliance, and efficiency in large language model (LLM) training. The primary growth driver stems from the surging adoption of generative AI and LLMs across diverse industries, necessitating advanced data lineage capabilities for responsible and auditable AI development.
The exponential growth of the Data Lineage for LLM Training market is fundamentally driven by the increasing complexity and scale of data used in training modern AI models. As organizations deploy LLMs for a wide array of applications—from customer service automation to advanced analytics—the need for precise tracking of data provenance, transformation, and usage has become paramount. This trend is further amplified by the proliferation of multi-source and multi-format data, which significantly complicates the process of tracing data origins and transformations. Enterprises are investing heavily in data lineage solutions to ensure that their AI models are trained on high-quality, compliant, and auditable datasets, thereby reducing risks associated with data bias, inconsistency, and regulatory violations.
Another significant growth factor is the evolving regulatory landscape surrounding AI and data governance. Governments and regulatory bodies worldwide are introducing stringent guidelines for data usage, privacy, and accountability in AI systems. Regulations such as the European Union’s AI Act and the U.S. AI Bill of Rights are compelling organizations to implement comprehensive data lineage practices to demonstrate compliance and mitigate legal risks. This regulatory pressure is particularly pronounced in highly regulated industries such as banking, healthcare, and government, where the consequences of non-compliance can be financially and reputationally devastating. As a result, the demand for advanced data lineage software and services is surging, driving market expansion.
Technological advancements in data management platforms and the integration of AI-driven automation are further catalyzing the growth of the Data Lineage for LLM Training market. Modern data lineage tools now leverage machine learning and natural language processing to automatically map data flows, detect anomalies, and generate real-time lineage reports. These innovations drastically reduce the manual effort required for lineage documentation and enhance the scalability of lineage solutions across large and complex data environments. The continuous evolution of such technologies is enabling organizations to achieve higher levels of transparency, trust, and operational efficiency in their AI workflows, thereby fueling market growth.
Regionally, North America dominates the Data Lineage for LLM Training market, accounting for over 42% of the global market share in 2024. This dominance is attributed to the early adoption of AI technologies, the presence of leading technology vendors, and a mature regulatory environment. Europe follows closely, driven by strict data governance regulations and a rapidly growing AI ecosystem. The Asia Pacific region is witnessing the fastest growth, with a projected CAGR of 24.6% through 2033, fueled by digital transformation initiatives, increased AI investments, and a burgeoning startup landscape. Latin America and the Middle East & Africa are also emerging as promising markets, albeit at a relatively nascent stage.
The Data Lineage for LLM Training market is segmented by component into software and services, each playing a pivotal role in supporting organizations’ lineage initiatives. The software segment holds the largest market share, accounting for nearly 68% of the total market revenue in 2024. This dominance is primarily due to the widespread adoption of advanced data lineage platforms that offer features such as automated lineage mapping, visualization, impact analysis, and integration with existing data management and AI training workflows. These platforms are essential for organ
Facebook
TwitterAttribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Character-LLM: A Trainable Agent for Role-Playing
This is the training datasets for Character-LLM, which contains nine characters experience data used to train Character-LLMs. To download the dataset, please run the following code with Python, and you can find the downloaded data in /path/to/local_dir. from huggingface_hub import snapshot_download snapshot_download( local_dir_use_symlinks=True, repo_type="dataset", repo_id="fnlp/character-llm-data"… See the full description on the dataset page: https://huggingface.co/datasets/OpenMOSS-Team/character-llm-data.
Facebook
TwitterOverview
Volume: 2 Millions
Data use: Instruction-Following Evaluation for LLM
Data content: A variety of complex prompt instructions, between 50 and 400 words, with no fewer than 3 constraints in each prompt
Production method: All prompt are manually written to satisfy the diversity of coverage
Language: English, Korean, French, German, Spanish, Russian, Italian, Dutch, Polish, Portuguese, Japanese, Indonesian, Vietnamese
About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 3 million hours of Speech Data and 800TB of Computer Vision Data. These ready-to-go Machine Learning (ML) Data support instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Divyang Mandal
Released under CC0: Public Domain
Facebook
Twittergagan3012/long-llm-data dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
TwitterImage Description Data Data Size: 500 million pairs Image Type: generic scene(portrait, landscapes, animals,etc), human action, picture book, magazine, PPT&chart, App screenshot, and etc. Resolution: 4K+ Description Language: English, Spanish, Portuguese, French, Korean, German, Chinese, Japanese Description Length: text length is no less than 250 words Format: the image format is .jpg, the annotation format is .json, and the description format is .txt
Video Description Data Data Size: 10 million pairs Image Type: generic scene(portrait, landscapes, animals,etc), ads, TV sports, documentaries Resolution: 1080p+ Description Language: English, Spanish, Portuguese, French, Korean, German, Chinese, Japanese Description Length: text length is no less than 250 words Format: .mp4,.mov,.avi and other common formats;.xlsx (annotation file format)
About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 3 million hours of Speech Data and 800TB of Computer Vision Data. These ready-to-go data supports instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Hemanthh Velliyangirie
Released under Apache 2.0
Facebook
TwitterLarge Language Model (LLM) evaluation is currently one of the most important areas of research, with existing benchmarks proving to be insufficient and not completely representative of LLMs' various capabilities. We present a curated collection of challenging statements on sensitive topics for LLM benchmarking called TruthEval. These statements were curated by hand and contain known truth values. The categories were chosen to distinguish LLMs' abilities from their stochastic nature. Details of collection method and use cases can be found in this paper: TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes the Extensive Tabular Format generated as part of a literature review. The table provides a structured, static representation of the data. A preliminary version of the review is currently available on arXiv : A Review of LLM-Assisted Ideation.For an interactive version, please visit the Online Format at: Notion Link.We hope this will inform and help future reviews and research in this area.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for "alpaca-gpt4-data"
All of the work is done by this team.
Usage and License Notices
The data is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.
Chinese Dataset
Found here
Citation
@article{peng2023gpt4llm, title={Instruction Tuning with GPT-4}, author={Baolin Peng, Chunyuan Li… See the full description on the dataset page: https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data.
Facebook
TwitterOveriview Off-the-shelf 50 Million pre-training text data, covering test question, textbook, ebooks, journal and papers, multi-round dialog text and etc.
About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 1 million hours of Audio Data and 800TB of Annotated Imagery Data. These ready-to-go data supports instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade
Facebook
TwitterFor the high-quality training data required in unsupervised learning and supervised learning, Nexdata provides flexible and customized Large Language Model(LLM) Data Data annotation services for tasks such as supervised fine-tuning (SFT) , and reinforcement learning from human feedback (RLHF).
Facebook
TwitterA June 2025 study found that ****** was the most frequently cited web domain by large language models (LLMs). The platform was referenced in approximately ** percent of the analyzed cases, likely due to the content licensing agreement between Google and Reddit in early 2024 for the purpose of AI models training. ********* ranked second, being mentioned in roughly ** percent of the times, while ****** and ******* were mentioned ** percent.
Facebook
Twitter300 Million Pairs of High-Quality Image-Caption Dataset includes a large-scale collection of photographic and vector images paired with English textual descriptions. The complete image library comprises nearly 300 million images, with a curated subset of 100 million high-quality image-caption pairs available for generative AI and vision-language model training. All images are authentic and legally licensed works created by professional photographers. The dataset primarily features English captions with minimal Chinese, offering diverse scenes, objects, and compositions suitable for tasks such as image captioning, visual question answering (VQA), image-text retrieval, and multimodal foundation model pretraining. The dataset supports large-scale LLM and VLM applications and complies with global data privacy and copyright regulations, including GDPR, CCPA, and PIPL.
Facebook
TwitterI created this dataset using gpt-3.5-turbo.
I put a lot of effort into making this dataset high quality which allows you to achieve the highest score among the publicly available notebooks available at the moment! 🥳
Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.
I am now uploading another 6k (6000_train_examples.csv) completely new train examples which brings the total to 6.5k.
If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏
Facebook
Twitterhttps://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy
The Data Annotation Tools Market size was valued at USD 1.31 billion in 2023 and is projected to reach USD 6.72 billion by 2032, exhibiting a CAGR of 26.3 % during the forecasts period. Recent developments include: In November 2023, Appen Limited, a high-quality data provider for the AI lifecycle, chose Amazon Web Services (AWS) as its primary cloud for AI solutions and innovation. As Appen utilizes additional enterprise solutions for AI data source, annotation, and model validation, the firms are expanding their collaboration with a multi-year deal. Appen is strengthening its AI data platform, which serves as the bridge between people and AI, by integrating cutting-edge AWS services. , In September 2023, Labelbox launched Large Language Model (LLM) solution to assist organizations in innovating with generative AI and deepen the partnership with Google Cloud. With the introduction of large language models (LLMs), enterprises now have a plethora of chances to generate new competitive advantages and commercial value. LLM systems have the ability to revolutionize a wide range of intelligent applications; nevertheless, in many cases, organizations will need to adjust or finetune LLMs in order to align with human preferences. Labelbox, as part of an expanded cooperation, is leveraging Google Cloud's generative AI capabilities to assist organizations in developing LLM solutions with Vertex AI. Labelbox's AI platform will be integrated with Google Cloud's leading AI and Data Cloud tools, including Vertex AI and Google Cloud's Model Garden repository, allowing ML teams to access cutting-edge machine learning (ML) models for vision and natural language processing (NLP) and automate key workflows. , In March 2023, has released the most recent version of Enlitic Curie, a platform aimed at improving radiology department workflow. This platform includes Curie|ENDEX, which uses natural language processing and computer vision to analyze and process medical images, and Curie|ENCOG, which uses artificial intelligence to detect and protect medical images in Health Information Security. , In November 2022, Appen Limited, a global leader in data for the AI Lifecycle, announced its partnership with CLEAR Global, a nonprofit organization dedicated to ensuring access to essential information and amplifying voices across languages. This collaboration aims to develop a speech-based healthcare FAQ bot tailored for Sheng, a Nairobi slang language. .
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Heba Mohamed
Released under CC0: Public Domain
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Grade-Aware LLM Training Dataset
Dataset Description
This dataset contains 1,107,690 high-quality instruction-tuning examples for grade-aware text simplification, designed for fine-tuning large language models to simplify text to specific reading grade levels with precision and semantic consistency.
Dataset Summary
Total Examples: 1,107,690 Task: Text simplification with precise grade-level targeting Language: English Grade Range: 1-12+ (precise 2-decimal… See the full description on the dataset page: https://huggingface.co/datasets/yimingwang123/grade-aware-llm-training-data.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Every major LLM and chatbot released since 2018 with company and no. of billion parameters used to train.