100+ datasets found
  1. LLMs Data (2018-2024)

    • kaggle.com
    zip
    Updated May 19, 2024
    Cite
    jaina (2024). LLMs Data (2018-2024) [Dataset]. https://www.kaggle.com/datasets/jainaru/llms-data-2018-2024
    Explore at:
    Available download formats: zip (23351 bytes)
    Dataset updated
    May 19, 2024
    Authors
    jaina
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Every major LLM and chatbot released since 2018, with the developing company and the number of parameters (in billions) used in training.

    Data Columns

    1. Model: The name of the language model.
    2. Company: The company that developed the model.
    3. Arch: The architecture of the model (e.g., Transformer, RNN). TBA means To Be Announced.
    4. Parameters: The number of parameters (weights) in the model, in billions; a rough measure of its complexity.
    5. Tokens: The number of tokens (sub-word units) the model can process or was trained on, in billions. Some values are TBA.
    6. Ratio: Likely the ratio of parameters to tokens, or some other relevant ratio. In this table it is specified only for Olympus, as 20:01.
    7. ALScore: A quick-and-dirty rating of the model's power, computed as the square root of (Parameters x Tokens); see the sketch after this list.
    8. Training dataset: The dataset used to train the model.
    9. Release Date: The expected or actual release date of the model.
    10. Notes: Additional notes about the model, such as training details or related information.
    11. Playground: A URL linking to a website where you can interact with the model or find more information about it.
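    ALScore can be recomputed directly from the Parameters and Tokens columns. Below is a minimal Python sketch, assuming the CSV extracted from the Kaggle zip uses the column names listed above; the filename is a placeholder:

        import pandas as pd

        # Placeholder filename; adjust to the CSV extracted from the Kaggle zip.
        df = pd.read_csv("llms_data_2018_2024.csv")

        # ALScore = square root of (Parameters x Tokens), both in billions.
        # TBA entries will not parse as numbers, so coerce them to NaN and skip.
        params = pd.to_numeric(df["Parameters"], errors="coerce")
        tokens = pd.to_numeric(df["Tokens"], errors="coerce")
        df["ALScore_recomputed"] = (params * tokens).pow(0.5)

        print(df[["Model", "ALScore_recomputed"]].head())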
  2. NeurIPS-LLM-data

    • huggingface.co
    Updated Mar 4, 2024
    Cite
    Upaya (2024). NeurIPS-LLM-data [Dataset]. https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 4, 2024
    Dataset authored and provided by
    Upaya
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🤖 We curated this dataset for the NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day. 🚀 Our Birbal-7B-V1, fine-tuned on this dataset, achieved 🏆 first rank 🏆 in the competition.

    Here is a high-level diagram of our data preparation strategy:

    [Diagram: Natural Instructions Dataset Preparation]

    The Natural Instructions dataset is a community effort to create a large collection of tasks and their natural-language definitions/instructions. As shown in the diagram above, we sample from… See the full description on the dataset page: https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data.

  3. Data Lineage For LLM Training Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Data Lineage For LLM Training Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-lineage-for-llm-training-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Lineage for LLM Training Market Outlook




    According to our latest research, the global Data Lineage for LLM Training market size reached USD 1.29 billion in 2024, with an impressive compound annual growth rate (CAGR) of 21.8% expected through the forecast period. By 2033, the market is projected to grow to USD 8.93 billion, as organizations worldwide recognize the critical importance of robust data lineage solutions in ensuring transparency, compliance, and efficiency in large language model (LLM) training. The primary growth driver stems from the surging adoption of generative AI and LLMs across diverse industries, necessitating advanced data lineage capabilities for responsible and auditable AI development.




    The exponential growth of the Data Lineage for LLM Training market is fundamentally driven by the increasing complexity and scale of data used in training modern AI models. As organizations deploy LLMs for a wide array of applications—from customer service automation to advanced analytics—the need for precise tracking of data provenance, transformation, and usage has become paramount. This trend is further amplified by the proliferation of multi-source and multi-format data, which significantly complicates the process of tracing data origins and transformations. Enterprises are investing heavily in data lineage solutions to ensure that their AI models are trained on high-quality, compliant, and auditable datasets, thereby reducing risks associated with data bias, inconsistency, and regulatory violations.




    Another significant growth factor is the evolving regulatory landscape surrounding AI and data governance. Governments and regulatory bodies worldwide are introducing stringent guidelines for data usage, privacy, and accountability in AI systems. Regulations such as the European Union’s AI Act and the U.S. AI Bill of Rights are compelling organizations to implement comprehensive data lineage practices to demonstrate compliance and mitigate legal risks. This regulatory pressure is particularly pronounced in highly regulated industries such as banking, healthcare, and government, where the consequences of non-compliance can be financially and reputationally devastating. As a result, the demand for advanced data lineage software and services is surging, driving market expansion.




    Technological advancements in data management platforms and the integration of AI-driven automation are further catalyzing the growth of the Data Lineage for LLM Training market. Modern data lineage tools now leverage machine learning and natural language processing to automatically map data flows, detect anomalies, and generate real-time lineage reports. These innovations drastically reduce the manual effort required for lineage documentation and enhance the scalability of lineage solutions across large and complex data environments. The continuous evolution of such technologies is enabling organizations to achieve higher levels of transparency, trust, and operational efficiency in their AI workflows, thereby fueling market growth.




    Regionally, North America dominates the Data Lineage for LLM Training market, accounting for over 42% of the global market share in 2024. This dominance is attributed to the early adoption of AI technologies, the presence of leading technology vendors, and a mature regulatory environment. Europe follows closely, driven by strict data governance regulations and a rapidly growing AI ecosystem. The Asia Pacific region is witnessing the fastest growth, with a projected CAGR of 24.6% through 2033, fueled by digital transformation initiatives, increased AI investments, and a burgeoning startup landscape. Latin America and the Middle East & Africa are also emerging as promising markets, albeit at a relatively nascent stage.



    Component Analysis




    The Data Lineage for LLM Training market is segmented by component into software and services, each playing a pivotal role in supporting organizations’ lineage initiatives. The software segment holds the largest market share, accounting for nearly 68% of the total market revenue in 2024. This dominance is primarily due to the widespread adoption of advanced data lineage platforms that offer features such as automated lineage mapping, visualization, impact analysis, and integration with existing data management and AI training workflows. These platforms are essential for organizations…

  4. character-llm-data

    • huggingface.co
    Updated Jun 8, 2024
    Cite
    OpenMOSS (2024). character-llm-data [Dataset]. https://huggingface.co/datasets/OpenMOSS-Team/character-llm-data
    Explore at:
    Dataset updated
    Jun 8, 2024
    Dataset authored and provided by
    OpenMOSS
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Character-LLM: A Trainable Agent for Role-Playing

    This is the training data for Character-LLM, containing the experience data of nine characters used to train Character-LLMs. To download the dataset, run the snapshot_download call from the dataset card (quoted there in truncated form and completed in the sketch below); you can find the downloaded data in /path/to/local_dir. … See the full description on the dataset page: https://huggingface.co/datasets/OpenMOSS-Team/character-llm-data.
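    A minimal completed sketch of that call, assuming the truncated snippet on the card ends as a standard snapshot_download invocation; the local_dir value is a placeholder taken from the description:

        from huggingface_hub import snapshot_download

        snapshot_download(
            repo_id="fnlp/character-llm-data",
            repo_type="dataset",
            local_dir="/path/to/local_dir",  # placeholder path from the description
            local_dir_use_symlinks=True,
        )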

  5. Fine-Tuning Text Data | 2 Millions | User Generated Text |Foundation Model |...

    • datarade.ai
    Updated Feb 12, 2025
    Cite
    Nexdata (2025). Fine-Tuning Text Data | 2 Millions | User Generated Text |Foundation Model | SFT Data | Large Language Model(LLM) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-fine-tuning-text-data-2-millions-f-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Feb 12, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Ecuador, Austria, Spain, Chile, United States of America, South Africa, Egypt, Japan, Tunisia, Indonesia
    Description
    1. Overview
      Volume: 2 million
      Data use: instruction-following evaluation for LLMs
      Data content: a variety of complex prompt instructions, between 50 and 400 words, with no fewer than 3 constraints in each prompt
      Production method: all prompts are manually written to ensure diversity of coverage
      Language: English, Korean, French, German, Spanish, Russian, Italian, Dutch, Polish, Portuguese, Japanese, Indonesian, Vietnamese

    2. About Nexdata: Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 3 million hours of speech data, and 800TB of computer vision data. These ready-to-go machine learning (ML) data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade

  6. Data Science QnA - LLM Fine-tuning

    • kaggle.com
    zip
    Updated Feb 26, 2024
    Cite
    Divyang Mandal (2024). Data Science QnA - LLM Fine-tuning [Dataset]. https://www.kaggle.com/datasets/divyangmandal/data-science-qna-llm-fine-tuning
    Explore at:
    Available download formats: zip (83058 bytes)
    Dataset updated
    Feb 26, 2024
    Authors
    Divyang Mandal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Divyang Mandal

    Released under CC0: Public Domain

    Contents

  7. long-llm-data

    • huggingface.co
    Updated May 1, 2024
    + more versions
    Cite
    Gagan Bhatia (2024). long-llm-data [Dataset]. https://huggingface.co/datasets/gagan3012/long-llm-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 1, 2024
    Authors
    Gagan Bhatia
    Description

    The gagan3012/long-llm-data dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  8. Image and Video Description Data | 1 PB | Multimodal Data | GenAI Data| LLM...

    • datarade.ai
    Updated Jan 3, 2025
    Cite
    Nexdata (2025). Image and Video Description Data | 1 PB | Multimodal Data | GenAI Data| LLM Data | Large Language Model(LLM) Data [Dataset]. https://datarade.ai/data-products/nexdata-image-and-video-description-data-1-pb-multimoda-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Canada, Israel, Mexico, Belgium, Ecuador, Malta, Czech Republic, United Arab Emirates, Netherlands, Finland
    Description
    1. Image Description Data
      Data size: 500 million pairs
      Image type: generic scenes (portraits, landscapes, animals, etc.), human actions, picture books, magazines, PPTs & charts, app screenshots, etc.
      Resolution: 4K+
      Description language: English, Spanish, Portuguese, French, Korean, German, Chinese, Japanese
      Description length: text length is no less than 250 words
      Format: the image format is .jpg, the annotation format is .json, and the description format is .txt

    2. Video Description Data
      Data size: 10 million pairs
      Video type: generic scenes (portraits, landscapes, animals, etc.), ads, TV sports, documentaries
      Resolution: 1080p+
      Description language: English, Spanish, Portuguese, French, Korean, German, Chinese, Japanese
      Description length: text length is no less than 250 words
      Format: .mp4, .mov, .avi, and other common formats; .xlsx (annotation file format)

    3. About Nexdata: Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 3 million hours of speech data, and 800TB of computer vision data. These ready-to-go data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade

  9. Sample-Training-Data-LLM

    • kaggle.com
    zip
    Updated May 4, 2024
    Cite
    Hemanthh Velliyangirie (2024). Sample-Training-Data-LLM [Dataset]. https://www.kaggle.com/datasets/hemanthhvv/sample-training-data-llm
    Explore at:
    Available download formats: zip (2164 bytes)
    Dataset updated
    May 4, 2024
    Authors
    Hemanthh Velliyangirie
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Hemanthh Velliyangirie

    Released under Apache 2.0

    Contents

  10. Data from: TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability...

    • datasetcatalog.nlm.nih.gov
    • borealisdata.ca
    Updated Jul 30, 2024
    Cite
    Khatun, Aisha; Brown, Dan (2024). TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability [Dataset]. http://doi.org/10.5683/SP3/5MZWBV
    Explore at:
    Dataset updated
    Jul 30, 2024
    Authors
    Khatun, Aisha; Brown, Dan
    Description

    Large Language Model (LLM) evaluation is currently one of the most important areas of research, with existing benchmarks proving to be insufficient and not completely representative of LLMs' various capabilities. We present a curated collection of challenging statements on sensitive topics for LLM benchmarking called TruthEval. These statements were curated by hand and contain known truth values. The categories were chosen to distinguish LLMs' abilities from their stochastic nature. Details of the collection method and use cases can be found in this paper: TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability

  11. A Review of LLM-Assisted Ideation - Review Data

    • figshare.com
    zip
    Updated Mar 4, 2025
    Cite
    Sitong Li; Stefano Padilla; Pierre Le Bras; Junyu Dong; Mike Chantler (2025). A Review of LLM-Assisted Ideation - Review Data [Dataset]. http://doi.org/10.6084/m9.figshare.28440182.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sitong Li; Stefano Padilla; Pierre Le Bras; Junyu Dong; Mike Chantler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes the Extensive Tabular Format generated as part of a literature review. The table provides a structured, static representation of the data. A preliminary version of the review is currently available on arXiv: A Review of LLM-Assisted Ideation. For an interactive version, please visit the Online Format at: Notion Link. We hope this will inform and help future reviews and research in this area.

  12. alpaca-gpt4-data

    • huggingface.co
    Updated Apr 10, 2023
    + more versions
    Cite
    Chris Alexiuk (2023). alpaca-gpt4-data [Dataset]. https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 10, 2023
    Authors
    Chris Alexiuk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for "alpaca-gpt4-data"

    All of the work is done by this team.

      Usage and License Notices
    

    The data is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.

      Chinese Dataset
    

    Found here

      Citation
    

    @article{peng2023gpt4llm, title={Instruction Tuning with GPT-4}, author={Baolin Peng, Chunyuan Li… See the full description on the dataset page: https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data.

  13. Pre-training Text Data | 50 Millions | Unsupervised Text Data | Large...

    • datarade.ai
    Updated Jan 3, 2025
    Cite
    Nexdata (2025). Pre-training Text Data | 50 Millions | Unsupervised Text Data | Large Language Model(LLM) Data [Dataset]. https://datarade.ai/data-products/nexdata-unsupervised-text-data-1-pb-foundation-model-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    France, Mexico, United Kingdom, Germany, Philippines, Taiwan, Spain, Korea (Republic of), China, Malaysia
    Description
    1. Overview: Off-the-shelf 50 million pre-training text data, covering test questions, textbooks, ebooks, journals and papers, multi-round dialog text, etc.

    2. About Nexdata: Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data, and 800TB of annotated imagery data. These ready-to-go data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade

  14. Foundation Model Data Collection and Data Annotation | Large Language...

    • data.nexdata.ai
    Updated Aug 15, 2024
    Cite
    Nexdata (2024). Foundation Model Data Collection and Data Annotation | Large Language Model(LLM) Data | SFT Data| Red Teaming Services [Dataset]. https://data.nexdata.ai/products/nexdata-foundation-model-data-solutions-llm-sft-rhlf-nexdata
    Explore at:
    Dataset updated
    Aug 15, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Estonia, Nepal, Lebanon, Denmark, Costa Rica, Iran, Grenada, Barbados, Pakistan, Croatia
    Description

    For the high-quality training data required in unsupervised and supervised learning, Nexdata provides flexible and customized Large Language Model (LLM) data annotation services for tasks such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

  15. Top web domains cited by LLMs 2025

    • statista.com
    Updated Jun 29, 2025
    Cite
    Statista (2025). Top web domains cited by LLMs 2025 [Dataset]. https://www.statista.com/statistics/1620335/top-web-domains-cited-by-llms/
    Explore at:
    Dataset updated
    Jun 29, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jun 2025
    Area covered
    Worldwide
    Description

    A June 2025 study found that ****** was the most frequently cited web domain by large language models (LLMs). The platform was referenced in approximately ** percent of the analyzed cases, likely due to the content licensing agreement between Google and Reddit in early 2024 for the purpose of AI model training. ********* ranked second, mentioned in roughly ** percent of cases, while ****** and ******* were mentioned in ** percent.

  16. 300M Image-Caption Pairs – Large-Scale Vision-Language Dataset for AI...

    • m.nexdata.ai
    • nexdata.ai
    Updated Jan 30, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2025). 300M Image-Caption Pairs – Large-Scale Vision-Language Dataset for AI Training [Dataset]. https://m.nexdata.ai/datasets/llm/1451?source=Github
    Explore at:
    Dataset updated
    Jan 30, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Data size, Data types, Data content, Data formats, Data resolution, Description languages
    Description

    300 Million Pairs of High-Quality Image-Caption Dataset includes a large-scale collection of photographic and vector images paired with English textual descriptions. The complete image library comprises nearly 300 million images, with a curated subset of 100 million high-quality image-caption pairs available for generative AI and vision-language model training. All images are authentic and legally licensed works created by professional photographers. The dataset primarily features English captions with minimal Chinese, offering diverse scenes, objects, and compositions suitable for tasks such as image captioning, visual question answering (VQA), image-text retrieval, and multimodal foundation model pretraining. The dataset supports large-scale LLM and VLM applications and complies with global data privacy and copyright regulations, including GDPR, CCPA, and PIPL.

  17. 📊 6.5k train examples for LLM Science Exam 📝

    • kaggle.com
    Updated Jul 22, 2023
    Cite
    Radek Osmulski (2023). 📊 6.5k train examples for LLM Science Exam 📝 [Dataset]. https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Radek Osmulski
    Description

    I created this dataset using gpt-3.5-turbo.

    I put a lot of effort into making this dataset high quality, which allows you to achieve the highest score among the publicly available notebooks at the moment! 🥳

    Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.

    I am now uploading another 6k (6000_train_examples.csv) completely new train examples which brings the total to 6.5k.

    If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏

  18. Data Annotation Tools Market Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 18, 2025
    Cite
    Archive Market Research (2025). Data Annotation Tools Market Report [Dataset]. https://www.archivemarketresearch.com/reports/data-annotation-tools-market-4890
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Feb 18, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    global
    Variables measured
    Market Size
    Description

    The Data Annotation Tools Market size was valued at USD 1.31 billion in 2023 and is projected to reach USD 6.72 billion by 2032, exhibiting a CAGR of 26.3% during the forecast period. Recent developments include:

    In November 2023, Appen Limited, a high-quality data provider for the AI lifecycle, chose Amazon Web Services (AWS) as its primary cloud for AI solutions and innovation. As Appen utilizes additional enterprise solutions for AI data sourcing, annotation, and model validation, the firms are expanding their collaboration with a multi-year deal. Appen is strengthening its AI data platform, which serves as the bridge between people and AI, by integrating cutting-edge AWS services.

    In September 2023, Labelbox launched a Large Language Model (LLM) solution to assist organizations in innovating with generative AI and deepened its partnership with Google Cloud. With the introduction of large language models (LLMs), enterprises now have a plethora of chances to generate new competitive advantages and commercial value. LLM systems have the ability to revolutionize a wide range of intelligent applications; nevertheless, in many cases organizations will need to adjust or fine-tune LLMs in order to align with human preferences. As part of the expanded cooperation, Labelbox is leveraging Google Cloud's generative AI capabilities to assist organizations in developing LLM solutions with Vertex AI. Labelbox's AI platform will be integrated with Google Cloud's leading AI and Data Cloud tools, including Vertex AI and Google Cloud's Model Garden repository, allowing ML teams to access cutting-edge machine learning (ML) models for vision and natural language processing (NLP) and automate key workflows.

    In March 2023, Enlitic released the most recent version of Enlitic Curie, a platform aimed at improving radiology department workflow. The platform includes Curie|ENDEX, which uses natural language processing and computer vision to analyze and process medical images, and Curie|ENCOG, which uses artificial intelligence to detect and protect medical images in health information security.

    In November 2022, Appen Limited, a global leader in data for the AI lifecycle, announced its partnership with CLEAR Global, a nonprofit organization dedicated to ensuring access to essential information and amplifying voices across languages. This collaboration aims to develop a speech-based healthcare FAQ bot tailored for Sheng, a Nairobi slang language.

  19. Real Estate Data For LLM Fine-Tuning

    • kaggle.com
    zip
    Updated May 7, 2025
    Cite
    Heba Mohamed (2025). Real Estate Data For LLM Fine-Tuning [Dataset]. https://www.kaggle.com/datasets/hebamo7amed/real-estate-data-for-llm-fine-tuning
    Explore at:
    Available download formats: zip (122810841 bytes)
    Dataset updated
    May 7, 2025
    Authors
    Heba Mohamed
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Heba Mohamed

    Released under CC0: Public Domain

    Contents

  20. grade-aware-llm-training-data

    • huggingface.co
    Cite
    Yiming Wang, grade-aware-llm-training-data [Dataset]. https://huggingface.co/datasets/yimingwang123/grade-aware-llm-training-data
    Explore at:
    Authors
    Yiming Wang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Grade-Aware LLM Training Dataset

      Dataset Description
    

    This dataset contains 1,107,690 high-quality instruction-tuning examples for grade-aware text simplification, designed for fine-tuning large language models to simplify text to specific reading grade levels with precision and semantic consistency.

      Dataset Summary
    

    Total Examples: 1,107,690
    Task: Text simplification with precise grade-level targeting
    Language: English
    Grade Range: 1-12+ (precise 2-decimal… See the full description on the dataset page: https://huggingface.co/datasets/yimingwang123/grade-aware-llm-training-data.
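    A minimal sketch of loading this dataset with the Hugging Face datasets library; the "train" split name is an assumption, since the card excerpt above does not list the available splits:

        from datasets import load_dataset

        # Load the grade-aware simplification examples from the Hub.
        ds = load_dataset("yimingwang123/grade-aware-llm-training-data", split="train")
        print(ds[0])  # inspect one instruction-tuning example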
