100+ datasets found
  1. issues-kaggle-notebooks

    • huggingface.co
    Updated Aug 12, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face Smol Models Research (2025). issues-kaggle-notebooks [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks
    Explore at:
    Dataset updated
    Aug 12, 2025
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face Smol Models Research
    Description

    GitHub Issues & Kaggle Notebooks

      Description
    

    GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language models training, they are sourced from GitHub issues and notebooks in Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, precisely the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.

  2. hugging-face-transformer-bert-large-uncased

    • kaggle.com
    Updated Jun 21, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    BowlOFruits (2021). hugging-face-transformer-bert-large-uncased [Dataset]. https://www.kaggle.com/datasets/ar4s23d6man/huggingfacetransformerbertlargeuncased
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 21, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    BowlOFruits
    Description

    Dataset

    This dataset was created by BowlOFruits

    Contents

  3. data

    • huggingface.co
    Updated Jul 27, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kaggle MAP (2025). data [Dataset]. https://huggingface.co/datasets/kaggle-map/data
    Explore at:
    Dataset updated
    Jul 27, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Kaggle MAP
    Description

    kaggle-map/data dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. All GPT-4 Conversations

    • kaggle.com
    Updated Nov 21, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2023). All GPT-4 Conversations [Dataset]. https://www.kaggle.com/datasets/thedevastator/all-gpt-4-synthetic-chat-datasets
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    The Devastator
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    All GPT-4 Generated Datasets

    Every chat dataset generated by GPT-4 from Huggingface at the same format

    From [Huggingface datasets]

    About this dataset

    How to use the dataset

    The dataset includes all chat conversations generated by GPT-4 that are hosted on open Huggingface datasets. Everything is converted to the same format so the datasets can be easily merged and used for large scale training of LLMs.

    Acknowledgements

    This dataset is a collection of several single chat datasets. If you use this dataset in your research, please credit the original authors of the internal datasets. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

  5. h

    test-dataset-kaggle

    • huggingface.co
    Updated Feb 15, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gholamreza Dar (2024). test-dataset-kaggle [Dataset]. https://huggingface.co/datasets/Gholamreza/test-dataset-kaggle
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 15, 2024
    Authors
    Gholamreza Dar
    Description

    Gholamreza/test-dataset-kaggle dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. roleplay_snapshot

    • kaggle.com
    Updated Sep 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gabriel Preda (2024). roleplay_snapshot [Dataset]. https://www.kaggle.com/datasets/gpreda/roleplay-snapshot
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 19, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Gabriel Preda
    Description

    This dataset is a subset of https://huggingface.co/datasets/hieunguyenminh/roleplay

    It is used to exemplify the fine-tune of Gemma 2 2B model with roleplay data where we use samples of dialogues user/agent (with a system prompt/description) for each character (personality) we want to teach Gemma to imitate.

    For training, we will process the "text" column to extract triplets {system | user | assistant} and compose the prompts with which we fine-tune the model.

  7. Huggingface Transformers 2.8.0 whl

    • kaggle.com
    Updated Apr 26, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    xhlulu (2020). Huggingface Transformers 2.8.0 whl [Dataset]. https://www.kaggle.com/xhlulu/huggingface-transformers-280-whl/tasks
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 26, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    xhlulu
    Description

    Dataset

    This dataset was created by xhlulu

    Contents

  8. MeDAL Dataset

    • kaggle.com
    • opendatalab.com
    • +1more
    zip
    Updated Nov 16, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    xhlulu (2020). MeDAL Dataset [Dataset]. https://www.kaggle.com/xhlulu/medal-emnlp
    Explore at:
    zip(7324382521 bytes)Available download formats
    Dataset updated
    Nov 16, 2020
    Authors
    xhlulu
    Description

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2352583%2F868a18fb09d7a1d3da946d74a9857130%2FLogo.PNG?generation=1604973725053566&alt=media" alt="">

    Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.

    💻 Code 🤗 Dataset (Hugging Face) 💾 Dataset (Kaggle) 💽 Dataset (Zenodo) 📜 Paper (ACL) 📝 Paper (Arxiv)Pre-trained ELECTRA (Hugging Face)

    Downloading the data

    We recommend downloading from Kaggle if you can authenticate through their API. The advantage to Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.

    First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API: pip install kaggle

    Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run: kaggle datasets download xhlulu/medal-emnlp

    Now, unzip everything and place them inside the data directory: unzip -nq crawl-300d-2M-subword.zip -d data mv data/pretrain_sample/* data/

    Loading FastText Embeddings

    For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights: wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip unzip -nq data/crawl-300d-2M-subword.zip -d data/

    Model Quickstart

    Using Torch Hub

    You can directly load LSTM and LSTM-SA with torch.hub: ```python import torch

    lstm = torch.hub.load("BruceWen120/medal", "lstm") lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa") ```

    If you want to use the Electra model, you need to first install transformers: pip install transformers Then, you can load it with torch.hub: python import torch electra = torch.hub.load("BruceWen120/medal", "electra")

    Using Huggingface transformers

    If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load it directly from the Hugging Face Repository:

    from transformers import AutoModel, AutoTokenizer
    
    model = AutoModel.from_pretrained("xhlu/electra-medal")
    tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
    

    Citation

    Download the bibtex here, or copy the text below: @inproceedings{wen-etal-2020-medal, title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining", author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva", booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15", pages = "130--135", }

    License, Terms and Conditions

    The ELECTRA model is licensed under Apache 2.0. The license for the libraries used in this project (transformers, pytorch, etc.) can be found in their respective GitHub repository. Our model is released under a MIT license.

    The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:

    INTRODUCTION

    Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.

    MEDLINE/PUBMED SPECIFIC TERMS

    NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.

    GENERAL TERMS AND CONDITIONS

    • Users of the data agree to:

      • acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
      • properly use registration and/or trademark symbols when referring to NLM products, and
      • not indicate or imply that NLM has endorsed its products/services/applications.
    • Users who republish or redistribute the data (services, products or raw data) agree to:

      • maintain the most current version of all distributed data, or
      • make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
    • These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.

    • NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.

    • NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.

  9. huggingface-models-new

    • kaggle.com
    Updated Jun 15, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Moein Shariatnia (2021). huggingface-models-new [Dataset]. https://www.kaggle.com/moeinshariatnia/huggingfacemodelsnew/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 15, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Moein Shariatnia
    Description

    Dataset

    This dataset was created by Moein Shariatnia

    Contents

  10. Huggingface distilbert-base-cased

    • kaggle.com
    Updated Apr 4, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kuro (2021). Huggingface distilbert-base-cased [Dataset]. https://www.kaggle.com/enukuro/huggingface-distilbertbasecased/tasks
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 4, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Kuro
    Description

    Dataset

    This dataset was created by Kuro

    Contents

  11. XLNet base cased

    • kaggle.com
    Updated Jun 17, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Liyan Tang (2020). XLNet base cased [Dataset]. https://www.kaggle.com/datasets/jay0606/xlnet-base-cased/data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 17, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Liyan Tang
    Description

    Dataset

    This dataset was created by Liyan Tang

    Contents

  12. huggingface_hub 0.0.8

    • kaggle.com
    zip
    Updated Jun 18, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Y.Nakama (2021). huggingface_hub 0.0.8 [Dataset]. https://www.kaggle.com/yasufuminakama/huggingface-hub-008
    Explore at:
    zip(33225 bytes)Available download formats
    Dataset updated
    Jun 18, 2021
    Authors
    Y.Nakama
    Description
  13. bitsandbytes

    • kaggle.com
    Updated Dec 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FullEmpty (2024). bitsandbytes [Dataset]. https://www.kaggle.com/datasets/gowillgo/bitsandbytes/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 7, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    FullEmpty
    Description

    Dataset

    This dataset was created by FullEmpty

    Contents

    Bitsandbytes 0.44.1

  14. aime_filtered

    • huggingface.co
    Updated Aug 29, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kaggle winners (2024). aime_filtered [Dataset]. https://huggingface.co/datasets/kaggle-aimo/aime_filtered
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Kaggle winners
    Description

    kaggle-aimo/aime_filtered dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. h

    refined-kaggle

    • huggingface.co
    Updated Aug 9, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Taha K (2024). refined-kaggle [Dataset]. https://huggingface.co/datasets/tsk-18/refined-kaggle
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 9, 2024
    Authors
    Taha K
    Description

    tsk-18/refined-kaggle dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. luke_base huggingface model

    • kaggle.com
    Updated Dec 18, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    raghavendrakuttala (2021). luke_base huggingface model [Dataset]. https://www.kaggle.com/raghavendrakotala/luke-base-huggingface-model/discussion
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 18, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    raghavendrakuttala
    Description

    Dataset

    This dataset was created by raghavendrakuttala

    Contents

  17. huggingface-distilbert-tokenizer

    • kaggle.com
    Updated Jun 4, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Maris Galesloot (2021). huggingface-distilbert-tokenizer [Dataset]. https://www.kaggle.com/marisgalesloot/huggingfacedistilberttokenizer/metadata
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 4, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Maris Galesloot
    Description

    Dataset

    This dataset was created by Maris Galesloot

    Contents

  18. h

    kaggle-claim-type

    • huggingface.co
    Updated Jun 15, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    tasksource (2023). kaggle-claim-type [Dataset]. https://huggingface.co/datasets/tasksource/kaggle-claim-type
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 15, 2023
    Dataset authored and provided by
    tasksource
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    tasksource/kaggle-claim-type dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. C

    Community-Driven Model Service Platform Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Market Report Analytics (2025). Community-Driven Model Service Platform Report [Dataset]. https://www.marketreportanalytics.com/reports/community-driven-model-service-platform-73131
    Explore at:
    pdf, doc, pptAvailable download formats
    Dataset updated
    Apr 9, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policyhttps://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Community-Driven Model Service Platform market is experiencing robust growth, projected to reach $35.14 billion in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 10.1% from 2025 to 2033. This expansion is fueled by several key factors. The increasing adoption of machine learning and artificial intelligence across diverse sectors, coupled with the need for readily accessible and collaboratively improved models, is driving significant demand. The open-source nature of many platforms fosters innovation and reduces barriers to entry for both developers and businesses. Furthermore, the rise of cloud-based solutions offers scalability and cost-effectiveness, contributing to market expansion. The platform's segmentation into adult and children's applications reflects diverse use cases, ranging from sophisticated research projects to educational tools, further broadening its appeal. The presence of established players like Kaggle, GitHub, and Hugging Face indicates a maturing market with strong community engagement, while the existence of on-premises options caters to businesses with stringent data security requirements. Geographical expansion is also a significant contributor to growth, with North America and Europe currently leading the market, while Asia-Pacific is poised for significant future expansion driven by increasing digitalization and technological advancements. The market's continued growth is anticipated to be driven by advancements in model training techniques, the development of more user-friendly interfaces, and the increasing integration of these platforms with other data science tools and workflows. Challenges remain, however, such as ensuring data quality and addressing potential biases in community-contributed models. Furthermore, regulatory concerns around data privacy and model transparency will need to be carefully addressed to maintain sustainable growth. The competitive landscape is expected to remain dynamic, with ongoing innovation and consolidation among existing players and the emergence of new entrants. The strategic focus on improving model accessibility, enhancing community engagement, and expanding into new geographical markets will be key determinants of success in this rapidly evolving sector.

  20. distilbert-base-uncased-huggingface-transformer

    • kaggle.com
    Updated Feb 12, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    tk (2020). distilbert-base-uncased-huggingface-transformer [Dataset]. https://www.kaggle.com/tkscsk/distilbertbaseuncasedhuggingfacetransformer/activity
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 12, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    tk
    Description

    Dataset

    This dataset was created by tk

    Contents

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Hugging Face Smol Models Research (2025). issues-kaggle-notebooks [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks
Organization logo

issues-kaggle-notebooks

HuggingFaceTB/issues-kaggle-notebooks

Explore at:
Dataset updated
Aug 12, 2025
Dataset provided by
Hugging Facehttps://huggingface.co/
Authors
Hugging Face Smol Models Research
Description

GitHub Issues & Kaggle Notebooks

  Description

GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language models training, they are sourced from GitHub issues and notebooks in Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, precisely the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.

Search
Clear search
Close search
Google apps
Main menu