GitHub Issues & Kaggle Notebooks
Description
GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language model training, sourced from GitHub issues and notebooks on the Kaggle platform. These datasets are a modified subset of the StarCoder2 training corpus, specifically the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.
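A minimal sketch for peeking at the data with the datasets library (the configuration name "issues" is an assumption; check the dataset page for the actual configuration names):

```python
from datasets import load_dataset

# Stream a few samples without downloading the full corpus.
# The config name "issues" is assumed, not confirmed by the card.
ds = load_dataset(
    "HuggingFaceTB/issues-kaggle-notebooks", "issues",
    split="train", streaming=True,
)
for i, sample in enumerate(ds):
    print(sample)
    if i >= 2:
        break
```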
This dataset was created by BowlOFruits
kaggle-map/data dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset includes all chat conversations generated by GPT-4 that are hosted in open Hugging Face datasets. Everything is converted to the same format so the datasets can be easily merged and used for large-scale training of LLMs; a sketch of such a unified format follows below.
This dataset is a collection of several single-chat datasets. If you use this dataset in your research, please credit the original authors of the internal datasets. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
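As a hedged illustration of why a shared format helps (the field names below are assumptions, not the dataset's documented schema), conversations normalized to role/content turns can be concatenated directly:

```python
# Hypothetical unified schema: each sample is a list of role/content turns
# plus a source tag so attribution survives merging.
sample_a = {"source": "dataset-a", "conversations": [
    {"role": "user", "content": "Explain overfitting."},
    {"role": "assistant", "content": "Overfitting means the model memorizes noise..."},
]}
sample_b = {"source": "dataset-b", "conversations": [
    {"role": "user", "content": "What is a tensor?"},
    {"role": "assistant", "content": "A multi-dimensional array..."},
]}

def merge(*datasets):
    """Concatenate datasets that already share the unified schema."""
    return [row for ds in datasets for row in ds]

merged = merge([sample_a], [sample_b])
print(len(merged))  # 2
```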
Gholamreza/test-dataset-kaggle dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset is a subset of https://huggingface.co/datasets/hieunguyenminh/roleplay
It is used to demonstrate fine-tuning the Gemma 2 2B model on roleplay data, where we use user/agent dialogue samples (with a system prompt/description) for each character (personality) we want Gemma to imitate.
For training, we will process the "text" column to extract {system | user | assistant} triplets and compose the prompts with which we fine-tune the model, as sketched below.
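A sketch of that preprocessing step (the turn delimiters assumed in the "text" column are hypothetical; adapt the parsing to the actual sample layout):

```python
def to_gemma_prompt(text: str) -> str:
    """Extract a {system | user | assistant} triplet from a raw sample and
    compose a Gemma-style chat prompt.

    Assumes each turn in the "text" column starts with a 'system:', 'user:'
    or 'assistant:' marker; the real layout may differ.
    """
    parts = {}
    for line in text.splitlines():
        for role in ("system", "user", "assistant"):
            prefix = role + ":"
            if line.lower().startswith(prefix):
                parts[role] = line[len(prefix):].strip()
    # Gemma has no dedicated system role, so fold the character
    # description into the user turn.
    return (
        f"<start_of_turn>user\n{parts.get('system', '')}\n"
        f"{parts.get('user', '')}<end_of_turn>\n"
        f"<start_of_turn>model\n{parts.get('assistant', '')}<end_of_turn>"
    )
```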
This dataset was created by xhlulu
Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.
💻 Code 🤗 Dataset (Hugging Face) 💾 Dataset (Kaggle) 💽 Dataset (Zenodo) 📜 Paper (ACL) 📝 Paper (Arxiv) ⚡ Pre-trained ELECTRA (Hugging Face)
We recommend downloading from Kaggle if you can authenticate through their API. The advantage of Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.
First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:
pip install kaggle
Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:
kaggle datasets download xhlulu/medal-emnlp
Now, unzip everything and place them inside the `data` directory:
unzip -nq medal-emnlp.zip -d data
mv data/pretrain_sample/* data/
For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:
wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
unzip -nq data/crawl-300d-2M-subword.zip -d data/
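Once extracted, the binary can be loaded with the fasttext package, for example (an illustrative snippet; the repository's LSTM training code may read the .vec file instead):

```python
import fasttext  # pip install fasttext

# Load the subword-aware binary extracted above.
ft = fasttext.load_model("data/crawl-300d-2M-subword.bin")
vec = ft.get_word_vector("pneumonia")
print(vec.shape)  # (300,)
```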
You can directly load LSTM and LSTM-SA with `torch.hub`:
```python
import torch

lstm = torch.hub.load("BruceWen120/medal", "lstm")
lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")
```
If you want to use the Electra model, you need to first install transformers:
pip install transformers
Then, you can load it with `torch.hub`:
```python
import torch

electra = torch.hub.load("BruceWen120/medal", "electra")
```
If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load the model directly from the Hugging Face repository with `transformers`:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("xhlu/electra-medal")
tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
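For example, the pre-trained encoder can then be used to embed a sentence (an illustrative snippet, not from the original README):

```python
import torch

# Encode a clinical sentence and inspect the contextual embeddings.
inputs = tokenizer("The patient was given ASA for chest pain.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```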
Download the BibTeX here, or copy the text below:
@inproceedings{wen-etal-2020-medal,
title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
pages = "130--135",
}
The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (`transformers`, `pytorch`, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.
The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:
INTRODUCTION
Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.
MEDLINE/PUBMED SPECIFIC TERMS
NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.
GENERAL TERMS AND CONDITIONS
Users of the data agree to:
- acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
- properly use registration and/or trademark symbols when referring to NLM products, and
- not indicate or imply that NLM has endorsed its products/services/applications.
Users who republish or redistribute the data (services, products or raw data) agree to:
- maintain the most current version of all distributed data, or
- make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.
NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.
NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.
This dataset was created by Moein Shariatnia
This dataset was created by Kuro
This dataset was created by Liyan Tang
This dataset was created by FullEmpty
Bitsandbytes 0.44.1
kaggle-aimo/aime_filtered dataset hosted on Hugging Face and contributed by the HF Datasets community
tsk-18/refined-kaggle dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by raghavendrakuttala
This dataset was created by Maris Galesloot
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
tasksource/kaggle-claim-type dataset hosted on Hugging Face and contributed by the HF Datasets community
https://www.marketreportanalytics.com/privacy-policy
The Community-Driven Model Service Platform market is experiencing robust growth, projected to reach $35.14 billion in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 10.1% from 2025 to 2033. This expansion is fueled by several key factors. The increasing adoption of machine learning and artificial intelligence across diverse sectors, coupled with the need for readily accessible and collaboratively improved models, is driving significant demand. The open-source nature of many platforms fosters innovation and reduces barriers to entry for both developers and businesses. Furthermore, the rise of cloud-based solutions offers scalability and cost-effectiveness, contributing to market expansion. The platform's segmentation into adult and children's applications reflects diverse use cases, ranging from sophisticated research projects to educational tools, further broadening its appeal.

The presence of established players like Kaggle, GitHub, and Hugging Face indicates a maturing market with strong community engagement, while the existence of on-premises options caters to businesses with stringent data security requirements. Geographical expansion is also a significant contributor to growth, with North America and Europe currently leading the market, while Asia-Pacific is poised for significant future expansion driven by increasing digitalization and technological advancements.

The market's continued growth is anticipated to be driven by advancements in model training techniques, the development of more user-friendly interfaces, and the increasing integration of these platforms with other data science tools and workflows. Challenges remain, however, such as ensuring data quality and addressing potential biases in community-contributed models. Furthermore, regulatory concerns around data privacy and model transparency will need to be carefully addressed to maintain sustainable growth. The competitive landscape is expected to remain dynamic, with ongoing innovation and consolidation among existing players and the emergence of new entrants. The strategic focus on improving model accessibility, enhancing community engagement, and expanding into new geographical markets will be key determinants of success in this rapidly evolving sector.
This dataset was created by tk