45 datasets found
  1. Hugging Face Models
     Listings of public machine learning model repository metadata on Hugging Face

    • kaggle.com
    zip
    Updated Nov 28, 2023
    Cite
    A T M Ragib Raihan (2023). Hugging Face Models [Dataset]. https://www.kaggle.com/datasets/atmragib/hugging-face-models/code
    Explore at:
    Available download formats: zip (13652285 bytes)
    Dataset updated
    Nov 28, 2023
    Authors
    A T M Ragib Raihan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Context

    The Hugging Face Hub hosts many models for a variety of machine learning tasks. Models are stored in repositories, so they benefit from all the features possessed by every repo on the Hugging Face Hub.

    Data Source Link: huggingface.co/models

    Attribute Information

    Variable | Description
    model_id |
    pipeline | There are 40 pipelines in total. To learn more, read: Hugging Face Pipeline
    downloads |
    likes |
    author_id |
    author_name |
    author_type | user or organization
    author_isPro | Paid user or organization
    lastModified | from 2014-08-10 to 2023-11-27
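
    A quick exploration sketch in pandas, assuming the zip has been extracted to a CSV named hugging_face_models.csv (the file name is an assumption; check the archive contents). Column names follow the attribute table above:

    import pandas as pd

    # Hypothetical file name; inspect the downloaded zip for the actual CSV.
    df = pd.read_csv("hugging_face_models.csv")

    # Most common of the 40 pipeline types.
    print(df["pipeline"].value_counts().head(10))

    # Most downloaded models.
    print(df.sort_values("downloads", ascending=False)[["model_id", "downloads", "likes"]].head())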
  2. ktda-datasets

    • huggingface.co
    Updated Dec 8, 2024
    Cite
    XavierJiezou (2024). ktda-datasets [Dataset]. https://huggingface.co/datasets/XavierJiezou/ktda-datasets
    Explore at:
    Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 8, 2024
    Authors
    XavierJiezou
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    KTDA-Datasets

    This dataset card describes the datasets used in KTDA.

      Install
    

    pip install huggingface-hub

      Usage
    

    Step 1: Download datasets

    huggingface-cli download --repo-type dataset XavierJiezou/ktda-datasets --local-dir data --include grass.zip
    huggingface-cli download --repo-type dataset XavierJiezou/ktda-datasets --local-dir data --include cloud.zip

    Step 2: Extract datasets

    unzip grass.zip -d grass
    unzip cloud.zip -d l8_biome
    … See the full description on the dataset page: https://huggingface.co/datasets/XavierJiezou/ktda-datasets.
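
    An equivalent download in Python, as a minimal sketch using huggingface_hub (the repo ID and the file names grass.zip and cloud.zip come from the commands above):

    from huggingface_hub import hf_hub_download

    # Download each archive from the dataset repo into ./data
    for filename in ["grass.zip", "cloud.zip"]:
        hf_hub_download(
            repo_id="XavierJiezou/ktda-datasets",
            repo_type="dataset",
            filename=filename,
            local_dir="data",
        )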

  3. Huggingface Modelhub

    • kaggle.com
    zip
    Updated Jun 19, 2021
    Cite
    Kartik Godawat (2021). Huggingface Modelhub [Dataset]. https://www.kaggle.com/crazydiv/huggingface-modelhub
    Explore at:
    Available download formats: zip (2274876 bytes)
    Dataset updated
    Jun 19, 2021
    Authors
    Kartik Godawat
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Dataset containing metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between June 15 and June 20, 2021.

    The dataset was generated using the huggingface_hub API provided by the Hugging Face team.

    Update v3:

    • Added Downloads last month metric
    • Added library name

    Contents:

    • huggingface_models.csv: Primary file containing metadata such as model name, tags, last modified date, and filenames
    • huggingface_modelcard_readme.csv: Detailed file containing README.md contents (in markdown format) where available for a model. The modelId column joins the two files together (see the join sketch below).

    huggingface_models.csv

    • modelId: ID of the model as present on the HF website
    • lastModified: Time when the model was last modified
    • tags: Tags associated with the model (provided by the maintainer)
    • pipeline_tag: If present, denotes which pipeline this model can be used with
    • files: List of available files in the model repo
    • publishedBy: Custom column derived from modelId, specifying who published the model
    • downloads_last_month: Number of times the model was downloaded in the last month
    • library: Name of the library the model belongs to, e.g. transformers, spacy, timm

    huggingface_modelcard_readme.csv

    • modelId: ID of the model as available on the HF website
    • modelCard: README contents of a model (referred to as a model card in the HuggingFace ecosystem). It contains useful information on how the model was trained, benchmarks, and author notes.

    Inspiration

    The idea of analyzing publicly available models on HuggingFace struck me while attending a live session of the amazing transformers course by @LysandreJik. Soon after, I tweeted the team and asked for permission to create such a dataset. Special shoutout to @osanseviero for encouraging me and pointing me in the right direction.
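
    A minimal sketch of the modelId join described above, assuming both CSVs have been extracted to the working directory:

    import pandas as pd

    models = pd.read_csv("huggingface_models.csv")
    readmes = pd.read_csv("huggingface_modelcard_readme.csv")

    # modelId is the shared key; a left join keeps models without a README.
    merged = models.merge(readmes, on="modelId", how="left")
    print(merged[["modelId", "pipeline_tag", "downloads_last_month"]].head())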

    This is my first dataset upload on Kaggle. I hope you like it. :)

  4. repo_names

    • huggingface.co
    Updated Jul 26, 2023
    Cite
    git2vec (2023). repo_names [Dataset]. https://huggingface.co/datasets/git2vec/repo_names
    Explore at:
    Dataset updated
    Jul 26, 2023
    Dataset authored and provided by
    git2vec
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description


    This dataset tracks repository name changes over time. Each row represents a unique combination of repository ID and name, with the timestamp of when that name was first observed. Since GitHub allows repository renaming while preserving the internal repository ID, this dataset enables tracking the full naming history of any repository.

      Schema
    

    Column | Type | Description
    repo_id | int64 | GitHub's internal repository identifier
    repo_name | string | Repository… See the full description on the dataset page: https://huggingface.co/datasets/git2vec/repo_names.
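
    A loading sketch with the datasets library (the split name "train" is an assumption; check the repo's configuration):

    from datasets import load_dataset

    ds = load_dataset("git2vec/repo_names", split="train")  # "train" split is assumed
    print(ds.features)  # expect repo_id (int64) and repo_name (string), per the schema above
    print(ds[0])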

  5. google/flan-t5-large

    • kaggle.com
    zip
    Updated Jul 14, 2023
    + more versions
    Cite
    d0rj_ (2023). google/flan-t5-large [Dataset]. https://www.kaggle.com/datasets/d0rj3228/googleflan-t5-large
    Explore at:
    Available download formats: zip (23751646406 bytes)
    Dataset updated
    Jul 14, 2023
    Authors
    d0rj_
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Info

    Source repo is google/flan-t5-large.

    Usage

    1. Add the dataset to a Kaggle notebook;
    2. Import the pretrained model from the folder:
    from transformers import AutoTokenizer, AutoModel
    
    
    model = AutoModel.from_pretrained('/kaggle/input/googleflan-t5-large/flan-t5-large')
    tokenizer = AutoTokenizer.from_pretrained('/kaggle/input/googleflan-t5-large/flan-t5-large')
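    # Since FLAN-T5 is a sequence-to-sequence model, generation needs a seq2seq head.
    # A hedged sketch (AutoModelForSeq2SeqLM instead of AutoModel; same local path):
    from transformers import AutoModelForSeq2SeqLM

    seq2seq = AutoModelForSeq2SeqLM.from_pretrained('/kaggle/input/googleflan-t5-large/flan-t5-large')
    inputs = tokenizer('Translate to German: How are you?', return_tensors='pt')
    outputs = seq2seq.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))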
    
    
  6. Huggingface BERT

    • kaggle.com
    zip
    Updated Jun 21, 2025
    Cite
    xhlulu (2025). Huggingface BERT [Dataset]. https://www.kaggle.com/xhlulu/huggingface-bert
    Explore at:
    Available download formats: zip (25978385354 bytes)
    Dataset updated
    Jun 21, 2025
    Authors
    xhlulu
    Description

    This dataset contains many popular BERT weights retrieved directly from Hugging Face's model repository and hosted on Kaggle. It is automatically updated every month to ensure that the latest version is available to the user. By making it a dataset, it is significantly faster to load the weights, since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook.

    The banner was adapted from figures by Jimmy Lin (tweet; slide) released under CC BY 4.0. BERT has an Apache 2.0 license according to the model repository.

    Quick Start

    To use this dataset, simply attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer, AutoModelForMaskedLM
    
    MODEL_DIR = "/kaggle/input/huggingface-bert/"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "bert-large-uncased")
    model = AutoModelForMaskedLM.from_pretrained(MODEL_DIR + "bert-large-uncased")
    

    Acknowledgements

    All the copyrights and IP relating to BERT belong to the original authors (Devlin et al., 2019) and Google. All copyrights relating to the transformers library belong to Hugging Face. The banner image was created thanks to Jimmy Lin, so any modification of this figure should mention the original author and respect the conditions of the license; all copyrights related to the images belong to him.

    Some of the models are community created or trained. Please reach out directly to the authors if you have questions regarding licenses and usage.

  7. github-r-repos

    • huggingface.co
    Updated Jun 6, 2023
    Cite
    Daniel Falbel (2023). github-r-repos [Dataset]. https://huggingface.co/datasets/dfalbel/github-r-repos
    Explore at:
    Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 6, 2023
    Authors
    Daniel Falbel
    License

    Other: https://choosealicense.com/licenses/other/

    Description

    GitHub R repositories dataset

    R source files from GitHub. This dataset has been created using the public GitHub datasets from Google BigQuery. This is the actual query that has been used to export the data:

    EXPORT DATA OPTIONS (
      uri = 'gs://your-bucket/gh-r/*.parquet',
      format = 'PARQUET') as (
      select f.id, f.repo_name, f.path, c.content, c.size
      from (
        SELECT distinct id, repo_name, path
        FROM bigquery-public-data.github_repos.files
        where ends_with(path…

    See the full description on the dataset page: https://huggingface.co/datasets/dfalbel/github-r-repos.
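
    A loading sketch with the datasets library, assuming the exported parquet files are exposed as a standard "train" split:

    from datasets import load_dataset

    # Stream rows to avoid downloading all R source files at once.
    ds = load_dataset("dfalbel/github-r-repos", split="train", streaming=True)
    for row in ds.take(3):
        print(row["repo_name"], row["path"])  # columns per the export query above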

  8. HF FineWeb 2 Dataset

    • kaggle.com
    zip
    Updated Jan 28, 2025
    Cite
    Umer Haddii (2025). HF FineWeb 2 Dataset [Dataset]. https://www.kaggle.com/datasets/umerhaddii/fineweb-2-dataset
    Explore at:
    Available download formats: zip (1224570 bytes)
    Dataset updated
    Jan 28, 2025
    Authors
    Umer Haddii
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Context

    FineWeb 2 is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. For the actual data, please see the HuggingFace repository.


    The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments.

    In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2 outperforms other popular multilingual pretraining datasets (such as CC-100, mC4, CulturaX or HPLT) while being substantially larger, and in some cases it even performs better than datasets specifically curated for a single one of these languages, on our diverse set of carefully selected evaluation tasks: FineTasks.

    The dataset is also listed on Hugging Face; here is the official HF page.

    "My focus is on sharing this valuable open-source dataset to help AI and ML practitioners easily find resources on Kaggle."

    Detailed information about FineWeb 2 is listed in the README.md file below ↓

    Acknowledgement

    Hugging Face FW

  9. TF-ID-arxiv-papers

    • huggingface.co
    Updated Jul 11, 2024
    Cite
    Yifei Hu (2024). TF-ID-arxiv-papers [Dataset]. https://huggingface.co/datasets/yifeihu/TF-ID-arxiv-papers
    Explore at:
    Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 11, 2024
    Authors
    Yifei Hu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    TF-ID arXiv papers dataset

    This is the dataset for finetuning the TF-ID models. It contains about 4,600 images (academic paper pages) with bounding boxes of tables and figures in COCO format. The papers are selected from Hugging Face Daily Papers, covering mostly AI/ML/DL-related topics. You can use this dataset to reproduce all TF-ID models. All bounding boxes were annotated manually by Yifei Hu.

      Project Repo
    

    github.com/ai8hyf/TF-ID

      Variants
    

    Unzip the… See the full description on the dataset page: https://huggingface.co/datasets/yifeihu/TF-ID-arxiv-papers.
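
    Since the usage section above is truncated, here is a hedged sketch for fetching the dataset files with huggingface_hub before unzipping:

    from huggingface_hub import snapshot_download

    # Download all files in the dataset repo to a local folder.
    local_dir = snapshot_download(
        repo_id="yifeihu/TF-ID-arxiv-papers",
        repo_type="dataset",
    )
    print(local_dir)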

  10. XAMI-dataset

    • huggingface.co
    Updated Aug 26, 2024
    Cite
    Elisabeta-Iulia Dima (2024). XAMI-dataset [Dataset]. https://huggingface.co/datasets/iulia-elisa/XAMI-dataset
    Explore at:
    Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 26, 2024
    Authors
    Elisabeta-Iulia Dima
    Description

    XAMI: XMM-Newton optical Artefact Mapping for astronomical Instance segmentation

    The Dataset

    Check the XAMI model and the XAMI dataset on Github.

      Downloading the dataset
    

    Using a Python script:

    from huggingface_hub import hf_hub_download

    dataset_name = 'xami_dataset'  # the dataset name on Hugging Face
    images_dir = '.'  # the output directory for the dataset images

    hf_hub_download(
        repo_id="iulia-elisa/XAMI-dataset",  # the Hugging Face repo ID
        repo_type='dataset'…

    See the full description on the dataset page: https://huggingface.co/datasets/iulia-elisa/XAMI-dataset.
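
    The call above is truncated on the source page; a complete hedged version might look like this (the filename is hypothetical, derived from the dataset_name variable; list the repo files to confirm):

    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id="iulia-elisa/XAMI-dataset",
        repo_type="dataset",
        filename="xami_dataset.zip",  # hypothetical file name
        local_dir=".",
    )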

  11. disease_data

    • huggingface.co
    Updated Mar 19, 2025
    Cite
    Tushar Milind Bansod (2025). disease_data [Dataset]. https://huggingface.co/datasets/Tusharbansod108/disease_data
    Explore at:
    Dataset updated
    Mar 19, 2025
    Authors
    Tushar Milind Bansod
    Description

    from huggingface_hub import HfApi

    api = HfApi()
    api.upload_file(
        path_or_fileobj=r"C:\Users\tusha\Desktop\New folder",  # Replace with your file path (raw string avoids the invalid \U escape)
        path_in_repo="data.csv",
        repo_id="Tusharbansod108/disease_data",  # Replace with your repo ID
        repo_type="dataset",
    )

  12. cross-encoder/nli-distilroberta-base-v2

    • kaggle.com
    zip
    Updated Jul 20, 2021
    Cite
    Ehsan (2021). cross-encoder/nli-distilroberta-base-v2 [Dataset]. https://www.kaggle.com/safavieh/crossencodernlidistilrobertabasev2
    Explore at:
    Available download formats: zip (305434985 bytes)
    Dataset updated
    Jul 20, 2021
    Authors
    Ehsan
    Description

    Context

    This is a pretrained transformer that is available in the transformers module from Hugging Face:

    https://huggingface.co/cross-encoder/nli-distilroberta-base

    The files in this repository were uploaded from the developers' website:

    https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/nli-distilroberta-base-v2.zip

    Read the README.md file in the Hugging Face repo for more info: https://huggingface.co/cross-encoder/nli-distilroberta-base/blob/main/README.md

    Also, take a look at the sentence-transformers documentation for more models and usage: https://www.sbert.net/docs/pretrained_models.html

    Usage

    The model files are located in the 0_Transformer folder.

    Example:

    from transformers import pipeline

    classifier = pipeline(
        "zero-shot-classification",
        model="../input/crossencodernlidistilrobertabasev2/0_Transformer",
    )
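
    A short usage sketch (the input text and labels are illustrative):

    result = classifier(
        "The new GPU doubles training throughput.",
        candidate_labels=["technology", "sports", "politics"],  # illustrative labels
    )
    print(result["labels"][0], result["scores"][0])  # top predicted label and its score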
    
  13. maniskill_assets

    • huggingface.co
    Updated Oct 6, 2025
    Cite
    RLinf (2025). maniskill_assets [Dataset]. https://huggingface.co/datasets/RLinf/maniskill_assets
    Explore at:
    Dataset updated
    Oct 6, 2025
    Dataset authored and provided by
    RLinf
    License

    ODC-By: https://choosealicense.com/licenses/odc-by/

    Description

    Asset Download

    The assets need to be placed into RLinf's ManiSkill environment folder with the name assets.

    uv pip install huggingface_hub  # if you don't have it
    cd

    You can also use git to clone the repository:

    cd
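
    The commands above are truncated on the source page. As a hedged alternative, the repo can be fetched with huggingface_hub and placed under an assets folder (the destination path inside RLinf is an assumption based on the note above):

    from huggingface_hub import snapshot_download

    # Download the asset repo into ./assets (move into RLinf's ManiSkill
    # environment folder as required).
    snapshot_download(
        repo_id="RLinf/maniskill_assets",
        repo_type="dataset",
        local_dir="assets",
    )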

      License
    

    Our assets are attributed to… See the full description on the dataset page: https://huggingface.co/datasets/RLinf/maniskill_assets.

  14. GLiNER Github Repo

    • kaggle.com
    zip
    Updated Oct 26, 2025
    Cite
    Darien Schettler (2025). GLiNER Github Repo [Dataset]. https://www.kaggle.com/dschettler8845/gliner-github-repo
    Explore at:
    Available download formats: zip (545226 bytes)
    Dataset updated
    Oct 26, 2025
    Authors
    Darien Schettler
    Description

    GLiNER : Generalist and Lightweight model for Named Entity Recognition

    GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and Large Language Models (LLMs) that, despite their flexibility, are costly and large for resource-constrained scenarios.

    Demo Image

    Models Status

    📢 Updates

    • 📝 Finetuning notebook is available: examples/finetune.ipynb
    • 🗂 Training dataset preprocessing scripts are now available in the data/ directory, covering both Pile-NER 📚 and NuNER 📘 datasets.

    Available Models on Hugging Face

    To Release

    • [ ] ⏳ GLiNER-Multiv2
    • [ ] ⏳ GLiNER-Sup (trained on a mixture of NER datasets)

    Area of improvements / research

    • [ ] Allow longer context (e.g. train with long-context transformers such as Longformer, LED, etc.)
    • [ ] Use a bi-encoder (entity encoder and span encoder), allowing precomputation of entity embeddings
    • [ ] Add a filtering mechanism to reduce the number of spans before final classification, to save memory and computation when the number of entity types is large
    • [ ] Improve understanding of more detailed prompts/instructions, e.g. "Find the first name of the person in the text"
    • [ ] Better loss function: for instance, use Focal Loss (see this paper) instead of BCE to handle class imbalance, as some entity types are more frequent than others
    • [ ] Improve multi-lingual capabilities: train on more languages, and use multi-lingual training data
    • [ ] Decoding: allow a span to have multiple labels, e.g. "Cristiano Ronaldo" is both a "person" and a "football player"
    • [ ] Dynamic thresholding (in model.predict_entities(text, labels, threshold=0.5)): allow the model to predict more or fewer entities depending on the context. Currently, the model tends to predict fewer entities where the entity type or the domain is not well represented in the training data.
    • [ ] Train with EMAs (Exponential Moving Averages) or merge multiple checkpoints to improve model robustness (see this paper)
    • [ ] Extend the model to relation extraction, which needs a dataset with relation annotations; see our preliminary work, ATG.

    Installation

    To use this model, you must install the GLiNER Python library: !pip install gliner

    Usage

    Once you've downloaded the GLiNER library, you can import the GLiNER class. You can then load this model using GLiNER.from_pretrained and predict entities with predict_entities.

    from gliner import GLiNER
    
    model = GLiNER.from_pretrained("urchade/gliner_base")
    
    text = """
    Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 offici...
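    """  # closing quotes added here; the example text above is truncated on the source page

    # A hedged continuation following the usage pattern described above
    # (the label set is illustrative):
    labels = ["person", "award", "date", "competitions", "teams"]
    entities = model.predict_entities(text, labels, threshold=0.5)
    for entity in entities:
        print(entity["text"], "=>", entity["label"])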
    
  15. cumcm_test

    • huggingface.co
    Cite
    sxj1024, cumcm_test [Dataset]. https://huggingface.co/datasets/sxj1024/cumcm_test
    Explore at:
    Authors
    sxj1024
    Description

    Download Dataset

    from datasets import load_dataset

    # 1. Specify the dataset's "repository ID".
    #    Replace "your-username/your-dataset-name" with the actual ID of the dataset you want to download.
    repo_id = "sxj1024/cumcm_test"

    # 2. Call load_dataset().
    #    This will automatically download the data from the Hub (if not cached locally),
    #    then load it into memory (or in streaming mode).
    dataset = load_dataset(repo_id)

    # 3. View and use the dataset.
    print(dataset)

    You can access… See the full description on the dataset page: https://huggingface.co/datasets/sxj1024/cumcm_test.

  16. RemBERT PyTorch

    • kaggle.com
    zip
    Updated Aug 24, 2021
    Cite
    Nicholas Broad (2021). RemBERT PyTorch [Dataset]. https://www.kaggle.com/nbroad/remBERT-pt
    Explore at:
    Available download formats: zip (2143586380 bytes)
    Dataset updated
    Aug 24, 2021
    Authors
    Nicholas Broad
    Description

    REQUIRES transformers>=4.10.0

    Use this dataset and run !pip install -U --no-build-isolation --no-deps ../input/transformers-master/ -qq, or run !pip install -U transformers

    RemBERT (for classification)

    Pretrained RemBERT model on 110 languages using a masked language modeling (MLM) objective. It was introduced in the paper Rethinking embedding coupling in pre-trained language models. A direct export of the model checkpoint was first made available in this repository. This version of the checkpoint is lightweight since it is meant to be finetuned for classification and excludes the output embedding weights.

    Model description

    RemBERT's main difference with mBERT is that the input and output embeddings are not tied. Instead, RemBERT uses small input embeddings and larger output embeddings. This makes the model more efficient since the output embeddings are discarded during fine-tuning. It is also more accurate, especially when reinvesting the input embeddings' parameters into the core model, as is done on RemBERT.

    Intended uses & limitations

    You should fine-tune this model for your downstream task. It is meant to be a general-purpose model, similar to mBERT. In our paper, we successfully applied this model to tasks such as classification, question answering, NER, and POS-tagging. For tasks such as text generation, you should look at models like GPT-2.

    Training data

    The RemBERT model was pretrained on multilingual Wikipedia data over 110 languages. The full language list is in this repository:

    https://huggingface.co/google/rembert
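
    A minimal classification-setup sketch, loading from the HF hub ID above (in a Kaggle notebook, substitute the attached dataset path; num_labels is illustrative):

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Requires transformers>=4.10.0, as noted above.
    tokenizer = AutoTokenizer.from_pretrained("google/rembert")
    model = AutoModelForSequenceClassification.from_pretrained("google/rembert", num_labels=2)

    inputs = tokenizer("RemBERT decouples input and output embeddings.", return_tensors="pt")
    logits = model(**inputs).logits  # untrained head; fine-tune before use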

  17. CHARM

    • huggingface.co
    Updated Sep 26, 2025
    Cite
    Yuze He (2025). CHARM [Dataset]. https://huggingface.co/datasets/hyz317/CHARM
    Explore at:
    Dataset updated
    Sep 26, 2025
    Authors
    Yuze He
    License

    GPL-3.0: https://choosealicense.com/licenses/gpl-3.0/

    Description

    CHARM

    📃 Paper • 💻 [Github Repo] • 🌐 Project Page
    

    This repository contains the test dataset presented in the paper CHARM: Control-point-based 3D Anime Hairstyle Auto-Regressive Modeling. CHARM is a novel parametric representation and generative framework for anime hairstyle modeling.

      Usage
    

    You can download the files directly from this repository or use the huggingface_hub library:

    from huggingface_hub import hf_hub_download, list_repo_files

    Get list… See the full description on the dataset page: https://huggingface.co/datasets/hyz317/CHARM.
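
    The snippet above is truncated; a hedged sketch of listing the repo files with the imported helpers:

    from huggingface_hub import list_repo_files

    # List all files in the dataset repo before downloading specific ones.
    files = list_repo_files("hyz317/CHARM", repo_type="dataset")
    print(files)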

  18. Arxiver Dataset

    • kaggle.com
    • huggingface.co
    zip
    Updated Nov 4, 2024
    Cite
    Saumya Gupta (2024). Arxiver Dataset [Dataset]. https://www.kaggle.com/datasets/saumyagupta2025/arxiver-dataset/data
    Explore at:
    Available download formats: zip (873656728 bytes)
    Dataset updated
    Nov 4, 2024
    Authors
    Saumya Gupta
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Arxiver Dataset

    Arxiver consists of 63,357 arXiv papers converted to multi-markdown (.mmd) format. Our dataset includes original arXiv article IDs, titles, abstracts, authors, publication dates, URLs and corresponding markdown files published between January 2023 and October 2023.

    We hope our dataset will be useful for various applications such as semantic search, domain-specific language modeling, question answering, and summarization.

    Curation

    The Arxiver dataset is created using a neural OCR - Nougat. After OCR processing, we apply custom text processing steps to refine the data. This includes extracting author information, removing reference sections, and performing additional cleaning and formatting. Please refer to our GitHub repo for details.

    References

    The original articles are maintained by arXiv and copyrighted to the original authors; please refer to the arXiv license information page for details. We release our dataset with a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0) license. If you use this dataset in your research or project, please cite it as follows:

    @misc{acar_arxiver2024,
     author = {Alican Acar, Alara Dirik, Muhammet Hatipoglu},
     title = {ArXiver},
     year = {2024},
     publisher = {Hugging Face},
     howpublished = {\url{https://huggingface.co/datasets/neuralwork/arxiver}}
    }
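
    A loading sketch with the datasets library, using the repo ID from the citation (the column names follow the description above but are assumptions; check the dataset card):

    from datasets import load_dataset

    ds = load_dataset("neuralwork/arxiver", split="train")
    row = ds[0]
    print(row["title"])          # column names assumed from the description above
    print(row["abstract"][:200])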
    
  19. repo_branches

    • huggingface.co
    Cite
    git2vec, repo_branches [Dataset]. https://huggingface.co/datasets/git2vec/repo_branches
    Explore at:
    Dataset authored and provided by
    git2vec
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description


    This dataset captures branch creation history across all GitHub repositories. Each row represents a unique combination of repository ID and branch name, with the timestamp of the first observed push to that branch. This enables analysis of branching strategies, feature branch lifecycles, and development workflow patterns.

      Schema
    

    Column | Type | Description
    repo_id | int64 | GitHub's internal repository identifier
    branch_name | string | Name of the branch… See the full description on the dataset page: https://huggingface.co/datasets/git2vec/repo_branches.

  20. bsd100-set5-set14

    • huggingface.co
    Updated Jul 28, 2025
    Cite
    keanteng (2025). bsd100-set5-set14 [Dataset]. https://huggingface.co/datasets/keanteng/bsd100-set5-set14
    Explore at:
    Dataset updated
    Jul 28, 2025
    Authors
    keanteng
    License

    AGPL-3.0: https://choosealicense.com/licenses/agpl-3.0/

    Description

    This repo contains BSD100, Set5 and Set14 for a super-resolution evaluation study. To access a zipped file:

    from huggingface_hub import hf_hub_download

    # Replace with the actual repository ID and filename
    repo_id = "keanteng/bsd100-set5-set14"
    filename = "BSD100.zip"  # or Set5.zip and Set14.zip

    # repo_type="dataset" is required when downloading from a dataset repo
    local_filepath = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
    print(f"File downloaded to: {local_filepath}")
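
    A follow-up sketch for extracting the downloaded archive (the destination folder is illustrative):

    import zipfile

    # Extract the archive into a local folder for evaluation.
    with zipfile.ZipFile(local_filepath) as zf:
        zf.extractall("BSD100")  # illustrative destination folder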
