100+ datasets found
  1. Huggingface Hub Permissible models and datasets

    • kaggle.com
    zip
    Updated Dec 26, 2023
    Cite
    Dheeraj M Pai (2023). Huggingface Hub Permissible models and datasets [Dataset]. https://www.kaggle.com/datasets/dheerajmpai/huggingface-hub-permissible-models-and-datasets
    Explore at:
    Available download formats: zip (34761279 bytes)
    Dataset updated
    Dec 26, 2023
    Authors
    Dheeraj M Pai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Huggingface Hub: Models, Datasets, and Spaces

    Dataset Overview

    This comprehensive dataset contains detailed information about all the models, datasets, and spaces available on the Huggingface Hub. It is an essential resource for anyone looking to explore the extensive range of tools and datasets available for machine learning and AI research.

    Key Features

    • Comprehensive Data: Includes exhaustive details on all models, datasets, and spaces from the Huggingface Hub.
    • Permissible Models: A specialized subset is provided in a separate CSV file, focusing exclusively on models that are permissible for use.
    • Regularly Updated: The dataset is refreshed weekly to ensure the latest information is always available.

    Last Update

    • Date: December 26, 2023

    Update Frequency

    • Frequency: Weekly

    Dataset Contents

    1. Models: Detailed listings of all models available on Huggingface Hub.
    2. Datasets: Comprehensive information on datasets hosted on the Hub.
    3. Spaces: An overview of the different spaces and their functionalities.
    4. Permissible Models CSV: A smaller, curated list of models that are cleared for use.

    Usage

    This dataset is ideal for researchers, developers, and AI enthusiasts who are looking for a one-stop repository of models, datasets, and spaces from the Huggingface Hub. It provides a holistic view and simplifies the task of finding the right tools for various machine learning and AI projects.

    Note: This dataset is not officially affiliated with or endorsed by the Huggingface organization.

  2. TabPalooza

    • huggingface.co
    Updated Sep 27, 2025
    Cite
    xyz987 (2025). TabPalooza [Dataset]. https://huggingface.co/datasets/data-hub-xyz987/TabPalooza
    Explore at:
    Dataset updated
    Sep 27, 2025
    Dataset authored and provided by
    xyz987
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    data-hub-xyz987/TabPalooza dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. odaigen_hindi_pre_trained_sp

    • huggingface.co
    Updated Sep 22, 2025
    Cite
    Hindi-data-hub (2025). odaigen_hindi_pre_trained_sp [Dataset]. https://huggingface.co/datasets/Hindi-data-hub/odaigen_hindi_pre_trained_sp
    Explore at:
    Dataset updated
    Sep 22, 2025
    Dataset authored and provided by
    Hindi-data-hub
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Hindi Language Pre-Trained LLM Datasets Overview

    Welcome to the Hindi Language Pre-Training Datasets repository! This README provides a comprehensive overview of various pre-training datasets available for Hindi, including essential details such as licenses, sources, and statistical information. These datasets are invaluable resources for training and fine-tuning large language models (LLMs) for a wide range of natural language processing (NLP) tasks.

    Data Overview and Statistics

    This README… See the full description on the dataset page: https://huggingface.co/datasets/Hindi-data-hub/odaigen_hindi_pre_trained_sp.

  4. Hugging Face Models

    • kaggle.com
    zip
    Updated Nov 28, 2023
    Cite
    A T M Ragib Raihan (2023). Hugging Face Models [Dataset]. https://www.kaggle.com/datasets/atmragib/hugging-face-models/code
    Explore at:
    Available download formats: zip (13652285 bytes)
    Dataset updated
    Nov 28, 2023
    Authors
    A T M Ragib Raihan
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Context

    The Hugging Face Hub hosts many models for a variety of machine learning tasks. Models are stored in repositories, so they benefit from all the features possessed by every repo on the Hugging Face Hub.

    Data Source Link: huggingface.co/models

    Attribute Information

    Variable: Description
    • model_id
    • pipeline: There are 40 pipelines in total. To learn more, read: Hugging Face Pipeline
    • downloads
    • likes
    • author_id
    • author_name
    • author_type: user or organization
    • author_isPro: paid user or organization
    • lastModified: from 2014-08-10 to 2023-11-27

  5. Huggingface Modelhub

    • kaggle.com
    zip
    Updated Jun 19, 2021
    Cite
    Kartik Godawat (2021). Huggingface Modelhub [Dataset]. https://www.kaggle.com/crazydiv/huggingface-modelhub
    Explore at:
    Available download formats: zip (2274876 bytes)
    Dataset updated
    Jun 19, 2021
    Authors
    Kartik Godawat
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset containing metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15th and 20th June 2021.

    The dataset was generated using the huggingface_hub APIs provided by the HuggingFace team.

    Update v3:

    • Added Downloads last month metric
    • Added library name

    Contents:

    • huggingface_models.csv: Primary file containing metadata such as model name, tags, last modified time, and filenames.
    • huggingface_modelcard_readme.csv: Detailed file containing the README.md contents, if available, for a particular model. Content is in markdown format. The modelId column joins the two files.

    huggingface_models.csv:

    • modelId: ID of the model as present on the HF website
    • lastModified: Time when this model was last modified
    • tags: Tags associated with the model (provided by the maintainer)
    • pipeline_tag: If it exists, denotes which pipeline this model could be used with
    • files: List of available files in the model repo
    • publishedBy: Custom column derived from modelId, specifying who published this model
    • downloads_last_month: Number of times the model has been downloaded in the last month
    • library: Name of the library the model belongs to, e.g. transformers, spacy, timm

    huggingface_modelcard_readme.csv:

    • modelId: ID of the model as available on the HF website
    • modelCard: README contents of a model (referred to as modelCard in the HuggingFace ecosystem). It contains useful information on how the model was trained, benchmarks, and author notes.

    Inspiration:

    The idea of analyzing publicly available models on HuggingFace struck me while I was attending a live session of the amazing transformers course by @LysandreJik. Soon after, I tweeted the team and asked for permission to create such a dataset. Special shoutout to @osanseviero for encouraging and pointing me in the right direction.

    This is my first dataset upload on Kaggle. I hope you like it. :)
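
    The description above says the modelId column joins the two CSV files. A minimal sketch of that join with pandas follows; the rows are made up for illustration, and only the column names (modelId, pipeline_tag, downloads_last_month, modelCard) come from the description.

```python
import pandas as pd

# Tiny stand-ins for the two CSV files; every value here is hypothetical.
models = pd.DataFrame({
    "modelId": ["org-a/model-x", "user-b/model-y"],
    "pipeline_tag": ["text-classification", None],
    "downloads_last_month": [1200, 35],
})
readmes = pd.DataFrame({
    "modelId": ["org-a/model-x"],
    "modelCard": ["# Model X\nTrained on ..."],
})

# A left join keeps every model, including those without a README.
merged = models.merge(readmes, on="modelId", how="left")

# True where a model card was found for the model.
has_card = merged["modelCard"].notna()
```

    With the real files, the two frames would presumably come from pd.read_csv on huggingface_models.csv and huggingface_modelcard_readme.csv instead of the in-memory stand-ins.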

  6. On the Suitability of Hugging Face Hub for Empirical Studies

    • zenodo.org
    • recerca.uoc.edu
    Updated Apr 26, 2024
    Cite
    Adem Ait Fonollà; Javier Luis Cánovas Izquierdo; Jordi Cabot (2024). On the Suitability of Hugging Face Hub for Empirical Studies [Dataset]. http://doi.org/10.5281/zenodo.11072131
    Explore at:
    Dataset updated
    Apr 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Adem Ait Fonollà; Javier Luis Cánovas Izquierdo; Jordi Cabot
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This repository contains the data used in the paper titled "On the Suitability of Hugging Face Hub for Empirical Studies". For RQ1 we share the survey responses and the interview transcriptions, while for RQ2 we share the link to the repository where the data is hosted.

    • For RQ1, the survey responses are in an Excel file titled "Survey Responses Public.xlsx". The transcriptions of each interview are in a Word file titled "Transcription-intvw-slot-N.docx".
    • For RQ2, we collected the data of the HFCommunity release of October 2023. It can be found on its website. We also share the DOI of the dump.

  7. huggingface-hub-classes

    • huggingface.co
    Updated Jan 15, 2025
    + more versions
    Cite
    Niels Rogge (2025). huggingface-hub-classes [Dataset]. https://huggingface.co/datasets/nielsr/huggingface-hub-classes
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 15, 2025
    Authors
    Niels Rogge
    Description

    nielsr/huggingface-hub-classes dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. Huggingface Hub 0.27.0

    • kaggle.com
    zip
    Updated Dec 26, 2024
    Cite
    Aishik Rakshit (2024). Huggingface Hub 0.27.0 [Dataset]. https://www.kaggle.com/datasets/aishikai/huggingface-hub-0-27-0
    Explore at:
    Available download formats: zip (438787 bytes)
    Dataset updated
    Dec 26, 2024
    Authors
    Aishik Rakshit
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Aishik Rakshit

    Released under Apache 2.0

    Contents

  9. huggingface-hub-docs-chunks-test

    • huggingface.co
    Updated Jan 13, 2025
    + more versions
    Cite
    Niels Rogge (2025). huggingface-hub-docs-chunks-test [Dataset]. https://huggingface.co/datasets/nielsr/huggingface-hub-docs-chunks-test
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 13, 2025
    Authors
    Niels Rogge
    Description

    nielsr/huggingface-hub-docs-chunks-test dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span...

    • data.niaid.nih.gov
    Updated Mar 20, 2024
    Cite
    Moncla, Ludovic; Vigier, Denis; McDonough, Katherine (2024). GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span Categorization Annotations of Diderot & d'Alembert's Encyclopédie [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10530177
    Explore at:
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Lancaster University
    Laboratoire d'Informatique en Images et Systèmes d'Information
    Interactions, Corpus, Apprentissages, Représentations
    Authors
    Moncla, Ludovic; Vigier, Denis; McDonough, Katherine
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.

    The dataset is available in the following formats:

    JSONL format provided by Prodigy

    binary spaCy format (ready to use with the spaCy train pipeline)

    The gold standard dataset is composed of 2,200 paragraphs drawn from 2,001 randomly selected Encyclopédie entries. All paragraphs were written in 18th-century French.

    The spans/entities were labeled by the project team, with pre-labelling by early machine learning models used to speed up the process. A train/val/test split was used. The validation and test sets are composed of 200 paragraphs each: 100 classified under 'Géographie' and 100 from another knowledge domain. The datasets have the following breakdown of tokens and spans/entities.

    Tagset

    NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity) including natural features, e.g. ville, la rivière, royaume.

    NP-Spatial: a proper noun identifying the name of a place (spatial named entities), e.g. France, Paris, la Chine.

    ENE-Spatial: nested spatial entity, e.g. ville de France, royaume de Naples, la mer Baltique.

    Relation: spatial relation, e.g. dans, sur, à 10 lieues de.

    Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.

    NC-Person: a common noun that identifies a person (nominal person entity), e.g. roi, l'empereur, les auteurs.

    NP-Person: a proper noun identifying the name of a person (person named entities), e.g. Louis XIV, Pline.

    ENE-Person: nested people entity, e.g. le czar Pierre, roi de Macédoine.

    NP-Misc: a proper noun identifying entities not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique

    ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.

    Head: entry name

    Domain-Mark: words indicating the knowledge domain (usually after the head and in parentheses), e.g. Géographie, Geog., en Anatomie.
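
    To make the tagset concrete, here is a small sketch that counts span labels in JSONL records. It assumes a Prodigy-style layout in which each line is a JSON object with a "text" field and a "spans" list of {"start", "end", "label"} entries; the two records are made up and the field names should be checked against the actual files.

```python
import json
from collections import Counter

# Two made-up records in an assumed Prodigy-style layout.
sample_jsonl = """\
{"text": "Paris, ville de France", "spans": [{"start": 0, "end": 5, "label": "NP-Spatial"}, {"start": 7, "end": 22, "label": "ENE-Spatial"}]}
{"text": "Louis XIV, roi", "spans": [{"start": 0, "end": 9, "label": "NP-Person"}, {"start": 11, "end": 14, "label": "NC-Person"}]}
"""

label_counts = Counter()
span_texts = []
for line in sample_jsonl.splitlines():
    record = json.loads(line)
    for span in record.get("spans", []):
        # Tally the label and recover the annotated substring by character offsets.
        label_counts[span["label"]] += 1
        span_texts.append(record["text"][span["start"]:span["end"]])
```

    For the real dataset, the lines would be read from the JSONL file rather than from an inline string.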

    HuggingFace

    The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA

    spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries

    This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.

    Acknowledgement

    The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy the ARTFL Encyclopédie Project, University of Chicago.

  11. Labelled Corpus - Political Bias (Hugging Face)

    • kaggle.com
    zip
    Updated May 8, 2024
    Cite
    Suraj Karakulath (2024). Labelled Corpus - Political Bias (Hugging Face) [Dataset]. https://www.kaggle.com/datasets/surajkarakulath/labelled-corpus-political-bias-hugging-face
    Explore at:
    Available download formats: zip (50133530 bytes)
    Dataset updated
    May 8, 2024
    Authors
    Suraj Karakulath
    Description

    This is a labeled corpus dataset of article text with corresponding political bias obtained from Huggingface. It contains 17,362 articles labeled left, right, or center by the editors of allsides.com. Articles were manually annotated by news editors who were attempting to select representative articles from the left, right and center of each article topic.

  12. veg-data-hub

    • huggingface.co
    Updated Apr 7, 2025
    Cite
    Subhajit Chatterjee (2025). veg-data-hub [Dataset]. https://huggingface.co/datasets/darthraider/veg-data-hub
    Explore at:
    Dataset updated
    Apr 7, 2025
    Authors
    Subhajit Chatterjee
    Description

    This dataset contains images of vegetables (Bell Pepper, Brinjal, Chile Pepper, Cucumber, New Mexico Green Chile, Pumpkin, Tomato) in various conditions, including Damaged, Dried, Old, Ripe, Unripe, Diseased, Flower, and Rotten states. It is split into training and testing subsets.

  13. OpenOrca

    • kaggle.com
    • opendatalab.com
    • +1 more
    zip
    Updated Nov 22, 2023
    Cite
    The Devastator (2023). OpenOrca [Dataset]. https://www.kaggle.com/datasets/thedevastator/open-orca-augmented-flan-dataset/versions/2
    Explore at:
    Available download formats: zip (2548102631 bytes)
    Dataset updated
    Nov 22, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Open-Orca Augmented FLAN Dataset

    Unlocking Advanced Language Understanding and ML Model Performance

    By Huggingface Hub [source]

    About this dataset

    The Open-Orca Augmented FLAN Collection is a revolutionary dataset that unlocks new levels of language understanding and machine learning model performance. This dataset was created to support research on natural language processing, machine learning models, and language understanding through leveraging the power of reasoning trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences in a more robust way than ever before, this dataset provides researchers expanded opportunities for furthering the progress of linguistics research. With its unique combination of features including system prompts, questions from users and responses from systems, this dataset opens up exciting possibilities for deeper exploration into the cutting edge concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance - explore Open-Orca Augmented FLAN Collection today!

    How to use the dataset

    This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.

    Getting Started

    The first step is to download the dataset from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or in cloud storage. Once you have downloaded the dataset, launch the Jupyter Notebook or Google Colab environment in which you want to work with it.

    Exploring & Preprocessing Data: To get a better understanding of the features in this dataset, load it into a pandas DataFrame as shown below. You can use other libraries as needed:

    import pandas as pd  # pandas loads tabular data into DataFrames

    df = pd.read_csv('train.csv')  # load the train CSV file into a DataFrame

    df[['system_prompt', 'question', 'response']].head()  # view the first 5 rows of these three columns
    

    After importing, check each feature with basic descriptive statistics. For example, value_counts shows how often each distinct value occurs in a column. The command below shows the counts of the most frequent values in the system_prompt column of the train CSV file:

     df['system_prompt'].value_counts().head()  # count of each distinct value in the 'system_prompt' column
     Output:
     User says hello guys: 587 times
     System asks How are you?: 555 times
     User says I am doing good: 487 times
     ...and so on
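
    The text mentions groupby while the command uses value_counts. The sketch below, on a few made-up rows that reuse the column names from the example above, suggests that the two approaches give the same per-value counts.

```python
import pandas as pd

# Made-up rows; only the column names come from the example above.
df = pd.DataFrame({
    "system_prompt": ["You are helpful.", "You are helpful.", "Answer briefly."],
    "question": ["q1", "q2", "q3"],
    "response": ["r1", "r2", "r3"],
})

# Count occurrences of each distinct system prompt, most frequent first.
counts = df["system_prompt"].value_counts()

# The same counts via groupby, sorted to match the value_counts ordering.
grouped = df.groupby("system_prompt").size().sort_values(ascending=False)
```

    value_counts is the shorter form for a single column; groupby becomes useful once counts are needed per combination of several columns.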
    

    Data Transformation: After inspecting and exploring the features, you may want to change the data to suit your needs before training models on it.
    Common transformation steps include removing punctuation marks: since punctuation may not add value to downstream computation, it can be removed with a regex replace such as .replace('[^A-Za-z ]+', '').
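
    A minimal sketch of the punctuation-removal step, assuming the text lives in a pandas string column. The sample strings and the question_clean column name are made up; the pattern keeps only ASCII letters and spaces, so it also drops digits and accented characters, which may or may not be desirable.

```python
import pandas as pd

# Made-up sample text; the 'question' column name follows the example above.
df = pd.DataFrame({"question": ["Hello, world!", "What's up?"]})

# Drop every run of characters that is not an ASCII letter or a space.
df["question_clean"] = df["question"].str.replace(r"[^A-Za-z ]+", "", regex=True)
```

    Passing regex=True matters here: without it, recent pandas versions treat the pattern as a literal string and nothing would be removed.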

    Research Ideas

    • Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
    • Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
    • Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](ht...

  14. Community-Driven Model Service Platform Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 9, 2025
    Cite
    Market Report Analytics (2025). Community-Driven Model Service Platform Report [Dataset]. https://www.marketreportanalytics.com/reports/community-driven-model-service-platform-73131
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Apr 9, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming Community-Driven Model Service Platform market! This comprehensive analysis reveals a CAGR of 10.1%, driven by AI adoption and open-source innovation. Explore market size, trends, segmentation (cloud, on-premises, adult, children), key players (Kaggle, GitHub, Hugging Face), and regional insights. Learn more about this rapidly expanding sector.

  15. hub_models_with_base_model_info

    • huggingface.co
    Updated Dec 3, 2023
    Cite
    Librarian Bots (2023). hub_models_with_base_model_info [Dataset]. https://huggingface.co/datasets/librarian-bots/hub_models_with_base_model_info
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Dec 3, 2023
    Dataset authored and provided by
    Librarian Bots
    Description

    Dataset Card for Hugging Face Hub Models with Base Model Metadata

      Dataset Details
    

    This dataset contains a subset of possible metadata for models hosted on the Hugging Face Hub. All of these models contain base_model metadata i.e. information about the model used for fine-tuning. This data can be used for creating network graphs showing links between models on the Hub.
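
    As a rough illustration of the network-graph idea, the sketch below builds a directed graph from (model, base_model) pairs and walks it. The pairs and model names are made up, and how the real dataset exposes the model ID and base_model fields is an assumption to verify against the dataset page.

```python
from collections import defaultdict

# Made-up (model, base_model) pairs; real rows would come from the dataset.
rows = [
    ("user-a/bert-finetuned-sst2", "bert-base-uncased"),
    ("user-b/bert-finetuned-ner", "bert-base-uncased"),
    ("user-c/sst2-distilled", "user-a/bert-finetuned-sst2"),
]

# Directed edges point from a base model to the models fine-tuned from it.
children = defaultdict(list)
for model, base in rows:
    children[base].append(model)


def descendants(root):
    """Return every model reachable from root by following fine-tuning links."""
    found, stack, seen = [], [root], {root}
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:  # guards against cycles in noisy metadata
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found
```

    Calling descendants("bert-base-uncased") on these rows collects all three fine-tuned models, including the second-generation one.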

      Dataset Description
    

    Curated by: [More Information Needed] Funded by [optional]: [More… See the full description on the dataset page: https://huggingface.co/datasets/librarian-bots/hub_models_with_base_model_info.

  16. webui-all

    • huggingface.co
    Updated Nov 1, 2024
    + more versions
    Cite
    Big Lab (2024). webui-all [Dataset]. https://huggingface.co/datasets/biglab/webui-all
    Explore at:
    Dataset updated
    Nov 1, 2024
    Dataset authored and provided by
    Big Lab
    License

    https://choosealicense.com/licenses/other/

    Description

    This data accompanies the WebUI project (https://dl.acm.org/doi/abs/10.1145/3544548.3581158). For more information, check out the project website: https://uimodeling.github.io/

    To download this dataset, you need to install the huggingface-hub package:

        pip install huggingface-hub

    Use snapshot_download:

        from huggingface_hub import snapshot_download
        snapshot_download(repo_id="biglab/webui-all", repo_type="dataset")

    IMPORTANT

    Before downloading and using, please review the copyright info here:… See the full description on the dataset page: https://huggingface.co/datasets/biglab/webui-all.

  17. dataset-formats

    • huggingface.co
    Updated Apr 17, 2024
    Cite
    Sylvain Lesage (2024). dataset-formats [Dataset]. https://huggingface.co/datasets/severo/dataset-formats
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 17, 2024
    Authors
    Sylvain Lesage
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Datasets formats on the Hugging Face Hub

    Every day, we check the proportion of data formats among the datasets published on Hugging Face. The data is published at https://huggingface.co/datasets/severo/dataset-formats. The count includes all the datasets supported by the dataset viewer, and only for the supported formats. By dataset format, we refer to the native format of the data. All the supported datasets are also available as Parquet. See… See the full description on the dataset page: https://huggingface.co/datasets/severo/dataset-formats.

  18. dev-push-to-hub

    • huggingface.co
    Updated Aug 31, 2021
    Cite
    Ashim Mahara (2021). dev-push-to-hub [Dataset]. https://huggingface.co/datasets/ashim/dev-push-to-hub
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 31, 2021
    Authors
    Ashim Mahara
    Description

    ashim/dev-push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. isha-call-center-qa-data

    • huggingface.co
    Updated Jul 31, 2023
    Cite
    Natesh Bhat (2023). isha-call-center-qa-data [Dataset]. https://huggingface.co/datasets/nateshmbhat/isha-call-center-qa-data
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 31, 2023
    Authors
    Natesh Bhat
    Description

    nateshmbhat/isha-call-center-qa-data dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Explore at:
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
