MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This comprehensive dataset contains detailed information about all the models, datasets, and spaces available on the Huggingface Hub. It is an essential resource for anyone looking to explore the extensive range of tools and datasets available for machine learning and AI research.
This dataset is ideal for researchers, developers, and AI enthusiasts who are looking for a one-stop repository of models, datasets, and spaces from the Huggingface Hub. It provides a holistic view and simplifies the task of finding the right tools for various machine learning and AI projects.
Note: This dataset is not officially affiliated with or endorsed by the Huggingface organization.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
data-hub-xyz987/TabPalooza dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Hindi Language Pre-Trained LLM Datasets Overview
Welcome to the Hindi Language Pre-Training Datasets repository! This README provides a comprehensive overview of various pre-training datasets available for Hindi, including essential details such as licenses, sources, and statistical information. These datasets are invaluable resources for training and fine-tuning large language models (LLMs) for a wide range of natural language processing (NLP) tasks.
Data Overview and Statistics
This README… See the full description on the dataset page: https://huggingface.co/datasets/Hindi-data-hub/odaigen_hindi_pre_trained_sp.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
The Hugging Face Hub hosts many models for a variety of machine learning tasks. Models are stored in repositories, so they benefit from all the features available to every repository on the Hugging Face Hub.
| Variable | Description |
|---|---|
| model_id | Identifier of the model on the Hugging Face Hub |
| pipeline | Pipeline tag of the model; there are 40 pipelines in total. To learn more, read: Hugging Face Pipeline |
| downloads | Number of downloads |
| likes | Number of likes |
| author_id | Identifier of the model's author |
| author_name | Name of the model's author |
| author_type | user or organization |
| author_isPro | Whether the author is a paid (Pro) user or organization |
| lastModified | Date of last modification, from 2014-08-10 to 2023-11-27 |
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Dataset containing metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.
The dataset was generated using the huggingface_hub API provided by the Hugging Face team.
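As a rough, hedged sketch of how such metadata can be collected with the huggingface_hub client (the exact fields and calls the author used are not documented here):

```python
# Hedged sketch: enumerate public models on the Hub and collect basic metadata.
# Attribute availability can vary by huggingface_hub version; verify before use.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(limit=10):  # drop limit to iterate over all models
    print(model.id, model.downloads, model.likes, model.pipeline_tag)
```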
This is my first dataset upload on Kaggle. I hope you like it. :)
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains the data used in the paper titled "On the Suitability of Hugging Face Hub for Empirical Studies". For RQ1 we share the survey responses and the interview transcription, while for RQ2 we share the link to the repository where the data is hosted.
nielsr/huggingface-hub-classes dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by Aishik Rakshit
Released under Apache 2.0
nielsr/huggingface-hub-docs-chunks-test dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.
The dataset is available in the following formats:
- JSONL format provided by Prodigy
- binary spaCy format (ready to use with the spaCy train pipeline, as sketched below)
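As a hedged sketch (the config and file names are placeholders; use the repository's actual files), the binary .spacy files plug into spaCy's standard training entry point:

```python
# Hedged sketch: train a spaCy pipeline from the binary .spacy files.
# config.cfg and the file paths are assumptions; substitute the repository's files.
from spacy.cli.train import train

train("config.cfg", overrides={"paths.train": "./train.spacy", "paths.dev": "./dev.spacy"})
```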
The Gold Standard dataset is composed of 2,200 paragraphs drawn from 2,001 randomly selected Encyclopédie entries. All paragraphs were written in 18th-century French.
The spans/entities were labelled by the project team, aided by pre-labelling with early machine-learning models to speed up the labelling process. A train/val/test split was used. The validation and test sets are composed of 200 paragraphs each: 100 classified under 'Géographie' and 100 from another knowledge domain. The datasets have the following breakdown of tokens and spans/entities.
Tagset
NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity) including natural features, e.g. ville, la rivière, royaume.
NP-Spatial: a proper noun identifying the name of a place (spatial named entities), e.g. France, Paris, la Chine.
ENE-Spatial: nested spatial entity, e.g. ville de France, royaume de Naples, la mer Baltique.
Relation: spatial relation, e.g. dans, sur, à 10 lieues de.
Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.
NC-Person: a common noun that identifies a person (nominal person entity), e.g. roi, l'empereur, les auteurs.
NP-Person: a proper noun identifying the name of a person (person named entities), e.g. Louis XIV, Pline.
ENE-Person: nested people entity, e.g. le czar Pierre, roi de Macédoine.
NP-Misc: a proper noun identifying entities not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique.
ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.
Head: entry name
Domain-Mark: words indicating the knowledge domain (usually after the head and between parenthesis), e.g. Géographie, Geog., en Anatomie.
HuggingFace
The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA
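As an illustrative, hedged sketch, the dataset can be pulled from the Hub with the datasets library (the available splits and configs are an assumption; verify on the dataset page):

```python
# Hedged sketch: load the GeoEdDA gold standard from the Hugging Face Hub.
from datasets import load_dataset

geo_edda = load_dataset("GEODE/GeoEddA")
print(geo_edda)  # inspect the available splits and features
```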
spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries
This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.
Acknowledgement
The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.
This is a labeled corpus dataset of article text with corresponding political bias obtained from Hugging Face. It contains 17,362 articles labeled left, right, or center by the editors of allsides.com. Articles were manually annotated by news editors attempting to select representative articles from the left, right, and center of each article topic.
This dataset contains images of vegetables (Bell Pepper, Brinjal, Chile Pepper, Cucumber, New Mexico Green Chile, Pumpkin, Tomato) in various conditions, including Damaged, Dried, Old, Diseased, Flower, Ripe, Rotten, and Unripe states. It is split into training and testing subsets.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
The Open-Orca Augmented FLAN Collection is a dataset created to support research on natural language processing, machine learning models, and language understanding by leveraging reasoning-trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences more robustly than before, this dataset gives researchers expanded opportunities to further linguistics research. With its combination of system prompts, questions from users, and responses from systems, it opens up exciting possibilities for deeper exploration of the concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance: explore the Open-Orca Augmented FLAN Collection today!
This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.
Getting Started
The first step is to download the dataset from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or cloud storage. Once you have downloaded the dataset, launch Jupyter Notebook or Google Colab, whichever you prefer to work with.
Exploring & Preprocessing Data
To get a better understanding of the features in this dataset, import them into a pandas DataFrame as shown below. You can use other libraries as needed:

```python
import pandas as pd  # library used for loading datasets into Python

df = pd.read_csv('train.csv')  # load the train CSV file into a DataFrame
df[['system_prompt', 'question', 'response']].head()  # view the top 5 rows of these columns
```

After importing, inspect each feature using basic descriptive statistics, such as pandas groupby or value_counts statements, which give greater clarity over the values present in each feature. The command below shows the count of each element in the system_prompt column of the train CSV file:

```python
df['system_prompt'].value_counts().head()  # count of each element in 'system_prompt'
```

Example output:
User says hello guys: 587
System asks How are you?: 555
User says I am doing good: 487
...and so on

Data Transformation
After inspecting and exploring the features, you may want to make certain changes that best suit your needs before training modeling algorithms on this dataset. Common transformation steps include removing punctuation marks: since punctuation may not add value to computation, it can be stripped with a regex, e.g. df['question'].str.replace('[^A-Za-z ]+', '', regex=True).
- Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
- Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
- Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](ht...
https://www.marketreportanalytics.com/privacy-policy
Discover the booming Community-Driven Model Service Platform market! This comprehensive analysis reveals a CAGR of 10.1%, driven by AI adoption and open-source innovation. Explore market size, trends, segmentation (cloud, on-premises, adult, children), key players (Kaggle, GitHub, Hugging Face), and regional insights. Learn more about this rapidly expanding sector.
Dataset Card for Hugging Face Hub Models with Base Model Metadata
Dataset Details
This dataset contains a subset of possible metadata for models hosted on the Hugging Face Hub. All of these models contain base_model metadata, i.e. information about the model used as the basis for fine-tuning. This data can be used for creating network graphs showing links between models on the Hub, as sketched below.
Dataset Description
See the full description on the dataset page: https://huggingface.co/datasets/librarian-bots/hub_models_with_base_model_info.
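As an illustrative, hedged sketch (field names such as 'modelId' and 'base_model' are assumptions; check the dataset card for the actual schema), the metadata can be turned into a lineage graph:

```python
# Hedged sketch: build a directed graph linking base models to fine-tuned models.
# Field names are assumptions; verify against the dataset's actual columns.
from datasets import load_dataset
import networkx as nx

rows = load_dataset("librarian-bots/hub_models_with_base_model_info", split="train")
graph = nx.DiGraph()
for row in rows:
    base = row.get("base_model")
    if base:  # skip rows with missing base model info
        graph.add_edge(base, row["modelId"])  # edge: base model -> fine-tuned model
print(graph.number_of_nodes(), "models,", graph.number_of_edges(), "fine-tuning links")
```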
https://choosealicense.com/licenses/other/
This data accompanies the WebUI project (https://dl.acm.org/doi/abs/10.1145/3544548.3581158). For more information, check out the project website: https://uimodeling.github.io/. To download this dataset, you need to install the huggingface-hub package:

```
pip install huggingface-hub
```
Then use snapshot_download:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="biglab/webui-all", repo_type="dataset")
```
IMPORTANT
Before downloading and using, please review the copyright info here:… See the full description on the dataset page: https://huggingface.co/datasets/biglab/webui-all.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Datasets formats on the Hugging Face Hub
Every day, we check the proportion of data formats among the datasets published on Hugging Face. The data is published at https://huggingface.co/datasets/severo/dataset-formats. The count includes all the datasets supported by the dataset viewer, and covers only the supported formats. By dataset format, we refer to the native format of the data. All the supported datasets are also available as Parquet. See… See the full description on the dataset page: https://huggingface.co/datasets/severo/dataset-formats.
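As a hedged illustration (the split name and schema are assumptions; check the dataset page), the daily counts can be loaded directly:

```python
# Hedged sketch: load the daily format-proportion data.
from datasets import load_dataset

formats = load_dataset("severo/dataset-formats", split="train")
print(formats[0])  # inspect one day's record
```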
ashim/dev-push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community
nateshmbhat/isha-call-center-qa-data dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
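As a hedged sketch, a sample of the corpus can be streamed with the datasets library (the "sample-10BT" config name follows the dataset card's conventions; verify the available configs on the dataset page):

```python
# Hedged sketch: stream a few documents from a FineWeb sample without downloading it all.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
for i, doc in enumerate(fw):
    print(doc["text"][:100])  # preview the first 100 characters of each document
    if i >= 2:
        break
```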