Meta-Dataset is a large-scale few-shot learning benchmark consisting of multiple datasets with different data distributions. It does not restrict few-shot tasks to fixed numbers of ways and shots, making it a more realistic scenario. It covers 10 datasets from diverse domains:
- ILSVRC-2012 (the ImageNet dataset; natural images, 1000 categories)
- Omniglot (hand-written characters, 1623 classes)
- Aircraft (aircraft images, 100 classes)
- CUB-200-2011 (birds, 200 classes)
- Describable Textures (different kinds of texture images, 43 categories)
- Quick Draw (black-and-white sketches of 345 different categories)
- Fungi (a large dataset of mushrooms, 1500 categories)
- VGG Flower (flower images, 102 categories)
- Traffic Signs (German traffic-sign images, 43 classes)
- MSCOCO (images collected from Flickr, 80 classes)
All datasets except Traffic Signs and MSCOCO have training, validation, and test splits (roughly 70%/15%/15%). Traffic Signs and MSCOCO are reserved for testing only.
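To make the variable-way, variable-shot setup concrete, here is a purely illustrative sketch (not Meta-Dataset's official sampler) of drawing an episode whose number of classes and examples per class varies:

# Illustrative only: sample an episode with a random number of ways,
# and a random number of shots per class, as Meta-Dataset tasks allow.
import numpy as np

rng = np.random.default_rng(0)

def sample_episode(labels, max_ways=10, max_shots=5):
    classes = np.unique(labels)
    n_ways = rng.integers(2, min(max_ways, len(classes)) + 1)
    episode_classes = rng.choice(classes, size=n_ways, replace=False)
    support = []
    for c in episode_classes:
        idx = np.flatnonzero(labels == c)
        n_shots = rng.integers(1, max_shots + 1)  # shots vary per class too
        support.append(rng.choice(idx, size=min(n_shots, len(idx)), replace=False))
    return episode_classes, np.concatenate(support)

# Toy usage: 100 examples spread over 20 classes.
labels = rng.integers(0, 20, size=100)
ways, support_idx = sample_episode(labels)
print(f"{len(ways)}-way episode with {len(support_idx)} support examples")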
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The dataset contains two classes: shells and pebbles. It can be used for binary classification tasks that determine whether a given image shows a shell or a pebble. Cover image by wirestock on Freepik.
I thought it would be cool to create an app with a CV algorithm that could classify whether a given picture shows a shell or a pebble. The next time I visit a beach, I could use the app to help me collect either shells or pebbles. 😄
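A minimal transfer-learning sketch for such a classifier, assuming the images are arranged in one folder per class under a hypothetical data/ directory:

# Sketch of a shells-vs-pebbles binary classifier. The "data/" layout
# (one subfolder per class) is an assumption, not documented structure.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=(160, 160), batch_size=32,
    validation_split=0.2, subset="training", seed=42)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=(160, 160), batch_size=32,
    validation_split=0.2, subset="validation", seed=42)

base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pretrained backbone

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 input range
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # shell vs. pebble
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=5)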
Introducing Job Posting Datasets: Uncover labor market insights!
Elevate your recruitment strategies, forecast future labor industry trends, and unearth investment opportunities with Job Posting Datasets.
Job Posting Datasets Source:
Indeed: Access datasets from Indeed, a leading employment website known for its comprehensive job listings.
Glassdoor: Receive ready-to-use employee reviews, salary ranges, and job openings from Glassdoor.
StackShare: Access StackShare datasets to make data-driven technology decisions.
Job Posting Datasets provide meticulously acquired and parsed data, freeing you to focus on analysis. You'll receive clean, structured, ready-to-use job posting data, including job titles, company names, seniority levels, industries, locations, salaries, and employment types.
Choose your preferred dataset delivery options for convenience:
- Receive datasets in various formats, including CSV, JSON, and more.
- Opt for storage solutions such as AWS S3, Google Cloud Storage, and more.
- Customize data delivery frequency, whether one-time or on an agreed schedule.
Why Choose Oxylabs Job Posting Datasets:
Fresh and accurate data: Access clean and structured job posting datasets collected by our seasoned web scraping professionals, enabling you to dive into analysis.
Time and resource savings: Focus on data analysis and your core business objectives while we efficiently handle the data extraction process cost-effectively.
Customized solutions: Tailor our approach to your business needs, ensuring your goals are met.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA best practices.
Pricing Options:
Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Effortlessly access fresh job posting data with Oxylabs Job Posting Datasets.
https://choosealicense.com/licenses/odbl/
Dataset Card
This dataset contains a single Hugging Face split, named 'all_samples'. The samples contain a single Hugging Face feature, named 'sample'. Samples are instances of plaid.containers.sample.Sample. Mesh objects included in samples follow the CGNS standard and can be converted to Muscat.Containers.Mesh.Mesh. Example commands:

import pickle
from datasets import load_dataset
from plaid.containers.sample import Sample

dataset =… See the full description on the dataset page: https://huggingface.co/datasets/PLAID-datasets/AirfRANS_original.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Here you will find the History of Work resources as Linked Open Data. It enables you to look up HISCO and HISCAM scores for a very large number of occupational titles in numerous languages.
Data can be queried via the SPARQL endpoint or via the example queries. If the Linked Open Data format is new to you, you might enjoy these data stories on History of Work as Linked Open Data and this user question on "Is there a list of female occupations?".
This version is dated April 2025 and is not backwards compatible with the previous version (February 2021). The major changes are:
- a major simplification of the graph representation (from 81 graphs to 12);
- use of sdo (https://schema.org/) rather than schema (http://schema.org);
- replacement of prov:wasDerivedFrom with sdo:isPartOf to link occupational titles to originating datasets;
- ETL files (used for conversion to Linked Data) are now publicly available via https://github.com/rlzijdeman/rdf-hisco;
- fixes for issues with language tags;
- specification of regional language tags for English (e.g., @en-gb instead of @en);
- a new preferred API: https://api.druid.datalegend.net/datasets/HistoryOfWork/historyOfWork-all-latest/sparql (the old API will be deprecated at some point: https://api.druid.datalegend.net/datasets/HistoryOfWork/historyOfWork-all-latest/services/historyOfWork-all-latest/sparql).
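The preferred endpoint can be queried programmatically. A minimal sketch in Python (only the endpoint URL comes from this description; the SPARQL pattern is a placeholder guess at the sdo-based model and may need adjusting):

# Query the preferred SPARQL endpoint named above via the standard
# SPARQL 1.1 protocol (form-encoded POST).
import requests

ENDPOINT = ("https://api.druid.datalegend.net/datasets/"
            "HistoryOfWork/historyOfWork-all-latest/sparql")

query = """
PREFIX sdo: <https://schema.org/>
SELECT ?title ?p ?o WHERE {
  ?occupation sdo:name ?title .   # placeholder pattern: adapt to the model
  ?occupation ?p ?o .
} LIMIT 10
"""

resp = requests.post(
    ENDPOINT,
    data={"query": query},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row)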
There are bound to be some issues. Please report them here.
Figure 1. Part of model illustrating the basic relation between occupations, schema.org and HISCO.
https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca5521
Figure 2. Part of model illustrating the relation between occupation, provenance and HISCO auxiliary variables.
https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca551e
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset provides detailed information on 1007 Pokémon from the popular mobile game Pokémon GO. It includes a wide range of attributes such as base stats, move sets, rarity, and acquisition methods. The data was collected using the RapidAPI Pokémon GO API, offering researchers and data enthusiasts a rich resource for analysis, machine learning projects, and game strategy development.
Key features of this dataset include base stats, move sets, rarity, and acquisition methods for each Pokémon.
It is well suited to exploratory analysis, machine learning projects, and game strategy development.
Whether you're a data scientist, game developer, or Pokémon enthusiast, this dataset offers a wealth of information for exploring and analyzing the world of Pokémon GO.
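As a starting point, a quick look at the data with pandas; the file name and the column names in the commented example are hypothetical, so inspect the actual CSV header first:

# Exploratory sketch. "pokemon_go.csv" and the column names below are
# placeholders; check the real file and header before analysis.
import pandas as pd

df = pd.read_csv("pokemon_go.csv")   # hypothetical file name
print(df.shape)                      # expect 1007 rows
print(df.columns.tolist())           # inspect the real attribute names
# Once the real columns are known, e.g.:
# print(df.groupby("rarity")["max_cp"].describe())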
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, MitoTracker Red CMXRos area and intensity (3 h and 24 h incubations with both compounds), MitoSOX oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): control, 6.25, 12.5, 25 and 50 µM for 6-OHDA, and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced and contains no missing values, and data was standardized across features. The small number of samples prevented a full and robust statistical analysis of the results; nevertheless, it allowed the identification of relevant hidden patterns and trends.
Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments was performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.
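For readers without Orange, a rough scikit-learn analogue of this workflow might look as follows. Note the approximations: scikit-learn offers "entropy" but not Orange's gain-ratio criterion, the 95%-majority stopping rule has no direct equivalent, and the data here is a synthetic placeholder with the stated shape:

# Approximate scikit-learn version of the described decision-tree workflow.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: 36 samples, 11 features, 9 classes, 4 samples per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(36, 11))
y = np.repeat(np.arange(9), 4)

clf = make_pipeline(
    StandardScaler(),  # data was standardized across features
    DecisionTreeClassifier(
        criterion="entropy",   # closest available to gain ratio
        min_samples_leaf=2,    # minimum samples in leaves
        min_samples_split=5,   # minimum samples to split a node
        random_state=0,
    ),
)
scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=4))
print(f"stratified CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")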
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In 1991, the U.S. Geological Survey (USGS) began a study of more than 50 major river basins across the Nation as part of the National Water-Quality Assessment (NAWQA) project of the National Water-Quality Program. One of the major goals of the NAWQA project is to determine how water-quality and ecological conditions change over time. To support that goal, long-term consistent and comparable ecological monitoring has been conducted on streams and rivers throughout the Nation. Fish, invertebrate, and diatom data collected as part of the NAWQA program were retrieved from the USGS Aquatic Bioassessment database for use in trend analysis. Ultimately, these data will provide insight into how natural features and human activities have contributed to changes in ecological condition over time in the Nation's streams and rivers. This USGS data release contains all of the input and output files necessary to reproduce the results of the ecological trend analysis described in the associated U.S. Geological Survey Scientific Investigations Report. Data preparation for input to the model is also fully described in the above-mentioned report.
High-quality version of the CelebA dataset, consisting of 30,000 images at 1024 x 1024 resolution.
Note: the CelebA-HQ dataset may contain potential bias. The Fairness Indicators example goes into detail about several considerations to keep in mind while using the CelebA-HQ dataset.
WARNING: This dataset currently requires you to prepare images on your own.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('celeb_a_hq', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
https://storage.googleapis.com/tfds-data/visualization/fig/celeb_a_hq-1024-2.0.0.png
Open Government Data (OGD) portals, thanks to the thousands of geo-referenced datasets they host, are of great interest for any analysis or process relating to territory. For this to happen, users must be able to access these datasets and reuse them. The quality of metadata is often considered an element hindering the full dissemination of OGD. Starting from an experimental investigation of over 160,000 geospatial datasets belonging to six national and international OGD portals, this work first provides an overview of the usage of these portals, measured in terms of dataset views and downloads. Furthermore, to assess the possible influence of metadata quality on the use of geospatial datasets, the metadata of each dataset was assessed and the correlation between these two variables was measured. The results show a significant underutilization of geospatial datasets and a generally poor quality of their metadata. Moreover, only a weak correlation was found between use and metadata quality, not enough to assert with certainty that the latter is a determining factor of the former.
The dataset consists of six zipped CSV files, containing the collected datasets' usage data, full metadata, and computed quality values, for about 160,000 geospatial datasets belonging to the three national and three international portals considered in the study, i.e. US (catalog.data.gov), Colombia (datos.gov.co), Ireland (data.gov.ie), HDX (data.humdata.org), EUODP (data.europa.eu), and NASA (data.nasa.gov).
Data collection occurred in the period: 2019-12-19 -- 2019-12-23.
The header for each CSV file is:
[ ,portalid,id,downloaddate,metadata,overallq,qvalues,assessdate,dviews,downloads,engine,admindomain]
where each row describes one dataset of a portal.
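As an illustration, the correlation between metadata quality and usage could be recomputed from one of these CSV files roughly as follows; the file name is a placeholder, and Spearman correlation is an assumption (the study's exact method may differ):

# Sketch of the quality/usage correlation, using the header shown above.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("us_datasets.csv", index_col=0)  # placeholder file name

# Correlate overall metadata quality with views and downloads.
for usage_col in ("dviews", "downloads"):
    sub = df[["overallq", usage_col]].dropna()
    rho, p = spearmanr(sub["overallq"], sub[usage_col])
    print(f"overallq vs {usage_col}: rho={rho:.2f}, p={p:.3g}")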
[1] Neumaier, S.; Umbrich, J.; Polleres, A. Automated Quality Assessment of Metadata Across Open Data Portals. J. Data and Information Quality 2016, 8, 2:1–2:29. doi:10.1145/2964909
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications
This repository contains data for our paper "ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications" and a small utility class to work with it.
HuggingFace datasets
You can also use Hugging Face datasets to load ACLSum (dataset link). This is convenient if you want to train transformer models using our dataset. Just do:

from datasets import load_dataset
dataset =… See the full description on the dataset page: https://huggingface.co/datasets/sobamchan/aclsum.
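Based on the dataset page URL above, the truncated call presumably completes as follows; the repository id is taken from that URL, and any configuration names are assumptions to verify against the dataset card:

# Repo id inferred from the dataset page URL; an aspect-specific
# configuration argument may be required; check the dataset card.
from datasets import load_dataset

dataset = load_dataset("sobamchan/aclsum")
print(dataset)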
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Because this dataset was used in a competition, we had to hide some of the data to prepare the competition's test set. Thus, in the previous version of the dataset, only the train.csv file was available.
This dataset represents 10 different physical poses that can be used to distinguish 5 exercises. The exercises are Push-up, Pull-up, Sit-up, Jumping Jack and Squat. For every exercise, 2 different classes have been used to represent the terminal positions of that exercise (e.g., “up” and “down” positions for push-ups).
About 500 videos of people doing the exercises were used to collect this data. The videos are from the Countix dataset, which contains YouTube links to several human activity videos. Using a simple Python script, the videos of the 5 different physical exercises were downloaded. From every video, at least 2 frames were manually extracted; the extracted frames represent the terminal positions of the exercise.
For every frame, the MediaPipe framework is used to apply pose estimation, which detects the skeleton of the person in the frame. The landmark model in MediaPipe Pose predicts the location of 33 pose landmarks (see the figure below). Visit the MediaPipe Pose Classification page for more details.
https://mediapipe.dev/images/mobile/pose_tracking_full_body_landmarks.png
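A minimal sketch of this per-frame step, using the MediaPipe Pose solutions API (the frame path is a placeholder):

# Run pose estimation on one extracted frame and print the 33 landmarks.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

frame = cv2.imread("pushup_up_frame.jpg")  # placeholder image path
with mp_pose.Pose(static_image_mode=True) as pose:
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # 33 landmarks, each with normalized x, y, z and a visibility score.
    for i, lm in enumerate(results.pose_landmarks.landmark):
        print(i, round(lm.x, 3), round(lm.y, 3), round(lm.visibility, 3))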
These datasets have been created with the PickPlaceCan environment of the robosuite robotic arm simulator. The human datasets were recorded by a single operator using the RLDS Creator and a gamepad controller.
The synthetic datasets have been recorded using the EnvLogger library.
The datasets follow the RLDS format to represent steps and episodes.
Episodes consist of 400 steps. In each episode, a tag is added when the task is completed; this tag is stored as part of the custom step metadata (see the sketch after the loading example below).
Note that, due to the EnvLogger dependency, generation of this dataset is currently supported on Linux environments only.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('robosuite_panda_pick_place_can', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
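Since the dataset follows RLDS, each episode exposes a nested steps dataset. A sketch for locating the completion tag; its exact key name is not documented here, so inspect a step's keys:

# Walk the RLDS structure: episodes contain a nested `steps` dataset.
import tensorflow_datasets as tfds

ds = tfds.load('robosuite_panda_pick_place_can', split='train')
for episode in ds.take(1):
    for step in episode['steps']:
        print(list(step.keys()))  # the completion tag should appear here
        break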
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).
Image datasets:
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Internet use in the UK annual estimates by age, sex, disability, ethnic group, economic activity and geographical location, including confidence intervals.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.
## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the `RQ1/RQ1_countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ1/RQ1_analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ1/RQ1_analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ1/RQ1_countDataset.py`
## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement on whether or not a model documents bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates, among other pieces of info, whether the model has a license and, if so, what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
Machine learning approaches are often trained and evaluated with datasets that require a clear separation between positive and negative examples. This approach overly simplifies the natural subjectivity present in many tasks and content items. It also obscures the inherent diversity in human perceptions and opinions. Often, tasks that attempt to preserve the variance in content and diversity in humans are quite expensive and laborious. To fill this gap and facilitate more in-depth model performance analyses, we propose the DICES dataset: a unique dataset with diverse perspectives on the safety of AI-generated conversations. We focus on the task of safety evaluation for conversational AI systems. The DICES dataset contains detailed demographic information about each rater, extremely high replication of unique ratings per conversation to ensure the statistical significance of further analyses, and encodes rater votes as distributions across different demographics to allow in-depth exploration of different rating-aggregation strategies.
This dataset is well suited to observe and measure variance, ambiguity and diversity in the context of safety of conversational AI. The dataset is accompanied by a paper describing a set of metrics that show how rater diversity influences the safety perception of raters from different geographic regions, ethnicity groups, age groups and genders. The goal of the DICES dataset is to be used as a shared benchmark for safety evaluation of conversational AI systems.
CONTENT WARNING: This dataset contains adversarial examples of conversations that may be offensive.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('dices', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
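As a sketch of working with the per-rater structure, one could tally votes by demographic group as below; the feature names used ('rater_gender', 'Q_overall') are assumptions about the schema, so inspect one example first:

# Tally safety votes per demographic group. Feature names are assumed;
# print one example from the dataset to confirm the real schema.
import collections

import tensorflow_datasets as tfds

ds = tfds.load('dices', split='train')
counts = collections.Counter()
for ex in ds.take(1000):
    group = ex['rater_gender'].numpy()  # assumed feature name
    vote = ex['Q_overall'].numpy()      # assumed feature name
    counts[(group, vote)] += 1
print(counts)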
The iNaturalist 2021 dataset contains a total of 10,000 species. The full training dataset contains nearly 2.7M images. To make the dataset more accessible, we have also created a "mini" training dataset with 50 examples per species, for a total of 500K images. The full train split overlaps with the mini split. The val split contains 10 validation images per species (100K in total). There are a total of 500,000 test images in the public_test split (without ground-truth labels).
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('i_naturalist2021', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
https://storage.googleapis.com/tfds-data/visualization/fig/i_naturalist2021-2.0.1.png
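Assuming the "mini" subset is exposed as a split of the same name (an assumption to verify against the builder info), it can be loaded like this:

# Check the available splits, then load the mini training subset.
import tensorflow_datasets as tfds

builder = tfds.builder('i_naturalist2021')
print(builder.info.splits)  # confirm the split names first
ds_mini = tfds.load('i_naturalist2021', split='mini')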
https://choosealicense.com/licenses/odc-by/
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.