https://brightdata.com/license
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 5,000 custom-labeled text samples (2,500 human-written, 2,500 AI-generated) designed for binary classification of human vs AI content. Text was preprocessed using TF-IDF and used to train multiple ML classifiers (LogReg, SVC, NB, RF) with high accuracy. The dataset is balanced, ready-to-use, and ideal for text classification, model explainability, or ethical AI applications.
| File Name | Description |
|---|---|
| your_dataset_5000.csv | 5,000 labeled text samples: 2,500 human, 2,500 AI |
| text_classifier_5000.joblib | Serialized trained classifier model (LogReg, top performer) |
| Human vs AI Custom Dataset.ipynb | Main notebook: preprocessing, modeling, evaluation |
| README.md | Overview and usage instructions for the dataset |
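A minimal sketch of the pipeline the card describes (TF-IDF features feeding a logistic regression classifier). This is not the notebook's code; the toy texts below stand in for `your_dataset_5000.csv`, whose column names are not stated on the card:

```python
# Hypothetical sketch of a TF-IDF + LogReg human-vs-AI text classifier.
# Toy samples below stand in for your_dataset_5000.csv (0 = human, 1 = AI).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "I walked to the shop and it started raining halfway there.",
    "My grandmother's recipe never quite works when I try it.",
    "As an AI language model, I can assist with a wide range of tasks.",
    "In conclusion, leveraging synergies optimizes holistic outcomes.",
]
labels = [0, 0, 1, 1]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)
print(clf.predict(["As an AI language model, I cannot do that."]))
```

The shipped `text_classifier_5000.joblib` would presumably be loaded with `joblib.load(...)` instead of retraining.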
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Pegion Model V.2 is a dataset for object detection tasks - it contains Pegion annotations for 998 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Software: Model simulations were conducted using WRF version 3.8.1 (available at https://github.com/NCAR/WRFV3) and CMAQ version 5.2.1 (available at https://github.com/USEPA/CMAQ). The meteorological and concentration fields created using these models are too large to archive on ScienceHub (approximately 1 TB) and are archived on EPA's high performance computing archival system (ASM) at /asm/MOD3APP/pcc/02.NOAH.v.CLM.v.PX/.

Figures: Figures 1-6 and Figure 8 were created using NCAR Command Language (NCL) scripts (https://www.ncl.ucar.edu/get_started.shtml). NCL code can be downloaded from the NCAR website (https://www.ncl.ucar.edu/Download/) at no cost. The data used for these figures are archived on EPA's ASM system and are available upon request. Figures 7, 8b-c, 8e-f, 8h-i, and 9 were created using the AMET utility developed by U.S. EPA/ORD. AMET can be freely downloaded and used at https://github.com/USEPA/AMET. The modeled data paired in space and time provided in this archive can be used to recreate these figures. The data contained in the compressed zip files are organized in comma-delimited files with descriptive headers or space-delimited files that match tabular data in the manuscript. The data dictionary provides additional information about the files and their contents.

This dataset is associated with the following publication: Campbell, P., J. Bash, and T. Spero. Updates to the Noah Land Surface Model in WRF‐CMAQ to Improve Simulated Meteorology, Air Quality, and Deposition. Journal of Advances in Modeling Earth Systems. John Wiley & Sons, Inc., Hoboken, NJ, USA, 11(1): 231-256, (2019).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
CAR OR NOT CAR MODEL is a dataset for object detection tasks - it contains Car Notcar annotations for 2,849 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
AI vs Deepfake vs Real
AI vs Deepfake vs Real is a dataset designed for image classification, distinguishing between artificial, deepfake, and real images. This dataset includes a diverse collection of high-quality images to enhance classification accuracy and improve the model’s overall efficiency. By providing a well-balanced dataset, it aims to support the development of more robust AI-generated and deepfake detection models.
Label Mappings
Mapping of IDs to… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/AI-vs-Deepfake-vs-Real.
This file contains the data set used to develop a random forest model to predict background specific conductivity for stream segments in the contiguous United States. This Excel-readable file contains 56 columns of parameters evaluated during development. The data dictionary provides the definitions of the abbreviations and the measurement units. Each row is a unique sample described as R** which indicates the NHD Hydrologic Unit (underscore), up to a 7-digit COMID, (underscore) sequential sample month. To develop models that make stream-specific predictions across the contiguous United States, we used the StreamCat data set and process (Hill et al. 2016; https://github.com/USEPA/StreamCat). The StreamCat data set is based on a network of stream segments from NHD+ (McKay et al. 2012). These stream segments drain an average area of 3.1 km², and thus define the spatial grain size of this data set. The data set consists of minimally disturbed sites representing the natural variation in environmental conditions that occur in the contiguous 48 United States. More than 2.4 million SC observations were obtained from STORET (USEPA 2016b), state natural resource agencies, the U.S. Geological Survey (USGS) National Water Information System (NWIS) (USGS 2016), and data used in Olson and Hawkins (2012) (Table S1). Data include observations made between 1 January 2001 and 31 December 2015, coincident with Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (https://modis.gsfc.nasa.gov/data/). Each observation was related to the nearest stream segment in the NHD+. Data were limited to one observation per stream segment per month. SC observations with ambiguous locations and repeat measurements along a stream segment in the same month were discarded. Using estimates of anthropogenic stress derived from the StreamCat database (Hill et al. 2016), segments were selected with minimal amounts of human activity (Stoddard et al. 2006) using criteria developed for each Level II Ecoregion (Omernik and Griffith 2014). Segments were considered as potentially minimally stressed where watersheds had 0 - 0.5% impervious surface, 0 - 5% urban, 0 - 10% agriculture, and population densities from 0.8 - 30 people/km² (Table S3). Watersheds with observations with large residuals in initial models were identified and inspected for evidence of other human activities not represented in StreamCat (e.g., mining, logging, grazing, or oil/gas extraction). Observations were removed from disturbed watersheds and from watersheds with a tidal influence or unusual geologic conditions such as hot springs. About 5% of SC observations in each National Rivers and Stream Assessment (NRSA) region were then randomly selected as independent validation data. The remaining observations became the large training data set for model calibration. This dataset is associated with the following publication: Olson, J., and S. Cormier. Modeling spatial and temporal variation in natural background specific conductivity. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 53(8): 4316-4325, (2019).
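The modeling setup described above (fit a random forest on watershed predictors, hold out roughly 5% for independent validation) can be sketched as follows. This is an illustration on synthetic data, not the authors' code; the predictors and the linear signal are invented:

```python
# Illustrative sketch: random forest prediction of background specific
# conductivity (SC) with a ~5% held-out validation split, as described above.
# Features and target here are synthetic stand-ins, not StreamCat data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(size=(n, 4))  # e.g. precipitation, geology, soil, temperature (hypothetical)
y = 200 * X[:, 0] + 50 * X[:, 1] + rng.normal(scale=5, size=n)  # synthetic SC (uS/cm)

# ~5% independent validation, mirroring the NRSA-region holdout
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.05, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print(round(rf.score(X_val, y_val), 3))  # R^2 on the held-out 5%
```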
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets and the pre-trained model associated with GraphaRNA, a diffusion-based graph neural network for RNA 3D structure prediction. The data is organized into multiple files, each providing key resources for training, validation, and testing the model, as well as a pre-trained model ready for inference.
- rRNA_tRNA.tar.gz
- non_rRNA_tRNA.tar.gz
- train-pkl.tar.gz: contains data that can be used to retrain the GraphaRNA model from scratch.
- val-pkl.tar.gz: can be used to validate the model during or after training.
- test-pkl.tar.gz: can be used to evaluate the model's performance on RNA types that it wasn't trained on (non-rRNA and non-tRNA).
- model_epoch_800.tar.gz: a pre-trained model, ready for inference on new RNA sequences; can be used to run inference on new RNA data or to reproduce results from the associated paper.

If you use this dataset or the pre-trained model in your research, please cite the associated paper (linked here once published).
Polygons: 34814 Vertices: 19011
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Antibacterial drugs (AD) change the metabolic status of bacteria, contributing to bacterial death. However, antibiotic resistance and the emergence of multidrug-resistant bacteria increase interest in understanding metabolic network (MN) mutations and the interaction of AD vs MN. In this study, we employed the IFPTML = Information Fusion (IF) + Perturbation Theory (PT) + Machine Learning (ML) algorithm on a huge dataset from the ChEMBL database, which contains 155,000 AD assays vs >40 MNs of multiple bacteria species. We built a linear discriminant analysis (LDA) and 17 ML models centered on the linear index and based on atoms to predict antibacterial compounds. The IFPTML-LDA model presented the following results for the training subset: specificity (Sp) = 76% out of 70,000 cases, sensitivity (Sn) = 70%, and accuracy (Acc) = 73%. The same model presented the following results for the validation subsets: Sp = 76%, Sn = 70%, and Acc = 73.1%. Among the IFPTML nonlinear models, the k-nearest neighbors (KNN) model showed the best results, with Sn = 99.2%, Sp = 95.5%, Acc = 97.4%, and area under the receiver operating characteristic curve (AUROC) = 0.998 in training sets. In the validation series, the random forest model had the best results: Sn = 93.96% and Sp = 87.02% (AUROC = 0.945). The IFPTML linear and nonlinear models for ADs vs MNs have good statistical parameters, and they could contribute toward finding new metabolic mutations in antibiotic resistance and reducing time/costs in antibacterial drug research.
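For readers unfamiliar with the reported statistics, sensitivity, specificity, and accuracy all derive from the confusion matrix. The numbers below are illustrative, not from the paper:

```python
# How Sn, Sp, and Acc relate to a binary confusion matrix (toy labels).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                    # Sn: actives correctly flagged
specificity = tn / (tn + fp)                    # Sp: inactives correctly rejected
accuracy = (tp + tn) / (tp + tn + fp + fn)      # Acc: overall agreement
print(sensitivity, specificity, accuracy)       # 0.75 0.833... 0.8
```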
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.
The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:
After filtering, each document was turned into a list of individual words (or tokens), which were then collected and saved (using the Python pickle format) into the file scied_words_bigrams_V5.pkl.
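A minimal LDA sketch in the spirit of the included notebook (this is not the notebook's code): fit a topic model to pre-tokenized documents like those stored in scied_words_bigrams_V5.pkl. The toy token lists are invented:

```python
# Topic modeling with latent Dirichlet allocation on tokenized documents.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the pickled token lists
docs = [
    ["physics", "laboratory", "experiment", "measurement"],
    ["curriculum", "teacher", "classroom", "instruction"],
    ["experiment", "laboratory", "physics", "apparatus"],
    ["teacher", "student", "classroom", "curriculum"],
]
texts = [" ".join(d) for d in docs]  # rejoin tokens for CountVectorizer

dtm = CountVectorizer().fit_transform(texts)          # document-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
doc_topics = lda.transform(dtm)                       # one topic mix per document
print(doc_topics.shape)
```

Tracking `doc_topics` against publication year is what lets one chart the rise and fall of topics over the journal's history.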
In addition to this file, we have also included the following files:
This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Buena Vs Mala is a dataset for object detection tasks - it contains Manzanas annotations for 380 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments.
As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), index models needed to be developed to map values in geoscientific exploration datasets to favorability index values. This GDR submission includes those index models.
Index models were created by binning values in exploration datasets into chunks based on their favorability, and then applying a number between 0 and 5 to each chunk, where 0 represents very unfavorable data values and 5 represents very favorable data values. To account for differences in how exploration methods are used to detect each play component, separate index models are produced for each exploration method for each component of each play type.
Index models were created using histograms of the distributions of each exploration dataset in combination with literature and input from experts about what combinations of geophysical, geological, and geochemical signatures are considered favorable at Newberry. This is an attempt to create similarly sized bins based on the current understanding of how different anomalies map to favorable areas for the different types of geothermal plays (i.e., conventional hydrothermal, superhot EGS, and supercritical). For example, an area of partial melt would likely appear as an area of low density, high conductivity, low vp, and high vp/vs, so these target anomalies would be given high (4 or 5) index values for the purpose of imaging the heat source.
Index models were produced for the following datasets:
- Geologic model
- Alteration model
- vp/vs
- vp
- vs
- Temperature model
- Seismicity (density*magnitude)
- Density
- Resistivity
- Fault distance
- Earthquake cutoff depth model
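The binning described above (chunk a dataset's values by favorability, then assign each chunk an index from 0 to 5) can be sketched with `numpy.digitize`. The bin edges and index assignments below are invented for illustration; the actual DEEPEN bins come from histograms and expert input:

```python
# Hedged sketch of an index model: map raw exploration values to 0-5
# favorability indices via value bins. Edges/indices here are hypothetical.
import numpy as np

def favorability_index(values, bin_edges, index_values):
    """Assign each value a 0-5 index according to which bin it falls in."""
    assert len(index_values) == len(bin_edges) + 1
    bins = np.digitize(values, bin_edges)        # bin number per value
    return np.asarray(index_values)[bins]        # bin number -> favorability

# Example: resistivity (ohm-m); low resistivity taken as favorable for
# imaging partial melt, per the discussion above.
resistivity = np.array([2.0, 15.0, 80.0, 400.0])
edges = [10.0, 50.0, 200.0]    # hypothetical bin boundaries
indices = [5, 3, 1, 0]         # favorability per bin, most to least favorable
print(favorability_index(resistivity, edges, indices))  # [5 3 1 0]
```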
The data we used for this study include species occurrence data (n=15 species), climate data and predictions, an expert opinion questionnaire, and species masks that represented the model domain for each species. For this data release, we include the results of the expert opinion questionnaire and the species model domains (or masks). We developed an expert opinion questionnaire to gather information regarding the importance of climate variables in determining a species' geographic range. The species masks, or model domains, were defined separately for each species using a variation of the “target-group” approach (Phillips et al. 2009), where the domain was determined using convex polygons including occurrence data for at least three phylogenetically related and similar species (Watling et al. 2012). The species occurrence data, climate data, and climate predictions are freely available online, and therefore not included in this data release. The species occurrence data were obtained primarily from the online database Global Biodiversity Information Facility (GBIF; http://www.gbif.org/) and from scientific literature (Watling et al. 2011). Climate data were obtained from the WorldClim database (Hijmans et al. 2005) and climate predictions were obtained from the Center for Ocean-Atmosphere Prediction Studies (COAPS) at Florida State University (https://floridaclimateinstitute.org/resources/data-sets/regional-downscaling). See metadata for references.
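The "target-group" model domain described above amounts to a convex polygon around the pooled occurrence points of several related species. A sketch with SciPy, using made-up coordinates:

```python
# Sketch of a species mask as a convex polygon around pooled occurrence
# points of >= 3 related species (coordinates are invented).
import numpy as np
from scipy.spatial import ConvexHull

points = np.array([           # pooled occurrences as (lon, lat)
    [-81.0, 25.0], [-80.5, 26.5], [-82.0, 27.0],
    [-83.5, 25.5], [-81.5, 28.0], [-82.5, 26.0],
])
hull = ConvexHull(points)
polygon = points[hull.vertices]  # mask vertices, in counterclockwise order
print(len(polygon), "vertices")
```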
https://choosealicense.com/licenses/cdla-permissive-2.0/
Rapidata Image Generation Coherence Dataset
This dataset was collected in ~4 days using the Rapidata Python API, which is accessible to anyone and ideal for large-scale data annotation. Explore our latest model rankings on our website. If you get value from this dataset and would like to see more in the future, please consider liking it.
Overview
One of the largest human annotated coherence datasets for text-to-image models, this release contains over 1,200,000 human… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/human-coherence-preferences-images.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The development of machine-learning models for atomic-scale simulations has benefitted tremendously from the large databases of materials and molecular properties computed in the past two decades using electronic-structure calculations. More recently, these databases have made it possible to train “universal” models that aim at making accurate predictions for arbitrary atomic geometries and compositions. The construction of many of these databases was however in itself aimed at materials discovery, and therefore targeted primarily to sample stable, or at least plausible, structures and to make the most accurate predictions for each compound - e.g. adjusting the calculation details to the material at hand. Here we introduce a dataset designed specifically to train models that can provide reasonable predictions for arbitrary structures, and that therefore follows a different philosophy. Starting from relatively small sets of stable structures, the dataset is built to contain “massive atomic diversity” (MAD) by aggressively distorting these configurations, with near-complete disregard for the stability of the resulting configurations. The electronic structure details, on the other hand, are chosen to maximize consistency rather than to obtain the most accurate prediction for a given structure, or to minimize computational effort. The MAD dataset we present here, despite containing fewer than 100k structures, has already been shown to enable training universal interatomic potentials that are competitive with models trained on traditional datasets with two to three orders of magnitude more structures. We describe in detail the philosophy and details of the construction of the MAD dataset. We also introduce a low-dimensional structural latent space that allows us to compare it with other popular datasets, and that can also be used as a general-purpose materials cartography tool.
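As a toy illustration of the "aggressive distortion" idea (my own sketch, not the MAD construction recipe), one can generate diverse configurations by applying large random displacements to a stable structure's atomic positions, ignoring whether the result is physically stable:

```python
# Toy illustration: generate diverse frames by aggressively distorting a
# stable structure's atom positions (idealized 4-atom cell, angstrom units).
import numpy as np

rng = np.random.default_rng(42)
stable = np.array([[0.0, 0.0, 0.0],
                   [0.0, 1.8, 1.8],
                   [1.8, 0.0, 1.8],
                   [1.8, 1.8, 0.0]])

def distort(positions, scale):
    """Random displacement of every atom; a large scale disregards stability."""
    return positions + rng.normal(scale=scale, size=positions.shape)

frames = [distort(stable, scale=0.5) for _ in range(10)]
print(len(frames), frames[0].shape)
```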
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset to train models to detect forest fires or other fire-related incidents. It has a folder "fire" with 5,853 images of fire occurring in many different situations, and a folder "not_fire" with 9,755 common images: urban spaces, forests, deserts, rivers, oceans, animals, people, and all sorts of things.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prompting4Debugging Dataset
This dataset contains prompts designed to evaluate and challenge the safety mechanisms of generative text-to-image models, with a particular focus on identifying prompts that are likely to produce images containing nudity. Introduced in the 2024 ICML paper Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts, this dataset is not specific to any single approach or model but is intended to test various mitigating… See the full description on the dataset page: https://huggingface.co/datasets/joycenerd/p4d.
Motivation
This dataset is derived and cleaned from the full PULSE project dataset to share with others data gathered about the users during the project.
Disclaimer
Any third party needs to respect ethics rules and the GDPR and must mention “PULSE DATA H2020 - 727816” in any dissemination activities related to the data being exploited. You should also provide a link to the project website: http://www.project-pulse.eu/
The data in these files is provided as is. Despite our best efforts at filtering out potential issues, some information could be erroneous.
Description of the dataset
The only difference with the original dataset comes from anonymised user information.
The dataset content is described in a dedicated JSON file:
{
"citizen_id": "pseudonymized unique key of each citizen user in the PULSE system",
"city_code": {
"description": "3-letter city codes taken by convention from the IATA codebook of airports and metropolitan areas, as the codebook of global cities in most common and widespread use and therefore adopted as the standard in PULSE (since there is currently - in the year 2020 - still no relevant ISO or other standardized codebook of cities uniformly globally adopted and used). The exception is Pavia, which does not have its own airport, and the nearby Milan/Bergamo airports are not applicable, so the 'PAI' internal code (not existing in original IATA codes) has been devised in PULSE. For cities with multiple airports, IATA metropolitan area codes are used (New York, Paris).",
"BCN": "Barcelona",
"BHX": "Birmingham",
"NYC": "New York",
"PAI": "Pavia",
"PAR": "Paris",
"SIN": "Singapore",
"TPE": "Keelung (Taipei)"
},
"zip_code": "Zip or postal code (area) within a city, basic default granular territorial/administrative subdivision unit for localization of citizen users by place of residence (in all PULSE cities)",
"models": {
"asthma_risk_score": "PULSE asthma risk consensus model score, decimal value ranging from 0 to 1",
"asthma_risk_score_category": {
"description": "Categorized value of the PULSE asthma risk consensus model score, with the following possible category options:",
"low": "low asthma risk, score value below 0.05",
"medium-low": "medium-low asthma risk, score value from 0.05 and below 0.1",
"medium": "medium asthma risk, score value from 0.1 and below 0.15",
"medium-high": "medium-high asthma risk, score value from 0.15 and below 0.2",
"high": "high asthma risk, score value from 0.2 and higher"
},
"T2D_risk_score": "PULSE diabetes type 2 (T2D) risk consensus model score, decimal value ranging from 0 to 1",
"T2D_risk_score_category": {
"description": "Categorized value of the PULSE diabetes type 2 risk consensus model score, with the following possible category options:",
"low": "low T2D risk, score value below 0.05",
"medium-low": "medium-low T2D risk, score value from 0.05 and below 0.1",
"medium": "medium T2D risk, score value from 0.1 and below 0.15",
"medium-high": "medium-high T2D risk, score value from 0.15 and below 0.2",
"high": "high T2D risk, score value from 0.2 and below 0.25",
"very_high": "very high T2D risk, score value from 0.25 and higher"
},
"well-being_score": "PULSE well-being model score, decimal value ranging from -5 to 5",
"well-being_score_category": {
"description": "Categorized value of the PULSE well-being model score, with the following possible category options:",
"low": "low well-being, score value below -0.37",
"medium-low": "medium-low well-being, score value from -0.37 and below 0.04",
"medium-high": "medium-high well-being, score value from 0.04 and below 0.36",
"high": "high well-being, score value from 0.36 and higher"
},
"computed_time": "Timestamp (UTC) when each relevant model score value/result had been computed or derived"
}
}
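A small helper showing how the categorized scores in the data dictionary relate to the raw 0-1 scores. The thresholds are taken from the T2D entry above; the function itself is my own sketch, not PULSE project code:

```python
# Map a 0-1 T2D risk score to its PULSE category, per the data dictionary.
def t2d_risk_category(score):
    if score < 0.05:
        return "low"
    if score < 0.10:
        return "medium-low"
    if score < 0.15:
        return "medium"
    if score < 0.20:
        return "medium-high"
    if score < 0.25:
        return "high"
    return "very_high"

print(t2d_risk_category(0.12))  # -> medium
```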
ESRI GRID raster datasets were created to display and quantify oil shale resources for seventeen zones in the Piceance Basin, Colorado as part of a 2009 National Oil Shale Assessment. The oil shale zones in descending order are: Bed 44, A Groove, Mahogany Zone, B Groove, R-6, L-5, R-5, L-4, R-4, L-3, R-3, L-2, R-2, L-1, R-1, L-0, and R-0. Each raster cell represents a one-acre square of the land surface and contains values for either oil yield in barrels per acre, oil yield in gallons per ton, or isopach thickness in feet, as indicated by the grid-name suffix: "_b" (barrels per acre), "_g" (gallons per ton), and "_i" (isopach thickness), where the prefix is the name of the oil shale zone.