100+ datasets found
  1. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jun 19, 2024
    Cite
    Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Explore at:
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Jun 19, 2024
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  2. Human vs AI Text Classification Dataset

    • kaggle.com
    Updated May 1, 2025
    Cite
    Anastasiya Kotelnikova (2025). Human vs AI Text Classification Dataset [Dataset]. https://www.kaggle.com/datasets/aknjit/human-vs-ai-text-classification-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anastasiya Kotelnikova
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains 5,000 custom-labeled text samples (2,500 human-written, 2,500 AI-generated) designed for binary classification of human vs AI content. Text was preprocessed using TF-IDF and used to train multiple ML classifiers (LogReg, SVC, NB, RF) with high accuracy. The dataset is balanced, ready-to-use, and ideal for text classification, model explainability, or ethical AI applications.

    Files
    your_dataset_5000.csv: 5,000 labeled text samples (2,500 human, 2,500 AI)
    text_classifier_5000.joblib: serialized trained classifier model (LogReg, top performer)
    Human vs AI Custom Dataset.ipynb: main notebook covering preprocessing, modeling, and evaluation
    README.md: overview and usage instructions for the dataset
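The pipeline this dataset describes (TF-IDF features feeding a logistic-regression classifier) can be sketched with scikit-learn. The corpus below is a toy stand-in, not the dataset's actual contents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; the real dataset ships 5,000 labeled rows in your_dataset_5000.csv
texts = [
    "I walked to the shop and it started raining halfway there.",
    "As an AI language model, I can summarize the topic as follows.",
    "We argued about the film for hours afterwards.",
    "In conclusion, the aforementioned factors collectively demonstrate the outcome.",
]
labels = [0, 1, 0, 1]  # 0 = human-written, 1 = AI-generated

# TF-IDF vectorization followed by logistic regression, as in the dataset's notebook
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["As an AI language model, I will explain."]))
```

On the real 5,000-sample file, the same pipeline would be fit on the text column against the human/AI label column.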
  3. Pegion Model V.2 Dataset

    • universe.roboflow.com
    zip
    Updated Mar 27, 2025
    Cite
    B6411138ws (2025). Pegion Model V.2 Dataset [Dataset]. https://universe.roboflow.com/b6411138ws/pegion-model-v.2
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 27, 2025
    Dataset authored and provided by
    B6411138ws
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Pegion Bounding Boxes
    Description

    Pegion Model V.2

    ## Overview
    
    Pegion Model V.2 is a dataset for object detection tasks - it contains Pegion annotations for 998 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  4. Datasets for figures and tables

    • catalog.data.gov
    • datasets.ai
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Datasets for figures and tables [Dataset]. https://catalog.data.gov/dataset/datasets-for-figures-and-tables
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Software

    Model simulations were conducted using WRF version 3.8.1 (available at https://github.com/NCAR/WRFV3) and CMAQ version 5.2.1 (available at https://github.com/USEPA/CMAQ). The meteorological and concentration fields created using these models are too large to archive on ScienceHub (approximately 1 TB) and are archived on EPA's high performance computing archival system (ASM) at /asm/MOD3APP/pcc/02.NOAH.v.CLM.v.PX/.

    Figures

    Figures 1 – 6 and Figure 8 were created using NCAR Command Language (NCL) scripts (https://www.ncl.ucar.edu/get_started.shtml). NCL code can be downloaded from the NCAR website (https://www.ncl.ucar.edu/Download/) at no cost. The data used for these figures are archived on EPA's ASM system and are available upon request. Figures 7, 8b-c, 8e-f, 8h-i, and 9 were created using the AMET utility developed by U.S. EPA/ORD. AMET can be freely downloaded and used at https://github.com/USEPA/AMET. The modeled data paired in space and time provided in this archive can be used to recreate these figures. The data contained in the compressed zip files are organized in comma-delimited files with descriptive headers or space-delimited files that match tabular data in the manuscript. The data dictionary provides additional information about the files and their contents. This dataset is associated with the following publication: Campbell, P., J. Bash, and T. Spero. Updates to the Noah Land Surface Model in WRF‐CMAQ to Improve Simulated Meteorology, Air Quality, and Deposition. Journal of Advances in Modeling Earth Systems. John Wiley & Sons, Inc., Hoboken, NJ, USA, 11(1): 231-256, (2019).

  5. Car Or Not Car Model Dataset

    • universe.roboflow.com
    zip
    Updated May 26, 2025
    Cite
    Mustafa Zincirci (2025). Car Or Not Car Model Dataset [Dataset]. https://universe.roboflow.com/mustafa-zincirci-xh4mj/car-or-not-car-model/model/9
    Explore at:
    Available download formats: zip
    Dataset updated
    May 26, 2025
    Dataset authored and provided by
    Mustafa Zincirci
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Car Notcar Bounding Boxes
    Description

    CAR OR NOT CAR MODEL

    ## Overview
    
    CAR OR NOT CAR MODEL is a dataset for object detection tasks - it contains Car Notcar annotations for 2,849 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  6. AI-vs-Deepfake-vs-Real

    • huggingface.co
    Updated Feb 23, 2025
    Cite
    Prithiv Sakthi (2025). AI-vs-Deepfake-vs-Real [Dataset]. https://huggingface.co/datasets/prithivMLmods/AI-vs-Deepfake-vs-Real
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 23, 2025
    Authors
    Prithiv Sakthi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    AI vs Deepfake vs Real

    AI vs Deepfake vs Real is a dataset designed for image classification, distinguishing between artificial, deepfake, and real images. This dataset includes a diverse collection of high-quality images to enhance classification accuracy and improve the model’s overall efficiency. By providing a well-balanced dataset, it aims to support the development of more robust AI-generated and deepfake detection models.

      Label Mappings
    

    Mapping of IDs to… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/AI-vs-Deepfake-vs-Real.

  7. Dataset for modeling spatial and temporal variation in natural background...

    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Dataset for modeling spatial and temporal variation in natural background specific conductivity [Dataset]. https://catalog.data.gov/dataset/dataset-for-modeling-spatial-and-temporal-variation-in-natural-background-specific-conduct
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This file contains the data set used to develop a random forest model to predict background specific conductivity (SC) for stream segments in the contiguous United States. This Excel-readable file contains 56 columns of parameters evaluated during development. The data dictionary provides the definitions of the abbreviations and the measurement units. Each row is a unique sample described as R**, which indicates the NHD Hydrologic Unit (underscore), up to a 7-digit COMID, (underscore) sequential sample month.

    To develop models that make stream-specific predictions across the contiguous United States, we used the StreamCat data set and process (Hill et al. 2016; https://github.com/USEPA/StreamCat). The StreamCat data set is based on a network of stream segments from NHD+ (McKay et al. 2012). These stream segments drain an average area of 3.1 km2 and thus define the spatial grain size of this data set. The data set consists of minimally disturbed sites representing the natural variation in environmental conditions that occur in the contiguous 48 United States. More than 2.4 million SC observations were obtained from STORET (USEPA 2016b), state natural resource agencies, the U.S. Geological Survey (USGS) National Water Information System (NWIS) (USGS 2016), and data used in Olson and Hawkins (2012) (Table S1). Data include observations made between 1 January 2001 and 31 December 2015, thus coincident with Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (https://modis.gsfc.nasa.gov/data/). Each observation was related to the nearest stream segment in the NHD+. Data were limited to one observation per stream segment per month; SC observations with ambiguous locations and repeat measurements along a stream segment in the same month were discarded.

    Using estimates of anthropogenic stress derived from the StreamCat database (Hill et al. 2016), segments were selected with minimal amounts of human activity (Stoddard et al. 2006) using criteria developed for each Level II Ecoregion (Omernik and Griffith 2014). Segments were considered potentially minimally stressed where watersheds had 0 – 0.5% impervious surface, 0 – 5% urban, 0 – 10% agriculture, and population densities from 0.8 – 30 people/km2 (Table S3). Watersheds whose observations had large residuals in initial models were identified and inspected for evidence of other human activities not represented in StreamCat (e.g., mining, logging, grazing, or oil/gas extraction). Observations were removed from watersheds that were disturbed, tidally influenced, or had unusual geologic conditions such as hot springs. About 5% of SC observations in each National Rivers and Streams Assessment (NRSA) region were then randomly selected as independent validation data. The remaining observations became the large training data set for model calibration. This dataset is associated with the following publication: Olson, J., and S. Cormier. Modeling spatial and temporal variation in natural background specific conductivity. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 53(8): 4316-4325, (2019).
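The modeling setup described above (a random forest regressor with roughly 5% of observations held out for validation) can be sketched with scikit-learn. The predictors and target below are synthetic stand-ins, not the real StreamCat variables:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical stand-ins for StreamCat-style watershed predictors
X = rng.random((500, 5))
# Synthetic background specific conductivity (µS/cm) driven by two predictors plus noise
y = 200 + 300 * X[:, 0] - 150 * X[:, 1] + 20 * rng.standard_normal(500)

# ~5% held out as independent validation, mirroring the split described above
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.05, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print(round(model.score(X_val, y_val), 2))  # R² on the held-out 5%
```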

  8. GraphaRNA dataset and model

    • zenodo.org
    application/gzip, txt
    Updated Jun 5, 2025
    Cite
    Marek Justyna; Marek Justyna; Craig Zirbel; Craig Zirbel; Maciej Antczak; Maciej Antczak; Marta Szachniuk; Marta Szachniuk (2025). GraphaRNA dataset and model [Dataset]. http://doi.org/10.5281/zenodo.13757098
    Explore at:
    Available download formats: application/gzip, txt
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marek Justyna; Marek Justyna; Craig Zirbel; Craig Zirbel; Maciej Antczak; Maciej Antczak; Marta Szachniuk; Marta Szachniuk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Graph Neural Network and Diffusion Model for Modeling RNA Interatomic Interactions

    This repository contains the datasets and the pre-trained model associated with GraphaRNA, a diffusion-based graph neural network for RNA 3D structure prediction. The data is organized into multiple files, each providing key resources for training, validation, and testing the model, as well as a pre-trained model ready for inference.

    Data Overview:

    1. rRNA_tRNA.tar.gz:

      • Contains raw PDB files with extracted descriptors from ribosomal RNA (rRNA) and transfer RNA (tRNA) structures.
    2. non_rRNA_tRNA.tar.gz:

      • Contains raw PDB files with extracted descriptors from RNA molecules that are non-rRNA and non-tRNA. These serve as a separate test set.
    3. train-pkl.tar.gz:

      • Contains the filtered and preprocessed pickle files for the training set, derived from the rRNA_tRNA dataset. These files are used to train GraphaRNA.
    4. val-pkl.tar.gz:

      • Contains the validation set, which is a subset of the training data from train-pkl.tar.gz.
    5. test-pkl.tar.gz:

      • Contains the preprocessed pickle files for the test set, derived from the non_rRNA_tRNA dataset. This set includes RNA descriptors that are not rRNA or tRNA, providing a challenging test scenario.
    6. model_epoch_800.tar.gz:

      • Contains the pre-trained GraphaRNA model after 800 epochs of training on the train-pkl dataset. This model is ready for inference and evaluation.
    7. all-outputs.txt:
      • Contains basic metadata about all descriptors: name of file, number of segments, number of nucleotides, sequence of each segment, and positions of segments in original PDB files.

    Use of Data and Model:

    • The raw PDB files can be used for RNA descriptor extraction, while the pickle files are preprocessed for direct use in training, validation, and testing workflows.
    • The GraphaRNA model in model_epoch_800.tar.gz can be used to run inference on new RNA data or to reproduce results from the associated paper.

    How to Use:

    • Training: The train-pkl.tar.gz contains data that can be used to retrain the GraphaRNA model from scratch.
    • Validation: The val-pkl.tar.gz can be used to validate the model during or after training.
    • Testing: Use the test-pkl.tar.gz to evaluate the model's performance on RNA types that it wasn't trained on (non-rRNA and non-tRNA).
    • Inference: The model_epoch_800.tar.gz is ready for inference on new RNA sequences.

    Acknowledgments:

    If you use this dataset or the pre-trained model in your research, please cite the associated paper (linked here once published).

  9. NASA 3D Models: Saturn V

    • catalog.data.gov
    Updated Apr 10, 2025
    Cite
    National Aeronautics and Space Administration (2025). NASA 3D Models: Saturn V [Dataset]. https://catalog.data.gov/dataset/nasa-3d-models-saturn-v
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Polygons: 34814
    Vertices: 19011

  10. Machine Learning Study of Metabolic Networks vs ChEMBL Data of Antibacterial...

    • acs.figshare.com
    xlsx
    Updated Jun 5, 2023
    Cite
    Karel Diéguez-Santana; Gerardo M. Casañola-Martin; Roldan Torres; Bakhtiyor Rasulev; James R. Green; Humbert González-Díaz (2023). Machine Learning Study of Metabolic Networks vs ChEMBL Data of Antibacterial Compounds [Dataset]. http://doi.org/10.1021/acs.molpharmaceut.2c00029.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    ACS Publications
    Authors
    Karel Diéguez-Santana; Gerardo M. Casañola-Martin; Roldan Torres; Bakhtiyor Rasulev; James R. Green; Humbert González-Díaz
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Antibacterial drugs (AD) change the metabolic status of bacteria, contributing to bacterial death. However, antibiotic resistance and the emergence of multidrug-resistant bacteria increase interest in understanding metabolic network (MN) mutations and the interaction of AD vs MN. In this study, we employed the IFPTML = Information Fusion (IF) + Perturbation Theory (PT) + Machine Learning (ML) algorithm on a huge dataset from the ChEMBL database, which contains 155,000 AD assays vs >40 MNs of multiple bacteria species. We built a linear discriminant analysis (LDA) and 17 ML models centered on the linear index and based on atoms to predict antibacterial compounds. The IFPTML-LDA model presented the following results for the training subset: specificity (Sp) = 76% out of 70,000 cases, sensitivity (Sn) = 70%, and accuracy (Acc) = 73%. The same model presented the following results for the validation subsets: Sp = 76%, Sn = 70%, and Acc = 73.1%. Among the IFPTML nonlinear models, the k-nearest neighbors (KNN) model showed the best results, with Sn = 99.2%, Sp = 95.5%, Acc = 97.4%, and Area Under the Receiver Operating Characteristic curve (AUROC) = 0.998 in training sets. In the validation series, the Random Forest had the best results: Sn = 93.96% and Sp = 87.02% (AUROC = 0.945). The IFPTML linear and nonlinear models regarding the ADs vs MNs have good statistical parameters, and they could contribute toward finding new metabolic mutations in antibiotic resistance and reducing time/costs in antibacterial drug research.
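The Sn/Sp/Acc figures reported above follow the standard confusion-matrix definitions, which can be computed directly. The labels below are toy values, not the study's data:

```python
import numpy as np

def sn_sp_acc(y_true, y_pred):
    """Sensitivity (Sn), specificity (Sp), and accuracy (Acc) for a binary classifier."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    return float(tp / (tp + fn)), float(tn / (tn + fp)), float((tp + tn) / len(y_true))

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(sn_sp_acc(y_true, y_pred))  # (0.75, 0.75, 0.75)
```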

  11. Science Education Research Topic Modeling Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, html +2
    Updated Oct 9, 2024
    Cite
    Tor Ole B. Odden; Tor Ole B. Odden; Alessandro Marin; Alessandro Marin; John L. Rudolph; John L. Rudolph (2024). Science Education Research Topic Modeling Dataset [Dataset]. http://doi.org/10.5281/zenodo.4094974
    Explore at:
    Available download formats: bin, txt, html, text/x-python
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tor Ole B. Odden; Tor Ole B. Odden; Alessandro Marin; Alessandro Marin; John L. Rudolph; John L. Rudolph
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.

    The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:

    • We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.
    • We removed the reference sections of the articles, as well as headings (in all caps) such as “ABSTRACT”.
    • We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example, converting “per cent” to “percent”).
    • We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.
    • We removed all stop words, which are words without any semantic meaning on their own (“the”, “in”, “if”, “and”, “but”, etc.), as well as all single-letter words.
    • We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).
    • We detected and created bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.

    After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
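An LDA run over token lists like these can be sketched with scikit-learn (the dataset's own notebook may use a different library; the documents below are invented stand-ins for the scraped articles):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny stand-in corpus; the real input is the token lists in scied_words_bigrams_V5.pkl
docs = [
    "student problem_solving physics inquiry classroom",
    "curriculum high_school teacher laboratory chemistry",
    "assessment student learning survey attitude",
    "inquiry laboratory experiment physics measurement",
]

# Bag-of-words counts, then LDA with 2 latent topics
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)  # per-document topic weights; each row sums to 1
print(doc_topics.shape)  # (4, 2)
```

Tracking how these per-document topic weights shift by publication year is what lets one chart the rise and fall of topics over the journal's history.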

    In addition to this file, we have also included the following files:

    1. SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by the LDA model used to analyze the data.
    2. Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.
    3. Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.

    This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.

  12. Buena Vs Mala Dataset

    • universe.roboflow.com
    zip
    Updated Jul 2, 2025
    Cite
    detectormanzanas (2025). Buena Vs Mala Dataset [Dataset]. https://universe.roboflow.com/detectormanzanas/buena-vs-mala/model/2
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 2, 2025
    Dataset authored and provided by
    detectormanzanas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Manzanas Bounding Boxes
    Description

    Buena Vs Mala

    ## Overview
    
    Buena Vs Mala is a dataset for object detection tasks - it contains Manzanas annotations for 380 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  13. DEEPEN 3D PFA Index Models for Exploration Datasets at Newberry Volcano

    • data.openei.org
    • gdr.openei.org
    data
    Updated Jun 30, 2023
    Cite
    Nicole Taverna; Hannah Pauling; Amanda Kolker; Nicole Taverna; Hannah Pauling; Amanda Kolker (2023). DEEPEN 3D PFA Index Models for Exploration Datasets at Newberry Volcano [Dataset]. http://doi.org/10.15121/1995528
    Explore at:
    Available download formats: data
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    National Renewable Energy Laboratory
    Open Energy Data Initiative (OEDI)
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
    Authors
    Nicole Taverna; Hannah Pauling; Amanda Kolker; Nicole Taverna; Hannah Pauling; Amanda Kolker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Newberry Volcano
    Description

    DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments.

    As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), index models needed to be developed to map values in geoscientific exploration datasets to favorability index values. This GDR submission includes those index models.

    Index models were created by binning values in exploration datasets into chunks based on their favorability, and then applying a number between 0 and 5 to each chunk, where 0 represents very unfavorable data values and 5 represents very favorable data values. To account for differences in how exploration methods are used to detect each play component, separate index models are produced for each exploration method for each component of each play type.

    Index models were created using histograms of the distributions of each exploration dataset, in combination with literature and input from experts about what combinations of geophysical, geological, and geochemical signatures are considered favorable at Newberry. This is an attempt to create similarly sized bins based on the current understanding of how different anomalies map to favorable areas for the different types of geothermal plays (i.e., conventional hydrothermal, superhot EGS, and supercritical). For example, an area of partial melt would likely appear as an area of low density, high conductivity, low vp, and high vp/vs, so these target anomalies would be given high (4 or 5) index values for the purpose of imaging the heat source.

    Index models were produced for the following datasets:

    • Geologic model
    • Alteration model
    • vp/vs
    • vp
    • vs
    • Temperature model
    • Seismicity (density*magnitude)
    • Density
    • Resistivity
    • Fault distance
    • Earthquake cutoff depth model
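In code, an index model of this kind reduces to a binning step. The sketch below maps a hypothetical resistivity dataset onto the 0–5 favorability scale; the bin edges and favorability assignments are illustrative, not DEEPEN's actual values:

```python
import numpy as np

# Hypothetical index model: bin resistivity (ohm-m) into favorability 0-5,
# treating low resistivity (conductive, possibly hydrothermal) as favorable.
bin_edges = [1, 5, 20, 50, 100]    # five edges define six chunks
index_values = [5, 4, 3, 2, 1, 0]  # favorability assigned to each chunk

resistivity = np.array([0.5, 3.0, 10.0, 30.0, 80.0, 500.0])
# np.digitize finds each value's chunk; the lookup applies the index model
favorability = np.array(index_values)[np.digitize(resistivity, bin_edges)]
print(favorability)  # [5 4 3 2 1 0]
```

Separate edge/value tables would be defined per exploration method and per play component, as the description notes.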

  14. Data for comparison of climate envelope models developed using...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Data for comparison of climate envelope models developed using expert-selected variables versus statistical selection [Dataset]. https://catalog.data.gov/dataset/data-for-comparison-of-climate-envelope-models-developed-using-expert-selected-variables-v
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The data we used for this study include species occurrence data (n=15 species), climate data and predictions, an expert opinion questionnaire, and species masks that represented the model domain for each species. For this data release, we include the results of the expert opinion questionnaire and the species model domains (or masks). We developed an expert opinion questionnaire to gather expert opinion regarding the importance of climate variables in determining a species' geographic range. The species masks, or model domains, were defined separately for each species using a variation of the “target-group” approach (Phillips et al. 2009), where the domain was determined using convex polygons including occurrence data for at least three phylogenetically related and similar species (Watling et al. 2012). The species occurrence data, climate data, and climate predictions are freely available online and are therefore not included in this data release. The species occurrence data were obtained primarily from the online Global Biodiversity Information Facility database (GBIF; http://www.gbif.org/) and from the scientific literature (Watling et al. 2011). Climate data were obtained from the WorldClim database (Hijmans et al. 2005), and climate predictions were obtained from the Center for Ocean-Atmosphere Prediction Studies (COAPS) at Florida State University (https://floridaclimateinstitute.org/resources/data-sets/regional-downscaling). See metadata for references.
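The “target-group” model domain described above amounts to a convex polygon around the pooled occurrence points of several related species. A sketch with SciPy, using invented coordinates for illustration:

```python
import numpy as np
from scipy.spatial import Delaunay

# Pooled occurrence points (lon, lat) of three hypothetical related species
occurrences = np.array([
    [-81.0, 27.5], [-80.2, 26.0], [-82.5, 28.0],  # species A
    [-80.8, 25.5], [-83.0, 27.0],                 # species B
    [-81.5, 29.0], [-79.9, 27.8],                 # species C
])
domain = Delaunay(occurrences)  # triangulation spanning the convex hull

def in_domain(point):
    """True if a candidate location falls inside the species' model domain."""
    return domain.find_simplex(point) >= 0

print(in_domain([-81.0, 27.0]), in_domain([-70.0, 40.0]))
```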

  15. human-coherence-preferences-images

    • huggingface.co
    • aifasthub.com
    Updated Mar 11, 2025
    Cite
    Rapidata (2025). human-coherence-preferences-images [Dataset]. https://huggingface.co/datasets/Rapidata/human-coherence-preferences-images
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 11, 2025
    Dataset authored and provided by
    Rapidata
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    Rapidata Image Generation Coherence Dataset

    This dataset was collected in ~4 Days using the Rapidata Python API, accessible to anyone and ideal for large scale data annotation. Explore our latest model rankings on our website. If you get value from this dataset and would like to see more in the future, please consider liking it.

      Overview
    

    One of the largest human annotated coherence datasets for text-to-image models, this release contains over 1,200,000 human… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/human-coherence-preferences-images.

  16. Data from: Massive Atomic Diversity: a compact universal dataset for atomistic machine learning

    • archive.materialscloud.org
    application/gzip, bin +2
    Updated Jun 26, 2025
    Arslan Mazitov; Sofiia Chorna; Guillaume Fraux; Marnik Bercx; Giovanni Pizzi; Sandip De; Michele Ceriotti (2025). Massive Atomic Diversity: a compact universal dataset for atomistic machine learning [Dataset]. http://doi.org/10.24435/materialscloud:vd-e8
    Explore at:
    bin, xyz, application/gzip, text/markdown (available download formats)
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    Materials Cloud
    Authors
    Arslan Mazitov; Sofiia Chorna; Guillaume Fraux; Marnik Bercx; Giovanni Pizzi; Sandip De; Michele Ceriotti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The development of machine-learning models for atomic-scale simulations has benefitted tremendously from the large databases of materials and molecular properties computed in the past two decades using electronic-structure calculations. More recently, these databases have made it possible to train “universal” models that aim at making accurate predictions for arbitrary atomic geometries and compositions. The construction of many of these databases was, however, itself aimed at materials discovery, and therefore targeted primarily at sampling stable, or at least plausible, structures and at making the most accurate prediction for each compound – e.g. adjusting the calculation details to the material at hand. Here we introduce a dataset designed specifically to train models that can provide reasonable predictions for arbitrary structures, and that therefore follows a different philosophy. Starting from relatively small sets of stable structures, the dataset is built to contain “massive atomic diversity” (MAD) by aggressively distorting these configurations, with near-complete disregard for the stability of the resulting configurations. The electronic-structure details, on the other hand, are chosen to maximize consistency rather than to obtain the most accurate prediction for a given structure or to minimize computational effort. The MAD dataset we present here, despite containing fewer than 100k structures, has already been shown to enable training universal interatomic potentials that are competitive with models trained on traditional datasets with two to three orders of magnitude more structures. We describe in detail the philosophy and construction of the MAD dataset. We also introduce a low-dimensional structural latent space that allows us to compare it with other popular datasets and that can also be used as a general-purpose materials cartography tool.
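    The "aggressive distortion" idea can be illustrated roughly as below. This is an invented sketch, not the MAD construction protocol: the function, the noise magnitudes, and the toy structure are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def distort(positions, cell, sigma_pos=0.25, sigma_strain=0.05):
    """Rattle a structure: apply a random strain to the cell and random
    Cartesian displacements to the atoms, ignoring stability."""
    strain = np.eye(3) + sigma_strain * rng.standard_normal((3, 3))
    new_cell = cell @ strain
    new_pos = positions @ strain + sigma_pos * rng.standard_normal(positions.shape)
    return new_pos, new_cell

pos = rng.random((8, 3)) * 4.0   # 8 atoms in a 4 Å cubic box
cell = 4.0 * np.eye(3)
new_pos, new_cell = distort(pos, cell)
```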

  17. Fire Dataset

    • kaggle.com
    Updated Nov 10, 2024
    Dani215 (2024). Fire Dataset [Dataset]. https://www.kaggle.com/datasets/dani215/fire-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dani215
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A dataset for training models to detect forest fires and other fire-related incidents. It contains a folder "fire" with 5,853 images of fire occurring in many different situations, and a folder "not_fire" with 9,755 everyday images: urban spaces, forests, deserts, rivers, oceans, animals, people, and so on.

  18. Data from: p4d

    • huggingface.co
    Updated Jun 11, 2024
    Zhi-Yi Chin (2024). p4d [Dataset]. https://huggingface.co/datasets/joycenerd/p4d
    Explore at:
    Dataset updated
    Jun 11, 2024
    Authors
    Zhi-Yi Chin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Prompting4Debugging Dataset

    This dataset contains prompts designed to evaluate and challenge the safety mechanisms of generative text-to-image models, with a particular focus on identifying prompts that are likely to produce images containing nudity. Introduced in the 2024 ICML paper Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts, this dataset is not specific to any single approach or model but is intended to test various mitigating… See the full description on the dataset page: https://huggingface.co/datasets/joycenerd/p4d.

  19. PULSE dataset

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Feb 11, 2021
    Alexandre Esse (2021). PULSE dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3928561
    Explore at:
    Dataset updated
    Feb 11, 2021
    Dataset provided by
    Vladimir Urosevic
    Alexandre Esse
    Description

    Motivation

    This dataset was derived and cleaned from the full PULSE project dataset in order to share the data gathered about users during the project.

    Disclaimer

    Any third party must respect ethics rules and the GDPR, and must mention “PULSE DATA H2020 - 727816” in any dissemination activities related to the exploited data. They should also provide a link to the project website: http://www.project-pulse.eu/

    The data in these files are provided as is. Despite our best efforts at filtering out potential issues, some information may be erroneous.

    Description of the dataset

    The only difference from the original dataset is that user information has been anonymised.

    The dataset content is described in a dedicated JSON file:

    {
      "citizen_id": "pseudonymized unique key of each citizen user in the PULSE system",
      "city_code": {
        "description": "3-letter city codes taken by convention from the IATA codebook of airports and metropolitan areas, as the codebook of global cities in most common and widespread use and therefore adopted as standard in PULSE (since there is currently - in the year 2020 - still no relevant ISO or other standardized codebook of cities uniformly globally adopted and used). The exception is Pavia, which does not have its own airport, and the nearby Milan/Bergamo airports are not applicable, so the 'PAI' internal code (not existing in the original IATA codes) has been devised in PULSE. For cities with multiple airports, IATA metropolitan area codes are used (New York, Paris).",
        "BCN": "Barcelona",
        "BHX": "Birmingham",
        "NYC": "New York",
        "PAI": "Pavia",
        "PAR": "Paris",
        "SIN": "Singapore",
        "TPE": "Keelung (Taipei)"
      },
      "zip_code": "Zip or postal code (area) within a city, the basic default granular territorial/administrative subdivision unit for localization of citizen users by place of residence (in all PULSE cities)",
      "models": {
        "asthma_risk_score": "PULSE asthma risk consensus model score, decimal value ranging from 0 to 1",
        "asthma_risk_score_category": {
          "description": "Categorized value of the PULSE asthma risk consensus model score, with the following possible category options:",
          "low": "low asthma risk, score value below 0,05",
          "medium-low": "medium-low asthma risk, score value from 0,05 and below 0,1",
          "medium": "medium asthma risk, score value from 0,1 and below 0,15",
          "medium-high": "medium-high asthma risk, score value from 0,15 and below 0,2",
          "high": "high asthma risk, score value from 0,2 and higher"
        },
        "T2D_risk_score": "PULSE diabetes type 2 (T2D) risk consensus model score, decimal value ranging from 0 to 1",
        "T2D_risk_score_category": {
          "description": "Categorized value of the PULSE diabetes type 2 risk consensus model score, with the following possible category options:",
          "low": "low T2D risk, score value below 0,05",
          "medium-low": "medium-low T2D risk, score value from 0,05 and below 0,1",
          "medium": "medium T2D risk, score value from 0,1 and below 0,15",
          "medium-high": "medium-high T2D risk, score value from 0,15 and below 0,2",
          "high": "high T2D risk, score value from 0,2 and below 0,25",
          "very_high": "very high T2D risk, score value from 0,25 and higher"
        },
        "well-being_score": "PULSE well-being model score, decimal value ranging from -5 to 5",
        "well-being_score_category": {
          "description": "Categorized value of the PULSE well-being model score, with the following possible category options:",
          "low": "low well-being, score value below -0,37",
          "medium-low": "medium-low well-being, score value from -0,37 and below 0,04",
          "medium-high": "medium-high well-being, score value from 0,04 and below 0,36",
          "high": "high well-being, score value from 0,36 and higher"
        },
        "computed_time": "Timestamp (UTC) when each relevant model score value/result had been computed or derived"
      }
    }
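    The T2D threshold scheme in the schema can be expressed as a small helper. The function name is ours, not part of the PULSE release; the thresholds are taken directly from the schema (its decimal commas rendered as points here).

```python
def categorize_t2d(score):
    """Map a PULSE T2D risk score (0 to 1) to its category per the schema:
    each band runs from its lower bound (inclusive) up to the next threshold."""
    bounds = [(0.05, "low"), (0.10, "medium-low"), (0.15, "medium"),
              (0.20, "medium-high"), (0.25, "high")]
    for upper, label in bounds:
        if score < upper:
            return label
    return "very_high"
```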

  20. Raster Dataset Model of Oil Shale Resources in the Piceance Basin, Colorado

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Jul 6, 2024
    U.S. Geological Survey (2024). Raster Dataset Model of Oil Shale Resources in the Piceance Basin, Colorado [Dataset]. https://catalog.data.gov/dataset/raster-dataset-model-of-oil-shale-resources-in-the-piceance-basin-colorado
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Colorado
    Description

    ESRI GRID raster datasets were created to display and quantify oil shale resources for seventeen zones in the Piceance Basin, Colorado as part of a 2009 National Oil Shale Assessment. The oil shale zones in descending order are: Bed 44, A Groove, Mahogany Zone, B Groove, R-6, L-5, R-5, L-4, R-4, L-3, R-3, L-2, R-2, L-1, R-1, L-0, and R-0. Each raster cell represents a one-acre square of the land surface and contains values for either oil yield in barrels per acre, oil yield in gallons per ton, or isopach thickness in feet, as defined by the grid name suffix: _b (barrels per acre), _g (gallons per ton), and _i (isopach thickness), where the prefix of each grid name is the name of the oil shale zone.
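    The naming convention above can be enumerated programmatically. Note that the exact normalization of zone names (lowercasing, stripping spaces and hyphens) is an assumption here, not documented in the release.

```python
# The seventeen oil shale zones, in the descending order given above
zones = ["Bed 44", "A Groove", "Mahogany Zone", "B Groove",
         "R-6", "L-5", "R-5", "L-4", "R-4", "L-3", "R-3",
         "L-2", "R-2", "L-1", "R-1", "L-0", "R-0"]

suffixes = {"_b": "barrels per acre",
            "_g": "gallons per ton",
            "_i": "isopach thickness (ft)"}

def grid_names(zone):
    # Assumed normalization: lowercase, drop spaces and hyphens
    base = zone.lower().replace(" ", "").replace("-", "")
    return [base + s for s in suffixes]

names = grid_names("Mahogany Zone")
# → ['mahoganyzone_b', 'mahoganyzone_g', 'mahoganyzone_i']
```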
