https://brightdata.com/license
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 5,000 custom-labeled text samples (2,500 human-written, 2,500 AI-generated) designed for binary classification of human vs AI content. Text was preprocessed using TF-IDF and used to train multiple ML classifiers (LogReg, SVC, NB, RF) with high accuracy. The dataset is balanced, ready-to-use, and ideal for text classification, model explainability, or ethical AI applications.
| File Name | Description |
|---|---|
| your_dataset_5000.csv | 5,000 labeled text samples: 2,500 human, 2,500 AI |
| text_classifier_5000.joblib | Serialized trained classifier model (LogReg, top performer) |
| Human vs AI Custom Dataset.ipynb | Main notebook: preprocessing, modeling, evaluation |
| README.md | Overview and usage instructions for the dataset |
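A minimal sketch of the pipeline the card describes (TF-IDF features feeding a logistic regression classifier). This is not the notebook's code; the toy texts below stand in for `your_dataset_5000.csv`, whose column names are not stated on the card:

```python
# Hypothetical sketch of a TF-IDF + LogReg human-vs-AI text classifier.
# Toy samples below stand in for your_dataset_5000.csv (0 = human, 1 = AI).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "I walked to the shop and it started raining halfway there.",
    "My grandmother's recipe never quite works when I try it.",
    "As an AI language model, I can assist with a wide range of tasks.",
    "In conclusion, leveraging synergies optimizes holistic outcomes.",
]
labels = [0, 0, 1, 1]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)
print(clf.predict(["As an AI language model, I cannot do that."]))
```

The shipped `text_classifier_5000.joblib` would presumably be loaded with `joblib.load(...)` instead of retraining.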
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Pegion Model V.2 is a dataset for object detection tasks - it contains Pegion annotations for 998 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Software: Model simulations were conducted using WRF version 3.8.1 (available at https://github.com/NCAR/WRFV3) and CMAQ version 5.2.1 (available at https://github.com/USEPA/CMAQ). The meteorological and concentration fields created using these models are too large to archive on ScienceHub (approximately 1 TB) and are archived on EPA's high performance computing archival system (ASM) at /asm/MOD3APP/pcc/02.NOAH.v.CLM.v.PX/.

Figures: Figures 1-6 and Figure 8 were created using NCAR Command Language (NCL) scripts (https://www.ncl.ucar.edu/get_started.shtml). NCL code can be downloaded from the NCAR website (https://www.ncl.ucar.edu/Download/) at no cost. The data used for these figures are archived on EPA's ASM system and are available upon request. Figures 7, 8b-c, 8e-f, 8h-i, and 9 were created using the AMET utility developed by U.S. EPA/ORD. AMET can be freely downloaded and used at https://github.com/USEPA/AMET. The modeled data paired in space and time provided in this archive can be used to recreate these figures. The data contained in the compressed zip files are organized in comma-delimited files with descriptive headers or space-delimited files that match tabular data in the manuscript. The data dictionary provides additional information about the files and their contents.

This dataset is associated with the following publication: Campbell, P., J. Bash, and T. Spero. Updates to the Noah Land Surface Model in WRF‐CMAQ to Improve Simulated Meteorology, Air Quality, and Deposition. Journal of Advances in Modeling Earth Systems. John Wiley & Sons, Inc., Hoboken, NJ, USA, 11(1): 231-256, (2019).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
CAR OR NOT CAR MODEL is a dataset for object detection tasks - it contains Car Notcar annotations for 2,849 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
AI vs Deepfake vs Real
AI vs Deepfake vs Real is a dataset designed for image classification, distinguishing between artificial, deepfake, and real images. This dataset includes a diverse collection of high-quality images to enhance classification accuracy and improve the model’s overall efficiency. By providing a well-balanced dataset, it aims to support the development of more robust AI-generated and deepfake detection models.
Label Mappings
Mapping of IDs to… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/AI-vs-Deepfake-vs-Real.
This file contains the data set used to develop a random forest model to predict background specific conductivity for stream segments in the contiguous United States. This Excel-readable file contains 56 columns of parameters evaluated during development. The data dictionary provides the definitions of the abbreviations and the measurement units. Each row is a unique sample described as R** which indicates the NHD Hydrologic Unit (underscore), up to a 7-digit COMID, (underscore) sequential sample month. To develop models that make stream-specific predictions across the contiguous United States, we used the StreamCat data set and process (Hill et al. 2016; https://github.com/USEPA/StreamCat). The StreamCat data set is based on a network of stream segments from NHD+ (McKay et al. 2012). These stream segments drain an average area of 3.1 km², and thus define the spatial grain size of this data set. The data set consists of minimally disturbed sites representing the natural variation in environmental conditions that occur in the contiguous 48 United States. More than 2.4 million SC observations were obtained from STORET (USEPA 2016b), state natural resource agencies, the U.S. Geological Survey (USGS) National Water Information System (NWIS) (USGS 2016), and data used in Olson and Hawkins (2012) (Table S1). Data include observations made between 1 January 2001 and 31 December 2015, coincident with Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (https://modis.gsfc.nasa.gov/data/). Each observation was related to the nearest stream segment in the NHD+. Data were limited to one observation per stream segment per month. SC observations with ambiguous locations and repeat measurements along a stream segment in the same month were discarded. Using estimates of anthropogenic stress derived from the StreamCat database (Hill et al. 2016), segments were selected with minimal amounts of human activity (Stoddard et al. 2006) using criteria developed for each Level II Ecoregion (Omernik and Griffith 2014). Segments were considered as potentially minimally stressed where watersheds had 0 - 0.5% impervious surface, 0 - 5% urban, 0 - 10% agriculture, and population densities from 0.8 - 30 people/km² (Table S3). Watersheds with observations with large residuals in initial models were identified and inspected for evidence of other human activities not represented in StreamCat (e.g., mining, logging, grazing, or oil/gas extraction). Observations were removed from disturbed watersheds and from watersheds with a tidal influence or unusual geologic conditions such as hot springs. About 5% of SC observations in each National Rivers and Stream Assessment (NRSA) region were then randomly selected as independent validation data. The remaining observations became the large training data set for model calibration. This dataset is associated with the following publication: Olson, J., and S. Cormier. Modeling spatial and temporal variation in natural background specific conductivity. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 53(8): 4316-4325, (2019).
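The modeling setup described above (fit a random forest on watershed predictors, hold out roughly 5% for independent validation) can be sketched as follows. This is an illustration on synthetic data, not the authors' code; the predictors and the linear signal are invented:

```python
# Illustrative sketch: random forest prediction of background specific
# conductivity (SC) with a ~5% held-out validation split, as described above.
# Features and target here are synthetic stand-ins, not StreamCat data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(size=(n, 4))  # e.g. precipitation, geology, soil, temperature (hypothetical)
y = 200 * X[:, 0] + 50 * X[:, 1] + rng.normal(scale=5, size=n)  # synthetic SC (uS/cm)

# ~5% independent validation, mirroring the NRSA-region holdout
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.05, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print(round(rf.score(X_val, y_val), 3))  # R^2 on the held-out 5%
```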
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets and the pre-trained model associated with GraphaRNA, a diffusion-based graph neural network for RNA 3D structure prediction. The data is organized into multiple files, each providing key resources for training, validation, and testing the model, as well as a pre-trained model ready for inference.
- rRNA_tRNA.tar.gz
- non_rRNA_tRNA.tar.gz
- train-pkl.tar.gz: contains data that can be used to retrain the GraphaRNA model from scratch.
- val-pkl.tar.gz: can be used to validate the model during or after training.
- test-pkl.tar.gz: can be used to evaluate the model's performance on RNA types that it wasn't trained on (non-rRNA and non-tRNA).
- model_epoch_800.tar.gz: a pre-trained model, ready for inference on new RNA sequences; can be used to run inference on new RNA data or to reproduce results from the associated paper.

If you use this dataset or the pre-trained model in your research, please cite the associated paper (linked here once published).
Polygons: 34814 Vertices: 19011
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Antibacterial drugs (AD) change the metabolic status of bacteria, contributing to bacterial death. However, antibiotic resistance and the emergence of multidrug-resistant bacteria increase interest in understanding metabolic network (MN) mutations and the interaction of AD vs MN. In this study, we employed the IFPTML = Information Fusion (IF) + Perturbation Theory (PT) + Machine Learning (ML) algorithm on a huge dataset from the ChEMBL database, which contains 155,000 AD assays vs >40 MNs of multiple bacteria species. We built a linear discriminant analysis (LDA) and 17 ML models centered on the linear index and based on atoms to predict antibacterial compounds. The IFPTML-LDA model presented the following results for the training subset: specificity (Sp) = 76% out of 70,000 cases, sensitivity (Sn) = 70%, and accuracy (Acc) = 73%. The same model presented the following results for the validation subsets: Sp = 76%, Sn = 70%, and Acc = 73.1%. Among the IFPTML nonlinear models, the k-nearest neighbors (KNN) model showed the best results, with Sn = 99.2%, Sp = 95.5%, Acc = 97.4%, and area under the receiver operating characteristic curve (AUROC) = 0.998 in training sets. In the validation series, the random forest model had the best results: Sn = 93.96% and Sp = 87.02% (AUROC = 0.945). The IFPTML linear and nonlinear models for ADs vs MNs have good statistical parameters, and they could contribute toward finding new metabolic mutations in antibiotic resistance and reducing time/costs in antibacterial drug research.
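For readers unfamiliar with the reported statistics, sensitivity, specificity, and accuracy all derive from the confusion matrix. The numbers below are illustrative, not from the paper:

```python
# How Sn, Sp, and Acc relate to a binary confusion matrix (toy labels).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                    # Sn: actives correctly flagged
specificity = tn / (tn + fp)                    # Sp: inactives correctly rejected
accuracy = (tp + tn) / (tp + tn + fp + fn)      # Acc: overall agreement
print(sensitivity, specificity, accuracy)       # 0.75 0.833... 0.8
```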
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.
The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:
After filtering, each document was turned into a list of individual words (or tokens), which were then collected and saved (using the Python pickle format) into the file scied_words_bigrams_V5.pkl.
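A minimal LDA sketch in the spirit of the included notebook (this is not the notebook's code): fit a topic model to pre-tokenized documents like those stored in scied_words_bigrams_V5.pkl. The toy token lists are invented:

```python
# Topic modeling with latent Dirichlet allocation on tokenized documents.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the pickled token lists
docs = [
    ["physics", "laboratory", "experiment", "measurement"],
    ["curriculum", "teacher", "classroom", "instruction"],
    ["experiment", "laboratory", "physics", "apparatus"],
    ["teacher", "student", "classroom", "curriculum"],
]
texts = [" ".join(d) for d in docs]  # rejoin tokens for CountVectorizer

dtm = CountVectorizer().fit_transform(texts)          # document-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
doc_topics = lda.transform(dtm)                       # one topic mix per document
print(doc_topics.shape)
```

Tracking `doc_topics` against publication year is what lets one chart the rise and fall of topics over the journal's history.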
In addition to this file, we have also included the following files:
This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Buena Vs Mala is a dataset for object detection tasks - it contains Manzanas annotations for 380 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments.
As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), index models needed to be developed to map values in geoscientific exploration datasets to favorability index values. This GDR submission includes those index models.
Index models were created by binning values in exploration datasets into chunks based on their favorability, and then applying a number between 0 and 5 to each chunk, where 0 represents very unfavorable data values and 5 represents very favorable data values. To account for differences in how exploration methods are used to detect each play component, separate index models are produced for each exploration method for each component of each play type.
Index models were created using histograms of the distributions of each exploration dataset in combination with literature and input from experts about what combinations of geophysical, geological, and geochemical signatures are considered favorable at Newberry. This is an attempt to create similarly sized bins based on the current understanding of how different anomalies map to favorable areas for the different types of geothermal plays (i.e., conventional hydrothermal, superhot EGS, and supercritical). For example, an area of partial melt would likely appear as an area of low density, high conductivity, low vp, and high vp/vs, so these target anomalies would be given high (4 or 5) index values for the purpose of imaging the heat source.
Index models were produced for the following datasets:
- Geologic model
- Alteration model
- vp/vs
- vp
- vs
- Temperature model
- Seismicity (density*magnitude)
- Density
- Resistivity
- Fault distance
- Earthquake cutoff depth model
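The binning described above (chunk a dataset's values by favorability, then assign each chunk an index from 0 to 5) can be sketched with `numpy.digitize`. The bin edges and index assignments below are invented for illustration; the actual DEEPEN bins come from histograms and expert input:

```python
# Hedged sketch of an index model: map raw exploration values to 0-5
# favorability indices via value bins. Edges/indices here are hypothetical.
import numpy as np

def favorability_index(values, bin_edges, index_values):
    """Assign each value a 0-5 index according to which bin it falls in."""
    assert len(index_values) == len(bin_edges) + 1
    bins = np.digitize(values, bin_edges)        # bin number per value
    return np.asarray(index_values)[bins]        # bin number -> favorability

# Example: resistivity (ohm-m); low resistivity taken as favorable for
# imaging partial melt, per the discussion above.
resistivity = np.array([2.0, 15.0, 80.0, 400.0])
edges = [10.0, 50.0, 200.0]    # hypothetical bin boundaries
indices = [5, 3, 1, 0]         # favorability per bin, most to least favorable
print(favorability_index(resistivity, edges, indices))  # [5 3 1 0]
```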
The data we used for this study include species occurrence data (n=15 species), climate data and predictions, an expert opinion questionnaire, and species masks that represented the model domain for each species. For this data release, we include the results of the expert opinion questionnaire and the species model domains (or masks). We developed an expert opinion questionnaire to gather information regarding the importance of climate variables in determining a species' geographic range. The species masks, or model domains, were defined separately for each species using a variation of the “target-group” approach (Phillips et al. 2009), where the domain was determined using convex polygons including occurrence data for at least three phylogenetically related and similar species (Watling et al. 2012). The species occurrence data, climate data, and climate predictions are freely available online, and therefore not included in this data release. The species occurrence data were obtained primarily from the online database Global Biodiversity Information Facility (GBIF; http://www.gbif.org/) and from scientific literature (Watling et al. 2011). Climate data were obtained from the WorldClim database (Hijmans et al. 2005) and climate predictions were obtained from the Center for Ocean-Atmosphere Prediction Studies (COAPS) at Florida State University (https://floridaclimateinstitute.org/resources/data-sets/regional-downscaling). See metadata for references.
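The "target-group" model domain described above amounts to a convex polygon around the pooled occurrence points of several related species. A sketch with SciPy, using made-up coordinates:

```python
# Sketch of a species mask as a convex polygon around pooled occurrence
# points of >= 3 related species (coordinates are invented).
import numpy as np
from scipy.spatial import ConvexHull

points = np.array([           # pooled occurrences as (lon, lat)
    [-81.0, 25.0], [-80.5, 26.5], [-82.0, 27.0],
    [-83.5, 25.5], [-81.5, 28.0], [-82.5, 26.0],
])
hull = ConvexHull(points)
polygon = points[hull.vertices]  # mask vertices, in counterclockwise order
print(len(polygon), "vertices")
```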
https://choosealicense.com/licenses/cdla-permissive-2.0/
Rapidata Image Generation Coherence Dataset
This dataset was collected in ~4 days using the Rapidata Python API, which is accessible to anyone and ideal for large-scale data annotation. Explore our latest model rankings on our website. If you get value from this dataset and would like to see more in the future, please consider liking it.
Overview
One of the largest human annotated coherence datasets for text-to-image models, this release contains over 1,200,000 human… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/human-coherence-preferences-images.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The development of machine-learning models for atomic-scale simulations has benefitted tremendously from the large databases of materials and molecular properties computed in the past two decades using electronic-structure calculations. More recently, these databases have made it possible to train “universal” models that aim at making accurate predictions for arbitrary atomic geometries and compositions. The construction of many of these databases was however in itself aimed at materials discovery, and therefore targeted primarily to sample stable, or at least plausible, structures and to make the most accurate predictions for each compound - e.g. adjusting the calculation details to the material at hand. Here we introduce a dataset designed specifically to train models that can provide reasonable predictions for arbitrary structures, and that therefore follows a different philosophy. Starting from relatively small sets of stable structures, the dataset is built to contain “massive atomic diversity” (MAD) by aggressively distorting these configurations, with near-complete disregard for the stability of the resulting configurations. The electronic structure details, on the other hand, are chosen to maximize consistency rather than to obtain the most accurate prediction for a given structure, or to minimize computational effort. The MAD dataset we present here, despite containing fewer than 100k structures, has already been shown to enable training universal interatomic potentials that are competitive with models trained on traditional datasets with two to three orders of magnitude more structures. We describe in detail the philosophy and details of the construction of the MAD dataset. We also introduce a low-dimensional structural latent space that allows us to compare it with other popular datasets, and that can also be used as a general-purpose materials cartography tool.
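As a toy illustration of the "aggressive distortion" idea (my own sketch, not the MAD construction recipe), one can generate diverse configurations by applying large random displacements to a stable structure's atomic positions, ignoring whether the result is physically stable:

```python
# Toy illustration: generate diverse frames by aggressively distorting a
# stable structure's atom positions (idealized 4-atom cell, angstrom units).
import numpy as np

rng = np.random.default_rng(42)
stable = np.array([[0.0, 0.0, 0.0],
                   [0.0, 1.8, 1.8],
                   [1.8, 0.0, 1.8],
                   [1.8, 1.8, 0.0]])

def distort(positions, scale):
    """Random displacement of every atom; a large scale disregards stability."""
    return positions + rng.normal(scale=scale, size=positions.shape)

frames = [distort(stable, scale=0.5) for _ in range(10)]
print(len(frames), frames[0].shape)
```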
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset to train models to detect forest fires or other fire-related incidents. It has a folder "fire" with 5,853 images of fire occurring in many different situations, and a folder "not_fire" with 9,755 common images: urban spaces, forests, deserts, rivers, oceans, animals, people, and all sorts of things.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prompting4Debugging Dataset
This dataset contains prompts designed to evaluate and challenge the safety mechanisms of generative text-to-image models, with a particular focus on identifying prompts that are likely to produce images containing nudity. Introduced in the 2024 ICML paper Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts, this dataset is not specific to any single approach or model but is intended to test various mitigating… See the full description on the dataset page: https://huggingface.co/datasets/joycenerd/p4d.
Motivation
This dataset is derived and cleaned from the full PULSE project dataset to share with others data gathered about the users during the project.
Disclaimer
Any third party needs to respect ethics rules and the GDPR and must mention “PULSE DATA H2020 - 727816” in any dissemination activities related to the data being exploited. You should also provide a link to the project website: http://www.project-pulse.eu/
The data in these files is provided as is. Despite our best efforts at filtering out potential issues, some information could be erroneous.
Description of the dataset
The only difference with the original dataset comes from anonymised user information.
The dataset content is described in a dedicated JSON file:
{
"citizen_id": "pseudonymized unique key of each citizen user in the PULSE system",
"city_code": {
"description": "3-letter city codes taken by convention from the IATA codebook of airports and metropolitan areas, as the codebook of global cities in most common and widespread use and therefore adopted as the standard in PULSE (since there is currently - in the year 2020 - still no relevant ISO or other standardized codebook of cities uniformly globally adopted and used). The exception is Pavia, which does not have its own airport, and the nearby Milan/Bergamo airports are not applicable, so the 'PAI' internal code (not existing in original IATA codes) has been devised in PULSE. For cities with multiple airports, IATA metropolitan area codes are used (New York, Paris).",
"BCN": "Barcelona",
"BHX": "Birmingham",
"NYC": "New York",
"PAI": "Pavia",
"PAR": "Paris",
"SIN": "Singapore",
"TPE": "Keelung (Taipei)"
},
"zip_code": "Zip or postal code (area) within a city, basic default granular territorial/administrative subdivision unit for localization of citizen users by place of residence (in all PULSE cities)",
"models": {
"asthma_risk_score": "PULSE asthma risk consensus model score, decimal value ranging from 0 to 1",
"asthma_risk_score_category": {
"description": "Categorized value of the PULSE asthma risk consensus model score, with the following possible category options:",
"low": "low asthma risk, score value below 0.05",
"medium-low": "medium-low asthma risk, score value from 0.05 and below 0.1",
"medium": "medium asthma risk, score value from 0.1 and below 0.15",
"medium-high": "medium-high asthma risk, score value from 0.15 and below 0.2",
"high": "high asthma risk, score value from 0.2 and higher"
},
"T2D_risk_score": "PULSE diabetes type 2 (T2D) risk consensus model score, decimal value ranging from 0 to 1",
"T2D_risk_score_category": {
"description": "Categorized value of the PULSE diabetes type 2 risk consensus model score, with the following possible category options:",
"low": "low T2D risk, score value below 0.05",
"medium-low": "medium-low T2D risk, score value from 0.05 and below 0.1",
"medium": "medium T2D risk, score value from 0.1 and below 0.15",
"medium-high": "medium-high T2D risk, score value from 0.15 and below 0.2",
"high": "high T2D risk, score value from 0.2 and below 0.25",
"very_high": "very high T2D risk, score value from 0.25 and higher"
},
"well-being_score": "PULSE well-being model score, decimal value ranging from -5 to 5",
"well-being_score_category": {
"description": "Categorized value of the PULSE well-being model score, with the following possible category options:",
"low": "low well-being, score value below -0.37",
"medium-low": "medium-low well-being, score value from -0.37 and below 0.04",
"medium-high": "medium-high well-being, score value from 0.04 and below 0.36",
"high": "high well-being, score value from 0.36 and higher"
},
"computed_time": "Timestamp (UTC) when each relevant model score value/result had been computed or derived"
}
}
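A small helper showing how the categorized scores in the data dictionary relate to the raw 0-1 scores. The thresholds are taken from the T2D entry above; the function itself is my own sketch, not PULSE project code:

```python
# Map a 0-1 T2D risk score to its PULSE category, per the data dictionary.
def t2d_risk_category(score):
    if score < 0.05:
        return "low"
    if score < 0.10:
        return "medium-low"
    if score < 0.15:
        return "medium"
    if score < 0.20:
        return "medium-high"
    if score < 0.25:
        return "high"
    return "very_high"

print(t2d_risk_category(0.12))  # -> medium
```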
ESRI GRID raster datasets were created to display and quantify oil shale resources for seventeen zones in the Piceance Basin, Colorado as part of a 2009 National Oil Shale Assessment. The oil shale zones in descending order are: Bed 44, A Groove, Mahogany Zone, B Groove, R-6, L-5, R-5, L-4, R-4, L-3, R-3, L-2, R-2, L-1, R-1, L-0, and R-0. Each raster cell represents a one-acre square of the land surface and contains values for either oil yield in barrels per acre, oil yield in gallons per ton, or isopach thickness in feet, as indicated by the grid-name suffix: "_b" (barrels per acre), "_g" (gallons per ton), and "_i" (isopach thickness), where the prefix is the name of the oil shale zone.